
Contents

Azure Architecture Center


Architecture icons
Browse all Architectures
Browse Hybrid Cloud Architectures
What's new
Application Architecture Guide
Introduction
Architecture styles
Overview
Big compute
Big data
Event-driven architecture
Microservices
N-tier application
Web-queue-worker
Design principles for Azure applications
Overview
Design for self-healing
Make all things redundant
Minimize coordination
Design to scale out
Partition around limits
Design for operations
Use managed services
Use the best data store for the job
Design for evolution
Build for the needs of business
Technology choices
Choose a compute service
Choose a data store
Understand data store models
Select a data store
Criteria for choosing a data store
Choose a load balancing service
Choose a messaging service
Asynchronous messaging options in Azure
Choose an Apache Kafka host
Best practices for cloud applications
API design
API implementation
Autoscaling
Background jobs
Caching
Content Delivery Network
Data partitioning
Data partitioning strategies (by service)
Message encoding considerations
Monitoring and diagnostics
Retry guidance for specific services
Transient fault handling
Performance tuning
Introduction
Scenario 1 - Distributed transactions
Scenario 2 - Multiple backend services
Scenario 3 - Event streaming
Performance antipatterns
Overview
Busy Database
Busy Front End
Chatty I/O
Extraneous Fetching
Improper Instantiation
Monolithic Persistence
No Caching
Synchronous I/O
Responsible Innovation
Overview
Judgment Call
Harms Modeling
Understand Harm
Assess Types of Harm
Community Jury
Azure for AWS Professionals
Overview
Component information
Accounts
Compute
Databases
Messaging
Networking
Regions and Zones
Resources
Security and Identity
Storage
Service comparison
Azure for GCP Professionals
Overview
Services comparison
Microsoft Azure Well-Architected Framework
Overview
Cost Optimization
About
Principles
Design
Checklist
Cost model
Capture requirements
Azure regions
Azure resources
Governance
Initial estimate
Managed services
Performance and price options
Provision
Checklist
AI + Machine Learning
Big data
Compute
Data stores
Messaging
Networking
Cost for networking services
Web apps
Monitor
Checklist
Budgets and alerts
Reports
Reviews
Optimize
Checklist
Autoscale
Reserved instances
VM instances
Caching
Tradeoffs
Operational Excellence
Overview
Application design
Development
Deployment
Monitoring
Performance
Testing
Checklist
Automation
Automation Overview
Repeatable infrastructure
Configure infrastructure
Automate operational tasks
Performance Efficiency
Overview
Application design
Capacity planning
Load testing
Monitoring
Checklist
Reliability
Overview
Application design
Design overview
Error handling
Failure mode analysis
Backup and recovery
Business metrics
Chaos engineering
Data management
Monitoring and disaster recovery
Recover from a region-wide service disruption
Resiliency testing
Checklist
Security
About
Principles
Design
Governance
Overview
Team roles and responsibilities
Segmentation strategy
Management groups
Administration
Identity
Identity and access management
Best practices
Networking
Network security review
Containment strategies for organizations
Best practices
Storage
Applications and services
Application security considerations
Application classification
Threat analysis
Securing PaaS deployments
Compliance requirements
Configuration and dependencies
Monitor
Security operations
Review and audit
Tools
Network access
Monitor identity risk
DDoS attacks
Optimize
Automate
Replace insecure protocols
Elevate security capabilities
Design Patterns
Overview
Categories
Availability
Data management
Design and implementation
Management and monitoring
Messaging
Performance and scalability
Resiliency
Security
Ambassador
Anti-corruption Layer
Asynchronous Request-Reply
Backends for Frontends
Bulkhead
Cache-Aside
Choreography
Circuit Breaker
Claim Check
Command and Query Responsibility Segregation (CQRS)
Compensating Transaction
Competing Consumers
Compute Resource Consolidation
Deployment Stamps
Event Sourcing
External Configuration Store
Federated Identity
Gatekeeper
Gateway Aggregation
Gateway Offloading
Gateway Routing
Geodes
Health Endpoint Monitoring
Index Table
Leader Election
Materialized View
Pipes and Filters
Priority Queue
Publisher/Subscriber
Queue-Based Load Leveling
Retry
Saga
Scheduler Agent Supervisor
Sequential Convoy
Sharding
Sidecar
Static Content Hosting
Strangler
Throttling
Valet Key
Azure categories
AI + Machine Learning
Overview
Technology guide
Cognitive services
Machine learning
Machine learning at scale
Natural language processing
R developer's guide to Azure
Architectures
AI at the edge
AI enrichment in Cognitive Search
AI for Earth
Auditing and risk management
Autonomous systems
Disconnected AI at the edge
Baseball decision analysis with ML.NET and Blazor
Batch scoring for deep learning
Batch scoring with Python
Batch scoring with Spark on Databricks
Batch scoring with R
Business Process Management
Real-time recommendation API
Chatbot for hotel reservations
E-commerce chatbot
Enterprise chatbot disaster recovery
Enterprise-grade conversational bot
Enterprise productivity chatbot
FAQ chatbot
Content Research
Contract Management
Customer churn prediction
Customer feedback
Defect prevention
Distributed deep learning training
Digital Asset Management
Energy supply optimization
Energy demand forecasting
Image classification
Image classification with CNNs
Information discovery with NLP
Interactive voice response bot
Digital text processing
Machine teaching
MLOps for Python models
MLOps technical paper
Upscale ML lifecycle with MLOps
MLOps maturity model
Azure ML service selection guide
Model training with AKS
Movie recommendations
Marketing optimization
Personalized offers
Personalized marketing solutions
Population health management
Hospital patient predictions
Vehicle telematics
Predictive maintenance
Predictive marketing
Quality assurance
Real-time scoring Python models
Real-time scoring R models
Remote patient monitoring
Retail assistant with visual capabilities
Retail product recommendations
Scalable personalization
Speech services
Speech to text conversion
Training Python models
Vision classifier model
Visual assistant
Deploy AI and machine learning at the edge by using Azure Stack Edge
Analytics
Architectures
Advanced analytics
Anomaly detector process
App integration using Event Grid
Automated enterprise BI
Big data analytics with Azure Data Explorer
Content Delivery Network analytics
Data warehousing and analytics
Demand forecasting
Demand forecasting for marketing
Demand forecasting for price optimization
Demand forecasting for shipping
Discovery Hub for analytics
Hybrid big data with HDInsight
Monitoring solution with Azure Data Explorer
ETL using HDInsight
Interactive analytics with Azure Data Explorer
IoT telemetry analytics with Azure Data Explorer
Interactive price analytics
Mass ingestion of news feeds on Azure
Oil and Gas tank level forecasting
Partitioning in Event Hubs and Kafka
Predicting length of stay in hospitals
Predictive aircraft engine monitoring
Real Time analytics on big data
Stream processing with Azure Databricks
Stream processing with Azure Stream Analytics
Tiering applications & data for analytics
Blockchain
Architectures
Blockchain workflow application
Decentralized trust between banks
Supply chain track and trace
Compute
HPC Overview
Architectures
3D video rendering
Computer-aided engineering
Digital image modeling
HPC risk analysis
HPC and big compute
Cloud-based HPC cluster
Hybrid HPC with HPC Pack
Linux virtual desktops with Citrix
Move Azure resources across regions
Run a Linux VM on Azure
Run a Windows VM on Azure
Solaris emulator on Azure VMs
Migrate IBM applications with TmaxSoft OpenFrame
Reservoir simulations
CFD simulations
Containers
AKS Solution Journey
AKS Baseline Cluster
AKS Cluster Best Practices
AKS Workload Best Practices
AKS day-2 operations guide
Triage practices
Introduction
1- Cluster health
2- Node and pod health
3- Workload deployments
4- Admission controllers
5- Container registry connectivity
Common Issues
AKS Example Solutions
Microservices architecture on AKS
Microservices with AKS and Azure DevOps
Secure DevOps for AKS
Building a telehealth system
CI/CD pipeline for container-based workloads
Databases
Guides
Overview
Relational data
Extract, transform, and load (ETL)
Online analytical processing (OLAP)
Online transaction processing (OLTP)
Data Warehousing
Non-relational data
Non-relational data stores
Free-form text search
Time series data
Working with CSV and JSON files
Big Data
Big Data architectures
Batch processing
Real time processing
Technology choices
Analytical data stores
Analytics and reporting
Batch processing
Data lakes
Data storage
Data store comparison
Pipeline orchestration
Real-time message ingestion
Search data stores
Stream processing
Application tenancy in SaaS Databases
Tenancy models
Monitor Azure Databricks jobs
Overview
Send Databricks application logs to Azure Monitor
Use dashboards to visualize Databricks metrics
Troubleshoot performance bottlenecks
Transfer data to and from Azure
Extend on-premises data solutions to Azure
Securing data solutions
Architectures
Apache Cassandra
Azure data platform
Big data analytics with Azure Data Explorer
Campaign optimization with HDInsight Spark
Campaign optimization with SQL Server
DataOps for modern data warehouse
Data streaming
Data cache
Digital campaign management
Digital marketing using Azure MySQL
Finance management using Azure MySQL
Finance management using Azure PostgreSQL
Gaming using Azure MySQL
Gaming using Cosmos DB
Globally distributed apps using Cosmos DB
Hybrid ETL with Azure Data Factory
Intelligent apps using Azure MySQL
Intelligent apps using Azure PostgreSQL
Interactive querying with HDInsight
Loan charge-off prediction with HDInsight Spark
Loan charge-off prediction with SQL Server
Loan credit risk modeling
Loan credit risk with SQL Server
Messaging
Modern data warehouse
N-tier app with Cassandra
Ops automation using Event Grid
Oracle migration to Azure
Personalization using Cosmos DB
Retail and e-commerce using Azure MySQL
Retail and e-commerce using Azure PostgreSQL
Retail and e-commerce using Cosmos DB
Running Oracle Databases on Azure
Serverless apps using Cosmos DB
Streaming using HDInsight
Windows N-tier applications
Developer Options
Microservices
Overview
Guides
Domain modeling for microservices
Domain analysis
Tactical DDD
Identify microservice boundaries
Design a microservices architecture
Introduction
Choose a compute option
Interservice communication
API design
API gateways
Data considerations
Design patterns for microservices
Operate microservices in production
Monitor microservices in Azure Kubernetes Service (AKS)
CI/CD for microservices
CI/CD for microservices on Kubernetes
Migrate to a microservices architecture
Migrate a monolith application to microservices
Modernize enterprise applications with Service Fabric
Migrate from Cloud Services to Service Fabric
Serverless applications
Serverless Functions overview
Serverless Functions examples
Plan for serverless architecture
Serverless Functions decision and planning
Serverless application assessment
Technical workshops and training
Proof of concept or pilot
Develop and deploy serverless apps
Serverless Functions app development
Serverless Functions code walkthrough
CI/CD for a serverless frontend
Serverless Functions app operations
Serverless Functions app security
Architectures
Big data analytics with Azure Data Explorer
CI/CD pipeline using Azure DevOps
Event-based cloud automation
Microservices on Azure Service Fabric
Multicloud with the Serverless Framework
Serverless applications using Event Grid
Serverless event processing
Unified logging for microservices apps
DevOps
Checklist
Guides
Extending Resource Manager templates
Overview
Update a resource
Conditionally deploy a resource
Use an object as a parameter
Property transformer and collector
Architectures
CI/CD pipeline for chatbots with ARM templates
CI/CD for Azure VMs
CI/CD for Azure Web Apps
CI/CD for Containers
CI/CD using Jenkins and AKS
DevSecOps in Azure
DevSecOps in GitHub
DevTest and DevOps for IaaS
DevTest and DevOps for PaaS
DevTest and DevOps for microservices
DevTest Image Factory
CI/CD using Jenkins and Terraform
Hybrid DevOps
Java CI/CD using Jenkins and Azure Web Apps
Jenkins on Azure
SharePoint for Dev-Test
Real time location sharing
Run containers in a hybrid environment
High Availability
Overview
Architectures
IaaS - Web application with relational database
Hybrid
Guides
FSLogix for the enterprise
Architectures
Manage configurations for Azure Arc enabled servers
Azure Automation in a hybrid environment
Using Azure file shares in a hybrid environment
Azure Functions in a hybrid environment
Back up files and applications on Azure Stack Hub
Azure Automation Update Management
Deploy AI and machine learning at the edge by using Azure Stack Edge
Design a hybrid Domain Name System solution with Azure
Hybrid file services
Hybrid availability and performance monitoring
Hybrid Security Monitoring using Azure Security Center and Azure Sentinel
Manage hybrid Azure workloads using Windows Admin Center
Azure Arc hybrid management and deployment for Kubernetes clusters
Connect standalone servers by using Azure Network Adapter
Disaster Recovery for Azure Stack Hub virtual machines
On-premises data gateway for Azure Logic Apps
Run containers in a hybrid environment
Connect an on-premises network to Azure
Connect an on-premises network to Azure using ExpressRoute
Cross cloud scaling
Cross-platform chat
Extend an on-premises network using ExpressRoute
Extend an on-premises network using VPN
Hybrid connections
Troubleshoot a hybrid VPN connection
Connect using Windows Virtual Desktop
Windows Virtual Desktop for enterprises
Multiple Active Directory forests
Multiple forests with Azure AD DS
Identity
Guides
Identity in multitenant applications
Introduction
The Tailspin scenario
Authentication
Claims-based identity
Tenant sign-up
Application roles
Authorization
Secure a web API
Cache access tokens
Client assertion
Federate with a customer's AD FS
Architectures
AD DS resource forests in Azure
Deploy AD DS in an Azure virtual network
Extend on-premises AD FS to Azure
Hybrid Identity
Integrate on-premises AD domains with Azure AD
Integrate on-premises AD with Azure
Integration
Architectures
Basic enterprise integration on Azure
Enterprise business intelligence
Enterprise integration using queues and events
Publishing internal APIs to external users
Web and Mobile front-ends
Custom Business Processes
Line of Business Extension
On-premises data gateway for Azure Logic Apps
Internet of Things
Guides
Azure IoT Edge Vision
Overview
Camera selection
Hardware acceleration
Machine learning
Image storage
Alert persistence
User interface
Architectures
IoT reference architecture
Condition Monitoring
IoT and data analytics
IoT using Cosmos DB
Predictive Maintenance for Industrial IoT
IoT Connected Platform
Contactless IoT interfaces
COVID-19 IoT Safe Solutions
Lighting and disinfection system
Light and power for emerging markets
Predictive maintenance with IoT
Process real-time vehicle data using IoT
Safe Buildings with IoT and Azure
Secure access to IoT apps with Azure AD
Voice assistants and IoT devices
IoT using Azure Data Explorer
Buy online, pickup in store
Industrial IoT Analytics
Architecture
Recommended services
Data visualization
Considerations
IoT concepts
Introduction to IoT solutions
Devices, platform, and applications
Attestation, authentication, and provisioning
Field and cloud edge gateways
Application-to-device commands
Builders, developers, and operators
IoT patterns
Measure and control loop
Monitor and manage loop
Analyze and optimize loop
Solution scaling with application stamps
Management and Governance
Architectures
Archive on-premises data to cloud
Back up cloud applications
Back up on-premises applications
Centralize app configuration and security
Computer forensics
High availability for BCDR
Data Sovereignty & Data Gravity
Enterprise-scale disaster recovery
Updating Windows VMs in Azure
Highly available SharePoint Server 2016
SMB disaster recovery with Azure Site Recovery
SMB disaster recovery with Double-Take DR
Manage configurations for Azure Arc enabled servers
Azure Automation in a hybrid environment
Back up files and applications on Azure Stack Hub
Azure Automation Update Management
Hybrid availability and performance monitoring
Manage hybrid Azure workloads using Windows Admin Center
Azure Arc hybrid management and deployment for Kubernetes clusters
Disaster Recovery for Azure Stack Hub virtual machines
Media
Architectures
Instant broadcasting with serverless
Live streaming digital media
Video-on-demand digital media
Migration
Architectures
Adding modern front-ends to legacy apps
Banking system
Banking cloud transformation
Patterns and implementations
JMeter implementation reference
Lift and shift LOB apps
Lift and shift with AKS
Oracle database migration
Migration decision process
Cross-cloud connectivity
Lift and shift to Azure VMs
Refactor
Rearchitect
Serverless computing LOB apps
Unlock Legacy Data with Azure Stack
Decompose apps with Service Fabric
SQL 2008 R2 failover cluster in Azure
Mixed Reality
Architectures
Mixed reality design reviews
Facilities management with mixed reality
Training powered by mixed reality
Mobile
Architectures
Custom mobile workforce app
Scalable apps with Azure MySQL
Scalable apps using Azure PostgreSQL
Social app for mobile and web with authentication
Task-based consumer mobile app
Networking
Guides
Add IP spaces to peered virtual networks
Azure Firewall Architecture Guide
Architectures
Virtual network peering and VPN gateways
Deploy highly available NVAs
High availability for IaaS apps
Hub-spoke network topology in Azure
Implement a secure hybrid network
Segmenting Virtual Networks
Azure Automation Update Management
Design a hybrid Domain Name System solution with Azure
Hybrid availability and performance monitoring
Connect standalone servers by using Azure Network Adapter
SAP
Overview
Architectures
SAP HANA on Azure (Large Instances)
SAP HANA Scale-up on Linux
SAP NetWeaver on Windows on Azure
SAP S/4HANA in Linux on Azure
SAP BW/4HANA in Linux on Azure
SAP NetWeaver on SQL Server
SAP deployment using an Oracle DB
Dev/test for SAP
Security
Architectures
Azure AD in Security Operations
Cyber threat intelligence
Highly-secure IaaS apps
Homomorphic encryption with SEAL
Real-time fraud detection
Secure OBO refresh tokens
Securely managed web apps
Web app private database connectivity
Virtual network integrated microservices
Virtual Network security options
Hybrid Security Monitoring using Azure Security Center and Azure Sentinel
Storage
Architectures
Media rendering
Medical data storage
HIPAA/HITRUST Health Data and AI
Using Azure file shares in a hybrid environment
Hybrid file services
Web
Architectures
Basic web application
Deployment in App Service Environments
Standard deployment
High availability deployment
E-commerce front end
E-commerce website running in ASE
Highly available SharePoint farm
Highly available multi-region web application
Hybrid SharePoint farm with Microsoft 365
Intelligent product search engine for e-commerce
Magento e-commerce in AKS
Migrate a web app using Azure APIM
Multi-region N-tier application
Multitenant SaaS
Multi-tier web application built for HA/DR
SAP S/4 HANA for Large Instances
Scalable e-commerce web app
Scalable Episerver marketing website
Scalable Sitecore marketing website
Scalable Umbraco CMS web app
Scalable and secure WordPress on Azure
Scalable order processing
Scalable web app
More Scalable web apps
Serverless web app
Simple branded website
Simple digital marketing website
Web app monitoring on Azure
Web and mobile applications with MySQL, Cosmos DB, and Redis
Dynamics Business Central as a Service on Azure
Azure Functions in a hybrid environment
Cloud Adoption Framework
Azure architecture icons
11/13/2020 • 2 minutes to read

Helping our customers design and architect new solutions is core to the Azure Architecture Center’s mission.
Architecture diagrams like those included in our guidance can help communicate design decisions and the
relationships between components of a given workload. On this page you will find an official collection of Azure
architecture icons including Azure product icons to help you build a custom architecture diagram for your next
solution.
Do’s
Use the icon to illustrate how products can work together
In diagrams, we recommend including the product name somewhere close to the icon
Use the icons as they would appear within Azure
Don’ts
Don’t crop, flip or rotate icons
Don’t distort or change icon shape in any way
Don’t use Microsoft product icons to represent your product or service

Example architecture diagram

Browse all Azure architectures to view other examples.

Icon updates
As of November 2020, the folder structure of our collection of Azure architecture icons has changed. The FAQs and
Terms of Use PDF files appear in the first level when you download the SVG icons below. The files in the icons
folder are the same except there is no longer a CXP folder. If you encounter any issues, let us know.

Terms
Microsoft permits the use of these icons in architectural diagrams, training materials, or documentation. You may
copy, distribute, and display the icons only for the permitted use unless granted explicit permission by Microsoft.
Microsoft reserves all other rights.
Architecture diagrams, reference architectures, example scenarios, and solutions for common workloads on Azure.
AI + Machine Learning | Analytics | Blockchain | Compute | Containers | Databases | DevOps | Developer Tools | Identity |
Integration | Internet of Things | Management and Governance | Media | Migration | Mixed Reality | Mobile | Networking | Security |
Storage | Web | Windows Virtual Desktop

AI + Machine Learning

AI at the Edge with Azure Stack Hub


9/01/2020
2 min read
Move AI models to the edge with a solution architecture that includes Azure Stack Hub.

AI at the Edge with Azure Stack Hub - disconnected


9/01/2020
2 min read
Move AI models to the edge with a solution architecture that includes Azure Stack. A step-by-step workflow will help you
harness the power of edge AI when disconnected from the internet.

AI enrichment with Azure Cognitive Search


6/01/2020
4 min read
Learn how to use Azure Cognitive Search pre-built skills and custom extensibility to enrich large unstructured data sets into
indexable structured data.

AI for Earth
6/23/2020
7 min read
Learn how to use the AI for Earth APIs to help build solutions for environmental problems. See other AI for Earth projects
and initiatives.

Auditing, risk, and compliance management


12/16/2019
2 min read
Developers could use knowledge mining to help attorneys quickly identify entities of importance from discovery
documents and flag important ideas across documents
Batch scoring for deep learning models
2/06/2019
7 min read
This reference architecture shows how to apply neural style transfer to a video, using Azure Machine Learning.

Batch scoring of Python models on Azure


1/30/2019
5 min read
Build a scalable solution for batch scoring models on a schedule in parallel using Azure Machine Learning.
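
The reference architecture schedules and fans out the scoring work with Azure Machine Learning pipelines; the sketch below only illustrates the parallel fan-out idea locally, with a placeholder score_batch function standing in for a real model's predict call.

    # Conceptual sketch of parallel batch scoring (the reference architecture uses
    # Azure Machine Learning pipelines; this only shows the fan-out idea locally).
    from concurrent.futures import ProcessPoolExecutor

    def score_batch(batch):
        # Placeholder scoring function: replace with model.predict(batch).
        return [x * 2 for x in batch]

    def main():
        batches = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # stand-in for input partitions
        with ProcessPoolExecutor(max_workers=3) as pool:
            results = list(pool.map(score_batch, batches))
        print(results)

    if __name__ == "__main__":
        main()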

Batch scoring of Spark models on Azure Databricks


11/20/2019
5 min read
Build a scalable solution for batch scoring an Apache Spark classification model on a schedule using Azure Databricks.

Batch scoring with R Models on Azure


3/29/2019
5 min read
Perform batch scoring with R models using Azure Batch and a data set based on retail store sales forecasting.

Build a Real-time Recommendation API on Azure


12/12/2018
6 min read
Use machine learning to automate recommendations using Azure Databricks and Azure Data Science Virtual Machines
(DSVM) to train a model on Azure.

Build an enterprise-grade conversational bot


1/24/2019
13 min read
How to build an enterprise-grade conversational bot (chatbot) using the Azure Bot Framework.

Build web and mobile applications


6/24/2020
2 min read
The reference architecture below is inspired by HSBC's digital payment platform, PayMe for Business.

Business Process Management


12/16/2019
2 min read
In industries where bidding competition is fierce, or when the diagnosis of a problem must be quick or in near real-time,
companies can use knowledge mining to avoid costly mistakes

Commerce Chatbot
12/16/2019
1 min read
Together, the Azure Bot Service and Language Understanding service enable developers to create conversational interfaces
for various scenarios like banking, travel, and entertainment. For example, a hotel's concierge can use a bot to enhance
traditional e-mail and phone call interactions by validating a customer via Azure Active Directory and using Cognitive
Services to better contextually process customer requests using text and voice. The Speech recognition service can be
added to support voice commands.

Content Research
12/16/2019
2 min read
Knowledge mining with a search index makes it easy for customers and employees to locate what they are looking for
faster.

Contract Management
12/16/2019
2 min read
Knowledge mining can help organizations to scour thousands of pages of sources to create an accurate bid.

Customer Churn Prediction


12/16/2019
2 min read
Customer Churn Prediction uses Cortana Intelligence Suite components to predict churn probability and helps find patterns
in existing data associated with the predicted churn rate.

Customer Feedback and Analytics


12/16/2019
2 min read
Knowledge mining can help customer support teams quickly find the right answer for a customer inquiry or assess
customer sentiment at scale.

Defect prevention with predictive maintenance


12/16/2019
2 min read
Learn how to use Azure Machine Learning to predict failures before they happen with real-time assembly line data.

Deploy AI and machine learning at the edge by using Azure Stack Edge
7/16/2020
5 min read
Run intelligence workloads directly on IoT devices by using Azure Stack Edge

Digital Asset Management


12/16/2019
2 min read
Knowledge mining through a search index makes it easy for end customers and employees to locate what they are looking
for faster.

Distributed training of deep learning models on Azure


6/05/2019
7 min read
This reference architecture shows how to conduct distributed training of deep learning models across clusters of GPU-
enabled VMs using Azure Machine Learning.
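
The reference architecture distributes training across GPU VM clusters with Azure Machine Learning; the sketch below shows only the core data-parallel pattern with PyTorch DistributedDataParallel on toy data, assuming it is launched with torchrun so the process-group environment variables are already set.

    # Minimal data-parallel training sketch (assumption: launched with
    # `torchrun --nproc_per_node=2 train.py`, which sets the env vars that
    # init_process_group() reads).
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        dist.init_process_group(backend="gloo")       # "nccl" on GPU clusters
        model = torch.nn.Linear(10, 1)
        ddp_model = DDP(model)                        # gradients are all-reduced across workers
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
        loss_fn = torch.nn.MSELoss()

        for _ in range(5):                            # toy training loop on random data
            x, y = torch.randn(32, 10), torch.randn(32, 1)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()
            optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()
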
Energy Supply Optimization
12/16/2019
3 min read
This solution provides an Azure-based smart solution, leveraging external open-source tools, that determines the optimal
energy unit commitments from various types of energy resources for an energy grid.

Enterprise Productivity Chatbot


12/16/2019
1 min read
Azure Bot Service can be easily combined with Language Understanding to build powerful enterprise productivity bots,
allowing organizations to streamline common work activities by integrating external systems, such as Microsoft 365
calendar, customer cases stored in Dynamics CRM and much more.

Enterprise chatbot disaster recovery


6/22/2020
3 min read
Learn how to set up disaster recovery and high availability for Azure services in an enterprise-grade conversational bot.

FAQ Chatbot with data champion model


12/16/2019
1 min read
The QnA Maker tool makes it easy for content owners to maintain their knowledge base of QnAs. Combined with
Bot Service and LUIS, it's easy to set up an FAQ chatbot that responds from different knowledge bases depending on the
intent of the query.

Forecast Energy and Power Demand


12/16/2019
2 min read
Learn how Microsoft Azure can help accurately forecast spikes in demand for energy products and services to give your
company a competitive advantage.

Image classification on Azure


7/05/2018
4 min read
Learn how to build image processing into your applications by using Azure services such as the Computer Vision API and
Azure Functions.
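
A hedged sketch of the image-analysis call at the heart of this scenario: it posts an image URL to the Computer Vision analyze endpoint (assuming the v3.2 REST path) and prints the returned tags. The endpoint, key, and image URL are placeholders; in the architecture this call runs inside an Azure Function.

    # Hedged sketch of calling the Computer Vision analyze endpoint over REST.
    import requests

    ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"   # placeholder
    KEY = "<subscription-key>"                                         # placeholder

    response = requests.post(
        f"{ENDPOINT}/vision/v3.2/analyze",
        params={"visualFeatures": "Tags,Description"},
        headers={"Ocp-Apim-Subscription-Key": KEY, "Content-Type": "application/json"},
        json={"url": "https://example.com/sample-image.jpg"},          # placeholder image
    )
    response.raise_for_status()
    for tag in response.json().get("tags", []):
        print(tag["name"], tag["confidence"])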

Image classification with Convolutional Neural Networks


12/16/2019
2 min read
Explore transfer learning, convolutional neural networks, and gradient-boosting decision tree algorithms.

Information discovery with deep learning and NLP


12/16/2019
2 min read
See how deep learning and natural language processing can be used effectively with the Microsoft AI platform.

Interactive Voice Response Bot


12/16/2019
3 min read
This solution creates an intelligent interactive voice response (IVR) application that processes customer order requests for
bicycles and bicycle accessories. Businesses with no existing IVR solution can easily get started automating requests; where
human-operated systems already exist, this solution can be extended to incorporate existing functionality and workflows.

Keyword search/speech-to-text/OCR digital media


12/16/2019
2 min read
A speech-to-text solution allows you to identify speech in static video files so you can manage it as standard content, such
as allowing employees to search within training videos for spoken words or phrases, and then enabling them to quickly
navigate to the specific moment in the video.

MLOps for Python models using Azure Machine Learning


5/09/2019
8 min read
Implement continuous integration (CI), continuous delivery (CD), and retraining for Azure Machine Learning.

Machine Learning model training with AKS


12/16/2019
1 min read
Machine Learning with AKS

Machine teaching
7/05/2020
22 min read
Learn how machine teaching incorporates artificial intelligence, machine learning, deep reinforcement learning, and subject
matter expertise.

Machine teaching with the Microsoft Autonomous Systems platform


7/05/2020
7 min read
Learn how the Microsoft Autonomous Systems platform uses machine teaching, deep reinforcement learning, and
simulations to build and deploy autonomous systems with Bonsai.

Movie recommendations on Azure


1/09/2019
4 min read
Use machine learning to automate movie, product, and other recommendations using machine learning and an Azure Data
Science Virtual Machine (DSVM) to train a model on Azure.

Optimize Marketing with Machine Learning


12/16/2019
2 min read
Learn how to build a machine-learning model with SQL Server 2016 with R Services to optimize and manage marketing
campaigns.

Personalized Offers
12/16/2019
3 min read
In today's highly competitive and connected environment, modern businesses can no longer survive with generic, static
online content. Furthermore, marketing strategies using traditional tools are often expensive, hard to implement, and do
not produce the desired return on investment. These systems often fail to take full advantage of the data collected to create
a more personalized experience for the user.
Personalized marketing solutions
12/16/2019
2 min read
Find essential technology to market your products with personalized offers. Individualize your marketing for greater
customer response using big-data insights.

Population Health Management for Healthcare


12/16/2019
3 min read
Population Health Management is an important tool that is increasingly being used by health care providers to manage and
control escalating costs. The crux of Population Health Management is to use data to improve health outcomes. Tracking,
monitoring, and benchmarking are the three bastions of Population Health Management, aimed at improving clinical and
health outcomes while managing and reducing cost.

Predict Length of Stay and Patient Flow


12/16/2019
2 min read
Learn how to predict capacity and patient flow for your hospital or healthcare facility to enhance the quality of care and
improve operational efficiency.

Predictive Insights with Vehicle Telematics


12/16/2019
2 min read
Learn how car dealerships, manufacturers, and insurance companies can use Microsoft Azure to gain predictive insights on
vehicle health and driving habits.

Predictive Maintenance
12/16/2019
3 min read
This Predictive Maintenance solution monitors aircraft and predicts the remaining useful life of aircraft engine components.

Predictive Marketing with Machine Learning


12/16/2019
2 min read
Learn how to build a machine-learning model with Microsoft R Server on Azure HDInsight Spark clusters to recommend
actions to maximize the purchase rate.
Quality Assurance
12/16/2019
3 min read
Quality assurance systems allow businesses to prevent defects throughout their processes of delivering goods or services
to customers. Building such a system that collects data and identifies potential problems along a pipeline can provide
enormous advantages. For example, in digital manufacturing, quality assurance across the assembly line is imperative.
Identifying slowdowns and potential failures before they occur rather than after they are detected can help companies
reduce costs for scrap and rework while improving productivity.

Real-time scoring of Python models


1/28/2019
8 min read
This reference architecture shows how to deploy Python models as web services on Azure to make real-time predictions.
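
In the reference architecture the model is packaged with Azure Machine Learning and served from AKS; the sketch below is only a minimal, generic web-service shape for real-time scoring, with a placeholder predict function standing in for the trained model.

    # Minimal real-time scoring web service sketch (not the architecture's actual
    # deployment code): a Flask endpoint that scores a JSON payload.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def predict(features):
        # Placeholder for model.predict(); returns a dummy score.
        return sum(features) / max(len(features), 1)

    @app.route("/score", methods=["POST"])
    def score():
        payload = request.get_json()
        return jsonify({"score": predict(payload["features"])})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)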

Real-time scoring of R machine learning models


12/10/2019
5 min read
Implement a real-time prediction service in R using Machine Learning Server running in Azure Kubernetes Service (AKS).

Remote Patient Monitoring Solutions


12/16/2019
2 min read
Provide a high level of preventative medical care with remote patient monitoring from Azure. Analyze large amounts of
medical data in a secure environment.

Retail Assistant with Visual Capabilities


12/16/2019
1 min read
The retail assistant or vacation planner helps your customers interact with your business bot and provides suggestions
based on the visual information.

Scalable personalization on Azure


5/31/2019
6 min read
Use machine learning to automate content-based personalization for customers.

Speech Services
12/16/2019
1 min read
A custom acoustic model helps Speech Services understand speakers even with background noise or poor phone
connections.

Speech-to-text conversion
3/25/2020
7 min read
This article describes the recommended way to upload audio files and process the speech content to text.

Training Python models on Azure


7/17/2019
6 min read
This reference architecture shows recommended practices for tuning the hyperparameters (training parameters) of a scikit-
learn Python model.
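
The reference architecture runs the hyperparameter sweep with Azure Machine Learning; as a rough local analogue, the sketch below tunes a scikit-learn model with GridSearchCV over an assumed parameter grid.

    # Local scikit-learn analogue of the hyperparameter sweep described above.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = load_iris(return_X_y=True)
    param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}  # assumed grid

    search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)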

Vision classifier model with Azure Custom Vision Cognitive Service


6/03/2020
2 min read
Create an image classifier with a solution architecture that includes Microsoft AirSim Drone simulator and Azure Custom
Vision Cognitive Service.

Visual Assistant
12/16/2019
1 min read
Visual assistant provides rich information based on the content of an image, with capabilities such as reading business
cards, identifying barcodes, and recognizing popular people, places, objects, artworks, and monuments.


Analytics
Advanced Analytics Architecture
12/16/2019
2 min read
Get near real-time data analytics on streaming services. This big data architecture allows you to combine any data at any
scale with custom machine learning.

Anomaly Detector Process


12/16/2019
1 min read
Learn more about Anomaly Detector with a step-by-step flowchart that details the process. See how anomaly detection
models are selected with time-series data.

Application integration using Event Grid


12/16/2019
1 min read
Event Grid connects your app with other services. For example, create an application topic to send your app's event data to
Event Grid and take advantage of its reliable delivery, advanced routing, and direct integration with Azure. Alternatively, you
can use Event Grid with Logic Apps to process data anywhere, without writing code.
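
As an illustration of the custom-topic flow described above, the sketch below posts a single event to an Event Grid topic endpoint over REST. The endpoint URL, access key, subject, and event type are placeholder assumptions; the envelope fields and the aeg-sas-key header follow the Event Grid custom event schema.

    # Minimal sketch of publishing a custom event to an Event Grid topic over REST.
    import json
    import uuid
    from datetime import datetime, timezone

    import requests

    TOPIC_ENDPOINT = "https://my-app-topic.westus2-1.eventgrid.azure.net/api/events"  # placeholder
    TOPIC_KEY = "<topic-access-key>"                                                  # placeholder

    event = {
        "id": str(uuid.uuid4()),
        "subject": "orders/created",                  # hypothetical subject
        "eventType": "MyApp.OrderCreated",            # hypothetical event type
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "data": {"orderId": 1234, "total": 56.78},
        "dataVersion": "1.0",
    }

    # Event Grid expects a JSON array of events and the aeg-sas-key header.
    response = requests.post(
        TOPIC_ENDPOINT,
        params={"api-version": "2018-01-01"},
        headers={"aeg-sas-key": TOPIC_KEY, "Content-Type": "application/json"},
        data=json.dumps([event]),
    )
    response.raise_for_status()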

Automated enterprise BI
6/03/2020
13 min read
Automate an extract, load, and transform (ELT) workflow in Azure using Azure Data Factory with Azure Synapse Analytics.

Data warehousing and analytics


11/20/2019
5 min read
This example demonstrates a data pipeline that integrates large amounts of data from multiple sources into a unified
analytics platform in Azure.

Demand Forecasting
12/16/2019
3 min read
Accurately forecasting spikes in demand for products and services can give a company a competitive advantage. This
solution focuses on demand forecasting within the energy sector.
Demand Forecasting + Price Optimization
12/16/2019
2 min read
Predict future customer demand and optimize pricing to maximize profitability using big-data and advanced-analytics
services from Microsoft Azure.

Demand Forecasting and Price Optimization


12/16/2019
3 min read
Pricing is recognized as a pivotal determinant of success in many industries and can be one of the most challenging tasks.
Companies often struggle with several aspects of the pricing process, including accurately forecasting the financial impact
of potential tactics, taking reasonable consideration of core business constraints, and fairly validating the executed pricing
decisions. Expanding product offerings add further computational requirements to make real-time pricing decisions,
compounding the difficulty of this already overwhelming task.

Demand Forecasting for Shipping and Distribution


12/16/2019
5 min read
Use historical demand data to forecast future periods across various customers, products and destinations.

Discovery Hub with Cloud Scale Analytics


12/16/2019
2 min read
Build a modern data estate that is ready for cloud scale analytics with a step-by-step flowchart from Microsoft Azure.

Extend your on-premises big data investments with HDInsight


12/16/2019
1 min read
Extend your on-premises big data investments to the cloud and transform your business using the advanced analytics
capabilities of HDInsight.

Extract, transform, and load (ETL) using HDInsight


12/16/2019
1 min read
Extract, transform, and load your big data clusters on demand with Hadoop MapReduce and Apache Spark.
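
As a sketch of the kind of on-demand Spark job such a cluster would run, the PySpark snippet below reads raw CSV data, applies a simple transformation, and writes curated Parquet output. The storage paths and column name are placeholder assumptions, not values from the architecture.

    # Sketch of an on-demand Spark ETL job; paths and column names are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("etl-example").getOrCreate()

    raw = spark.read.option("header", True).csv("wasbs://data@account.blob.core.windows.net/raw/")
    cleaned = (
        raw.filter(F.col("amount").isNotNull())                 # drop incomplete rows
           .withColumn("amount", F.col("amount").cast("double"))  # normalize the type
    )
    cleaned.write.mode("overwrite").parquet("wasbs://data@account.blob.core.windows.net/curated/")

    spark.stop()
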
Interactive Price Analytics
12/16/2019
4 min read
The Pricing Analytics solution uses your transactional history data to show you how the demand for your products
responds to the prices you offer, to recommend pricing changes, and allow you to simulate how changes in price would
affect your demand, at a fine granularity.

Mass ingestion and analysis of news feeds on Azure


2/01/2019
5 min read
Create a pipeline for ingesting and analyzing text, images, sentiment, and other data from RSS news feeds using only Azure
services, including Azure Cosmos DB and Azure Cognitive Services.

Oil and Gas Tank Level Forecasting


12/16/2019
3 min read
Today, most facilities operate reactively to problems in tank levels. This often leads to spills, emergency shutdowns,
expensive remediation costs, regulatory issues, costly repairs and fines. Tank level forecasting helps manage and abate
these and other problems.

Partitioning in Event Hubs and Kafka


10/02/2020
12 min read
Learn about partitioning in Kafka and Event Hubs with Kafka. See how many partitions to use in ingestion pipelines and
how to assign events to partitions.
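
To illustrate the key-to-partition assignment discussed in the article, the sketch below hashes a partition key and takes the result modulo the partition count, so events that share a key land on the same partition and per-key ordering is preserved. The MD5-based hash is only for illustration; Event Hubs and Kafka use their own hashing algorithms.

    # Illustration of key-based partition assignment (not the exact hash either service uses).
    import hashlib

    def assign_partition(partition_key: str, partition_count: int) -> int:
        digest = hashlib.md5(partition_key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % partition_count

    for device in ["device-1", "device-2", "device-3"]:
        print(device, "->", assign_partition(device, partition_count=4))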

Predicting Length of Stay in Hospitals


12/16/2019
3 min read
This solution enables a predictive model for Length of Stay for in-hospital admissions. Length of Stay (LOS) is defined as the
number of days from the initial admission date to the date that the patient is discharged from any given hospital facility.
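
A trivial worked example of the LOS definition above, with hypothetical admission and discharge dates.

    from datetime import date

    admit = date(2019, 3, 1)        # hypothetical admission date
    discharge = date(2019, 3, 6)    # hypothetical discharge date
    length_of_stay = (discharge - admit).days   # 5 days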

Predictive Aircraft Engine Monitoring


12/16/2019
2 min read
Microsoft Azure's Predictive Maintenance solution demonstrates how to combine real-time aircraft data with analytics to
monitor aircraft health.

Product recommendations for retail using Azure


12/16/2019
2 min read
This solution implements a process of aggregating customer data into a complete profile, and uses advanced machine
learning models backed by the reliability and processing power of Azure to provide predictive insights on simulated
customers.

Real Time Analytics on Big Data Architecture


12/16/2019
2 min read
Get deep learning analytics and insights live from streaming data. Review logs from website clickstream in near real-time
for advanced analytics processing.

Stream processing with Azure Databricks


11/30/2018
11 min read
Create an end-to-end stream processing pipeline in Azure using Azure Databricks.

Stream processing with Azure Stream Analytics


11/06/2018
10 min read
This reference architecture shows an end-to-end stream processing pipeline, which ingests data, correlates records, and
calculates a rolling average.
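
The pipeline computes the rolling average with a windowed Stream Analytics query; the sketch below shows the same calculation offline in pandas over a few assumed sample readings.

    # Offline pandas analogue of the rolling average the streaming job computes.
    import pandas as pd

    readings = pd.DataFrame(
        {"value": [10.0, 12.0, 11.0, 15.0, 14.0]},                      # assumed sample data
        index=pd.date_range("2020-01-01 00:00", periods=5, freq="1min"),
    )
    readings["rolling_avg"] = readings["value"].rolling("3min").mean()  # 3-minute window
    print(readings)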

Tier Applications & Data for Analytics


12/16/2019
1 min read
Tier applications and data with a solution architecture that includes Azure Stack. Optimize data analytics with a step-by-step
flowchart and detailed instructions.


Blockchain
Blockchain Workflow Application
12/16/2019
2 min read
Explore how blockchain is used to digitize workflows and supply chains across organizations with the Blockchain Workflow
Application from Microsoft Azure.

Decentralized trust between banks


9/09/2018
5 min read
Establish a trusted environment for communication and information sharing without resorting to a centralized database.

Supply Chain Track and Trace


12/16/2019
3 min read
Learn how to use the Azure Blockchain Workbench. Build an asset tracking application for supply chain with a step-by-step
flowchart.


Compute

3D video rendering
7/13/2018
5 min read
Use Azure Batch to manage existing Windows or Linux applications, including AutoDesk Maya and Blender, to run large-
scale 3D video render jobs in Azure.

A computer-aided engineering service


8/22/2018
5 min read
Provide a software-as-a-service (SaaS) platform for computer-aided engineering (CAE) on Azure.

Digital image-based modeling on Azure


1/11/2019
7 min read
Accelerate digital image-based modeling on Azure using Avere and Agisoft PhotoScan

Dynamics Business Central as a Service on Azure


6/25/2020
5 min read
This example shows how to establish the production environment for Business Central in a partner's private Azure
environment.

HPC Risk Analysis Template - Solution Architecture


12/16/2019
3 min read
Build an HPC risk analysis solution architecture with a step-by-step flowchart from Microsoft Azure that combines
CycleCloud, Avere vFXT and TIBCO GridServer.

HPC System and Big Compute Solutions


12/16/2019
2 min read
Explore Big Compute solutions with Azure Batch. Use HPC cloud systems for cloud-native application and batch processing.

HPC cluster deployed in the cloud


12/16/2019
1 min read
High performance computing (HPC) applications can scale to thousands of compute cores, extend on-premises big
compute, or run as a 100% cloud-native solution. This HPC solution, including the head node, compute nodes, and storage
nodes, runs in Azure with no hardware infrastructure to maintain.

Hybrid HPC in Azure with HPC Pack


12/16/2019
2 min read
Get a hybrid high performance computing solution built with Windows Server technology. Use Azure HPC Pack to create a
hybrid HPC environment.

Linux virtual desktops with Citrix


9/12/2018
5 min read
Learn about Virtual Desktop Infrastructure. This example uses a Citrix-based solution for Linux desktops on Azure.

Run a Linux VM on Azure


12/13/2018
8 min read
Learn the best practices for running a Linux virtual machine on Azure, which requires some additional components,
including networking and storage resources.

Run a Windows VM on Azure


12/13/2018
7 min read
Learn the best practices for running a Windows virtual machine on Azure, which requires some additional components,
including networking and storage resources.

Run reservoir simulation software on Azure


3/16/2020
5 min read
Run OPM Flow reservoir simulation software and OPM ResInsight visualization software on an Azure HPC compute cluster
and visualization VM.

Running CFD simulations


9/20/2018
5 min read
Learn about running Computational Fluid Dynamics simulations using Azure. Create, manage, and optimize clusters using
Azure CycleCloud.

Serverless batch processing with Durable Functions in Azure Container Instances


5/19/2020
3 min read
Use the Durable Functions extension of Azure Functions to orchestrate, deploy, and run serverless batch processing jobs in
Azure Container Instances (ACI) containers.
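
A hedged sketch of the fan-out/fan-in orchestration idea using the Python Durable Functions SDK; the activity names ListInputFiles and ProcessFile are hypothetical, and the usual function.json bindings are assumed to exist alongside this code.

    # Durable Functions orchestrator sketch (Python programming model v1).
    import azure.durable_functions as df

    def orchestrator_function(context: df.DurableOrchestrationContext):
        files = yield context.call_activity("ListInputFiles", None)           # hypothetical activity
        tasks = [context.call_activity("ProcessFile", f) for f in files]      # hypothetical activity
        results = yield context.task_all(tasks)                               # fan-out/fan-in over the batch
        return results

    main = df.Orchestrator.create(orchestrator_function)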

Containers

API-first SaaS business model


4/17/2020
2 min read
Adapt, evolve, and allow faster innovation to turn opportunities into strategic advantages.

Baseline architecture for an Azure Kubernetes Service (AKS) cluster


8/01/2020
32 min read
Reference architecture for a baseline infrastructure that deploys an Azure Kubernetes Service (AKS) cluster.

Building a telehealth system on Azure


5/25/2019
7 min read
A healthcare system connecting users, devices, and providers built using Kubernetes and PostgreSQL

Bursting from AKS with ACI


12/16/2019
1 min read
Bursting from AKS with ACI

CI/CD pipeline for container-based workloads


7/05/2018
7 min read
Build a DevOps pipeline for a Node.js web app with Jenkins, Azure Container Registry, Azure Kubernetes Service, Cosmos
DB, and Grafana.

Elastic demand handling with AKS


4/17/2020
2 min read
Achieve fast and reliable service quality during seasonal and other high-traffic demand periods
Instant IoT data streaming with AKS
4/17/2020
2 min read
Ingest and analyze high volumes of IoT data and generate real-time recommendations

Microservices architecture on Azure Kubernetes Service (AKS)


5/07/2020
17 min read
Deploy a microservices architecture on Azure Kubernetes Service (AKS)

Microservices with AKS


12/16/2019
1 min read
Microservices with AKS

Secure DevOps for AKS


9/10/2020
2 min read
Implementing secure DevOps with Kubernetes on Azure, you can achieve the balance between speed and security and
deliver code faster at scale.


Databases

Azure data platform end-to-end


1/31/2020
7 min read
Use Azure services to ingest, process, store, serve, and visualize data from different sources.

Build cloud native applications


6/26/2020
3 min read
Build cloud native applications with Azure Cosmos DB, Azure Database for PostgreSQL and Azure Cache for Redis
Campaign Optimization with Azure HDInsight Spark Clusters
12/16/2019
4 min read
This solution demonstrates how to build and deploy a machine learning model with Microsoft R Server on Azure HDInsight
Spark clusters to recommend actions to maximize the purchase rate of leads targeted by a campaign. This solution enables
efficient handling of big data on Spark with Microsoft R Server.

Campaign Optimization with SQL Server


12/16/2019
5 min read
This solution demonstrates how to build and deploy a machine learning model with SQL Server 2016 with R Services to
recommend actions to maximize the purchase rate of leads targeted by a campaign.

Data Streaming scenario


12/16/2019
1 min read
Data Streaming scenario

Data cache
12/16/2019
1 min read
Implement a data cache with Azure Cache for Redis in front of Azure databases such as Azure Cosmos DB.

DataOps for the modern data warehouse


7/14/2020
6 min read
How to apply DevOps principles to data pipelines built according to the modern data warehouse (MDW) architectural
pattern on Microsoft Azure.

Deliver highly scalable customer service and ERP applications


6/26/2020
3 min read
Deliver highly scalable customer service and ERP applications with Azure SQL and Azure Cosmos DB
Development and test environments for SAP workloads on Azure
7/11/2018
4 min read
Learn how to establish a dev/test environment for SAP NetWeaver on Azure. This architecture is designed for non-
production environments.

Digital Campaign Management


12/16/2019
1 min read
Use Azure Database for PostgreSQL to engage with customers around the world with rich, personalized digital marketing
experiences.

Digital Marketing using Azure Database for MySQL


12/16/2019
1 min read
Use Azure Database for MySQL to engage with customers around the world with rich, personalized digital marketing
experiences.

Finance management apps using Azure Database for MySQL


12/16/2019
1 min read
Use Azure Database for MySQL to securely store critical data and provide high-value analytics and insights over aggregated
data to users.

Finance management apps using Azure Database for PostgreSQL


12/16/2019
1 min read
Use Azure Database for PostgreSQL to securely store critical data and provide high-value analytics and insights over
aggregated data to users.

Gaming using Azure Database for MySQL


12/16/2019
1 min read
Scale your databases elastically to accommodate unpredictable bursts of traffic and deliver low-latency multi-player
experiences.

Gaming using Cosmos DB


12/16/2019
1 min read
Elastically scale your database to accommodate unpredictable bursts of traffic and deliver low-latency multi-player
experiences on a global scale.

Globally distributed applications using Cosmos DB


12/16/2019
1 min read
Guarantee access to users around the world with the high-availability and low-latency capabilities built into Microsoft's
global datacenters.

Hybrid ETL with Azure Data Factory


11/20/2019
5 min read
Hybrid ETL with existing on-premises SQL Server Integration Services (SSIS) deployments and Azure Data Factory.

Intelligent apps using Azure Database for MySQL


12/16/2019
1 min read
Use Azure Database for MySQL to develop sophisticated machine learning and visualization apps for actionable insights
and analytics.

Intelligent apps using Azure Database for PostgreSQL


12/16/2019
1 min read
Use Azure Database for PostgreSQL to develop sophisticated machine learning and visualization apps for actionable
insights and analytics.

Interactive querying with HDInsight


12/16/2019
1 min read
Perform fast, interactive SQL queries at scale over structured or unstructured data with Apache Hive LLAP.
Loan ChargeOff Prediction with Azure HDInsight Spark Clusters
12/16/2019
4 min read
Using Azure HDInsight R Server, a lending institution can leverage machine learning predictive analytics to predict the
likelihood of loans getting charged off and run a report on the analytics result stored in HDFS and hive tables.

Loan ChargeOff Prediction with SQL Server


12/16/2019
3 min read
This solution demonstrates how to build and deploy a machine learning model with SQL Server 2016 with R Services to
predict whether a bank loan will need to be charged off within the next three months.

Loan Credit Risk + Default Modeling


12/16/2019
2 min read
Using SQL Server 2016 with R Services, lenders can predict a borrower's credit risk and default probability to help issue
fewer unprofitable loans.

Loan Credit Risk with SQL Server


12/16/2019
3 min read
Using SQL Server 2016 with R Services, a lending institution can use predictive analytics to reduce the number of loans it
offers to borrowers who are most likely to default, increasing the profitability of its loan portfolio.

Messaging
12/16/2019
1 min read
Use Azure Cache for Redis for publish/subscribe messaging and with WebSockets and other web communication frameworks.

Modern Data Warehouse Architecture


12/16/2019
2 min read
Explore a cloud data warehouse that uses big data. Modern data warehouse brings together all your data and scales easily
as your data grows.
N-tier application with Apache Cassandra
8/21/2019
11 min read
Run Linux virtual machines for an N-tier architecture with Apache Cassandra in Microsoft Azure.

Ops automation using Event Grid


12/16/2019
1 min read
Event Grid allows you to speed automation and simplify policy enforcement. For example, Event Grid can notify Azure
Automation when a virtual machine is created, or a SQL Database is spun up. These events can be used to automatically
check that service configurations are compliant, put metadata into operations tools, tag virtual machines, or file work items.

Oracle Database Migration to Azure


12/16/2019
2 min read
Oracle DB migrations can be accomplished in multiple ways. This architecture covers one of these options wherein Oracle
Active Data Guard is used to migrate the Database.

Oracle database migration: Cross-cloud connectivity


6/23/2020
2 min read
Create a connection between your existing Oracle Database and your Azure applications.

Oracle database migration: Lift and shift


6/23/2020
2 min read
Lift and shift your Oracle database from an Oracle environment to Azure Virtual Machines.

Oracle database migration: Rearchitect


6/23/2020
3 min read
Rearchitect your Oracle database with Azure SQL Managed Instance
Oracle database migration: Refactor
6/23/2020
3 min read
Refactor your Oracle database with Azure Database Migration Service and move it to PostgreSQL.

Overview of Oracle database migration


6/23/2020
3 min read
Learn about Oracle database migration paths and the methods you use to migrate your schema to SQL or PostgreSQL.

Personalization using Cosmos DB


12/16/2019
1 min read
Generate personalized recommendations for customers in real time, using low-latency and tunable consistency settings for
immediate insights

Retail and e-commerce using Azure MySQL


12/16/2019
1 min read
Build secure and scalable e-commerce solutions that meet the demands of both customers and business using Azure
Database for MySQL.

Retail and e-commerce using Azure PostgreSQL


12/16/2019
1 min read
Build secure and scalable e-commerce solutions that meet the demands of both customers and business using Azure
Database for PostgreSQL.

Retail and e-commerce using Cosmos DB


12/16/2019
1 min read
Support in-depth queries over diverse product catalogs, traffic spikes, and rapidly changing inventory.

Run SAP BW/4HANA with Linux virtual machines on Azure


6/10/2020
11 min read
This example focuses specifically on the SAP BW/4HANA application tier and is suitable for a small-scale production
environment of SAP BW/4HANA on Azure where high availability is a priority.

Run SAP HANA for Linux virtual machines in a scale-up architecture on Azure
6/10/2020
10 min read
Proven practices for running SAP HANA in a high-availability, scale-up environment that supports disaster recovery on
Azure.

Run SAP HANA on Azure (Large Instances)


4/02/2020
11 min read
Proven practices for running SAP HANA in a high availability environment on Azure Large Instances.

Run SAP NetWeaver in Windows on Azure


2/19/2020
21 min read
Proven practices for running SAP NetWeaver in a Windows environment on Azure with high availability.

Running Oracle Databases on Azure


12/16/2019
2 min read
This solution architecture illustrates a canonical architecture to achieve high availability for your Oracle Database Enterprise
Edition in Azure.

SAP NetWeaver on SQL Server


12/16/2019
2 min read
The NetWeaver on SQL Server application solution illustrates how a user request flows through an SAP landscape built on
NetWeaver by utilizing Azure Virtual Machines to host SAP applications and a SQL Server database.

SAP S/4HANA in Linux on Azure


2/19/2020
22 min read
Proven practices for running SAP S/4HANA in a Linux environment on Azure with high availability.

SAP deployment on Azure using an Oracle DB


9/12/2018
5 min read
This example demonstrates a SAP deployment on virtual machines on Azure, along with a High Availability Oracle database.

SAP on Azure Architecture Guide


6/10/2020
11 min read
This guide is based on the Microsoft Azure Well-Architected Framework, but the recommendations are specific to
deployments of SAP solutions.

Saga distributed transactions


7/21/2020
6 min read
Understand and use the saga design pattern to ensure data consistency on distributed transactions in microservices
architectures.

Serverless apps using Cosmos DB


12/16/2019
1 min read
Use Azure Functions and Azure Cosmos DB to build globally distributed, scalable serverless applications.

Streaming using HDInsight


12/16/2019
1 min read
Ingest and process millions of streaming events per second with Apache Kafka, Apache Storm, and Apache Spark Streaming.

Windows N-tier application on Azure


9/25/2020
13 min read
Implement a multi-tier architecture on Azure for availability, security, scalability, and manageability.
back to top

DevOps

Banking system cloud transformation on Azure


8/11/2020
11 min read
Solution for monitoring banking system infrastructure scalability and performance.

Build a CI/CD pipeline for chatbots with ARM templates


3/25/2020
9 min read
Use Azure Pipelines with ARM templates to set up a CI/CD pipeline to Azure App Service and Azure Bot Service for a
chatbot app.

CI/CD for Azure VMs


12/16/2019
2 min read
Azure is a world-class cloud for hosting virtual machines running Windows or Linux. Whether you use ASP.NET, Java,
Node.js, or PHP to develop applications, you'll need a continuous integration and continuous deployment (CI/CD) pipeline
to push changes to these virtual machines automatically.

CI/CD for Azure Web Apps


12/16/2019
2 min read
Azure Web Apps is a fast and simple way to create web apps using ASP.NET, Java, Node.js, or PHP. Deliver value faster to
your customers with a continuous integration and continuous deployment (CI/CD) pipeline that pushes each of your
changes automatically to Web Apps.

CI/CD for Containers


12/16/2019
2 min read
Containers make it easy for you to continuously build and deploy your applications. By orchestrating deployment of those
containers using Kubernetes in Azure Kubernetes Service (AKS), you can achieve replicable, manageable clusters of
containers.
Container CI/CD using Jenkins and Kubernetes on Azure Kubernetes Service (AKS)
12/16/2019
2 min read
Containers make it easy for you to continuously build and deploy applications. By orchestrating the deployment of those
containers using Azure Kubernetes Service (AKS), you can achieve replicable, manageable clusters of containers.

DevOps in a hybrid environment


12/16/2019
3 min read
The tools provided in Azure allow for the implementation of a DevOps strategy that capably manages both cloud and on-
premises environments in tandem.

DevSecOps in Azure
12/16/2019
4 min read
DevSecOps applies security best practices from the beginning of development, shifting security left from end-of-cycle auditing
into the development process itself.

DevSecOps in GitHub
9/01/2020
8 min read
Learn how GitHub tools make security practices an integral part of DevOps while maintaining efficiency. See how to use
these tools within an Azure framework.

DevTest Image Factory


12/16/2019
2 min read
Create, maintain, and distribute custom images with the DevTest Image Factory, an automated image development and
management solution from Azure DevTest Labs.

DevTest and DevOps for IaaS solutions


9/17/2020
5 min read
Learn how to configure a DevTest and DevOps infrastructure for development, testing, and deploying IaaS-based software.
DevTest and DevOps for PaaS solutions
9/17/2020
6 min read
Learn how to configure a DevTest and DevOps infrastructure for development, testing, and deploying PaaS-based software.

DevTest and DevOps for microservice solutions


9/17/2020
5 min read
Learn how to configure a DevTest and DevOps infrastructure for development, testing, and deploying microservice-based
software.

Immutable Infrastructure CI/CD using Jenkins and Terraform on Azure Virtual Architecture overview
12/16/2019
2 min read
Azure is a world-class cloud for hosting virtual machines running Windows or Linux. Whether you use Java, Node.js, Go, or
PHP to develop your applications, you'll need a continuous integration and continuous deployment (CI/CD) pipeline to push
changes to these virtual machines automatically.

JMeter implementation reference for load testing pipeline solution


8/11/2020
3 min read
Scalable cloud load testing pipeline creates and destroys infrastructure on-demand for stress testing.

Java CI/CD using Jenkins and Azure Web Apps


12/16/2019
2 min read
Azure App Service is a fast and simple way to create web apps using Java, Node, PHP or ASP.NET, as well as support for
custom language runtimes using Docker. A continuous integration and continuous deployment (CI/CD) pipeline that pushes
each of your changes automatically to Azure app services allows you to deliver value faster to your customers.

Run a Jenkins server on Azure


4/30/2018
10 min read
Recommended architecture that shows how to deploy and operate a scalable, enterprise-grade Jenkins server on Azure
secured with single sign-on (SSO).

Run containers in a hybrid environment


6/29/2020
6 min read
Run containers from an enterprise container registry on-premises and in Azure

SharePoint Farm for Development Testing


12/16/2019
2 min read
Learn how to deploy a SharePoint farm for use as a development testing environment with a step-by-step flowchart from
Azure.

Sharing location in real time using low-cost serverless Azure services


12/09/2019
6 min read
SignalR configured in serverless mode works with an Azure Function that is triggered by Service Bus, all built on .NET Core. This
scenario is best suited for real-time messaging applications where users require a low-cost but robust messaging service.

Unified logging for microservices applications


7/13/2020
8 min read
Learn about logging, tracing, and monitoring for microservices apps. Run synthetic logging for testing, and create
structured logs for analysis.

back to top

Developer Tools

Azure Machine Learning decision guide for optimal tool selection


7/07/2020
6 min read
How to choose the best services for building an end-to-end machine learning pipeline from experimentation to
deployment.
Design a CI/CD pipeline using Azure DevOps
12/06/2018
5 min read
This architecture and design guidance example is a continuous integration and deployment pipeline for a two-tier .NET web
application to the Azure App Service.

Event-based cloud automation


1/14/2020
12 min read
Recommended architecture for implementing cloud automation using serverless technologies.

Machine Learning Operations maturity model


7/07/2020
2 min read
Detailed explanation of the MLOps maturity model stages and defining characteristics of each stage.

Machine learning operations (MLOps) framework to upscale machine learning Lifecycle with Azure Machine Learning
6/01/2020
26 min read
How the machine learning operations (MLOps) maturity model was applied to implement a machine learning solution for
predicting product shipping levels.

Microservices architecture on Azure Service Fabric


6/13/2019
22 min read
Use this reference architecture to see microservices deployed to Azure Service Fabric. This cluster configuration can be a
starting point for most deployments.

Multicloud solutions with the Serverless Framework


3/30/2020
7 min read
Develop and deploy serverless microservices across cloud providers with the open-source Serverless Framework and
Serverless Multicloud Library.

Serverless application architectures using Event Grid


12/16/2019
1 min read
Event Grid connects data sources and event handlers. For example, use Event Grid to instantly trigger a serverless function
to run image analysis each time a new photo is added to a blob storage container.

Serverless event processing


10/16/2018
7 min read
Reference architecture for serverless event ingestion and processing using Azure Functions.

back to top

Identity

Create an AD DS resource forest in Azure


5/02/2018
6 min read
How to create a trusted Active Directory domain in Azure.

Deploy AD DS in an Azure virtual network


5/02/2018
8 min read
Use this reference architecture to extend an on-premises Active Directory domain to Azure to provide distributed
authentication services.

Extend on-premises AD FS to Azure


12/18/2018
13 min read
Implement a secure hybrid network architecture with Active Directory Federation Service authorization in Azure.

Hybrid Identity
9/01/2020
1 min read
The Hybrid Identity blueprint enables teams to manage applications and user identity consistently across clouds by using
Azure Stack Hub.

Integrate on-premises AD domains with Azure AD


8/28/2019
13 min read
Implement a secure hybrid network architecture using Azure Active Directory (Azure AD).

Integrate on-premises AD with Azure


7/30/2019
4 min read
Compare reference architectures for integrating on-premises Active Directory with Azure.

Multitenant SaaS on Azure


7/16/2020
11 min read
Build a multitenant SaaS solution on Azure, designed for high availability, scalability, data security and isolation using App
Service, Azure Kubernetes Service, and SQL Elastic Pools.

back to top

Integration

Basic enterprise integration on Azure


12/03/2018
9 min read
Recommended architecture for implementing a simple enterprise integration pattern using Azure Logic Apps and Azure API
Management.

Enterprise business intelligence


11/20/2019
12 min read
See an extract, load, and transform pipeline that moves data from an on-premises SQL Server database into Azure Synapse
and transforms the data for analysis.
Enterprise integration using queues and events
12/03/2018
5 min read
Recommended architecture for implementing an enterprise integration pattern with Azure Logic Apps, Azure API
Management, Azure Service Bus, and Azure Event Grid.

Modernize .NET applications


6/26/2020
3 min read
Modernize .NET applications with Azure SQL DB and Azure App Service

On-premises data gateway for Azure Logic Apps


7/16/2020
3 min read
Using a data gateway to connect on-premises data sources to Azure Logic Apps

Patterns and implementations


8/11/2020
4 min read
Details about the patterns and implementations used when the commercial software engineering team created the banking
system cloud transformation solution.

Publishing internal APIs to external users


3/12/2019
8 min read
In this scenario, an organization consolidates multiple APIs internally using Azure API Management deployed inside a
Virtual Network.

back to top

Internet of Things

Attestation, authentication, and provisioning


8/21/2020
3 min read
Learn how devices confirm their identity to, authenticate with, and are provisioned in Azure IoT Hub.

Azure IoT reference architecture


9/10/2020
12 min read
Recommended architecture for IoT applications on Azure using PaaS (platform-as-a-service) components

Builders, developers, and operators


8/10/2020
3 min read
Learn about the device builder, application developer, and solution operator roles and how they interact in an IoT solution.

COVID-19 Safe Solutions with IoT Edge


6/06/2020
3 min read
Learn how to use CCTV, IoT Edge analytics and machine learning, and cloud services to monitor social distancing, mask/PPE
use, and occupancy requirements.

Cognizant Safe Buildings with IoT and Azure


6/09/2020
4 min read
Deploy Cognizant Safe Buildings, IoT, and Azure services to protect buildings from COVID-19 outbreaks.

Condition Monitoring for Industrial IoT


8/05/2020
6 min read
This example demonstrates how manufacturers can connect their assets to the cloud using OPC UA and the Industrial
Components.

Contactless IoT interfaces with Azure intelligent edge


6/06/2020
3 min read
Combine intelligent and perceptive edge devices with the storage and computing power of the cloud to create touch-free
interfaces.
Controlling IoT devices using a Voice Assistant
3/23/2020
2 min read
Create seamless conversational interfaces with all of your internet-accessible devices, from your connected television or
fridge to devices in a connected power plant. By combining Azure Speech Service, Language Understanding Service (LUIS),
and Azure Bot Framework, developers can create natural, human-like conversational interfaces to control smart devices.

Field and cloud edge gateways


8/21/2020
1 min read
Learn about IoT gateways, communications protocols, and provisioning.

IoT Connected Platform for COVID-19 detection and prevention


6/12/2020
3 min read
Deploy a connected ecosystem of intelligent IoT Edge devices, Azure services, and cloud-powered apps to create safe and
healthy public spaces.

IoT and data analytics


8/29/2018
5 min read
Use IoT devices and data analytics to provide comprehensive management and operation of construction projects.

IoT application-to-device commands


8/10/2020
5 min read
Learn about how applications can use cloud-to-device messaging or direct methods to send commands to IoT devices.

IoT connected light, power, and internet for emerging markets


8/17/2020
4 min read
Learn how Veriown uses solar-powered IoT devices with Azure services to provide low-cost, clean power, light, and internet
connectivity to remote customers.
IoT solution architecture
8/10/2020
2 min read
Understand the topological relationship between IoT devices, platform, and applications. Learn about IoT gateways,
communications protocols, and provisioning.

IoT solutions conceptual overview


8/10/2020
5 min read
Get an overview of the functional interactions between events, insights, and actions in Azure Internet of Things (IoT)
solutions.

IoT using Cosmos DB


3/23/2020
5 min read
Scale instantly and elastically to accommodate diverse and unpredictable IoT workloads without sacrificing ingestion or
query performance.

Predictive maintenance with the intelligent IoT Edge


7/28/2020
6 min read
Learn how a railway company implemented predictive safety maintenance using machine learning on the Azure intelligent
IoT Edge platform.

Process real-time vehicle data using IoT


3/10/2020
7 min read
This example builds a real-time data ingestion/processing pipeline to ingest and process messages from IoT devices into a
big data analytic platform in Azure.

Retail - Buy online, pickup in store (BOPIS)


10/01/2020
6 min read
Learn how Azure IoT can support retail stores that implement buy online, pickup in store (BOPIS) scenarios.

Scale IoT solutions with application stamps


8/10/2020
4 min read
Learn about scaling up IoT device populations with application stamps, and strategies for moving devices and applications
between stamps.

Secure your IoT SaaS app with the Microsoft identity platform
3/12/2020
13 min read
How a developer can use Azure Active Directory and the Microsoft identity platform to implement secure authentication
and authorization for SaaS apps that manage cloud-connected IoT devices.

UVEN smart and secure disinfection and lighting


7/15/2020
5 min read
See how BrainLit's UVEN system uses IoT and Azure Sphere to provide smart, safe, secure virus disinfection and healthy,
human-optimized lighting.

back to top

Management and Governance

Azure Arc hybrid management and deployment for Kubernetes clusters


7/23/2020
8 min read
Using Azure Arc to register Kubernetes clusters hosted outside of Azure, and using Azure tools to manage these clusters
along with clusters hosted in Azure Kubernetes Service.
Azure Automation Update Management
7/16/2020
14 min read
Manage operating system updates with Update Management in Azure Automation

Azure Automation in a hybrid environment


7/23/2020
13 min read
Azure Automation leveraged for both on-premises and cloud environments

Back up files and applications on Azure Stack Hub


8/06/2020
26 min read
Backup and restoration of files and applications of VM-based workloads

Build high availability into your BCDR strategy


12/16/2019
1 min read
Virtual machines (VMs) are physically separated across zones, and a virtual network is created using load balancers at each
site. These locations are close enough for high availability replication, so your applications stay running, despite any issues
at the physical locations.

Custom Data Sovereignty & Data Gravity Requirements


12/16/2019
1 min read
Learn how to set custom data sovereignty and data gravity requirements in the cloud with Azure Stack. Find a step-by-step
flow chart to implement this solution.

Disaster Recovery for Azure Stack Hub virtual machines


8/06/2020
29 min read
Optimized approach to disaster recovery of virtual machine (VM)-based user workloads hosted on Azure Stack Hub

Enterprise deployment using App Services Environment


5/08/2020
18 min read
Recommended architecture for deploying an enterprise application using App Services Environment.

Enterprise-scale disaster recovery


12/16/2019
2 min read
A large enterprise architecture for SharePoint, Dynamics CRM, and Linux web servers hosted on an on-premises datacenter
with failover to Azure infrastructure.

High availability and disaster recovery scenarios for IaaS apps.


4/06/2020
7 min read
Choose among high-availability and disaster recovery options like availability sets, availability zones, and regional DR for
deploying IaaS apps to the cloud.

High availability enterprise deployment using App Services Environment


5/08/2020
7 min read
Recommended architecture for deploying an enterprise application using App Services Environment in multiple availability
zones.

Hybrid availability and performance monitoring


7/27/2020
11 min read
Using Azure Monitor to monitor performance and availability for operating system workloads running on-premises, in
third-party cloud providers, and in Microsoft Azure

Manage configurations for Azure Arc enabled servers


7/28/2020
7 min read
Use Azure Arc to extend Azure Resource Manager capabilities to Windows and Linux machines on any infrastructure.

Manage hybrid Azure workloads using Windows Admin Center


7/29/2020
11 min read
Deploy Windows Admin Center to manage environments on-premises and in Azure.

Plan deployment for updating Windows VMs in Azure


3/15/2020
10 min read
A discussion of how best to configure your environment for Windows Server Update Services (WSUS).

Run a highly available SharePoint Server 2016 farm in Azure


7/26/2018
13 min read
Recommended architecture for deploying a high availability SharePoint Server 2016 farm in Azure.

SMB disaster recovery with Azure Site Recovery


12/16/2019
1 min read
Small and medium businesses can inexpensively implement disaster recovery to the cloud by using Azure Site Recovery or
a partner solution like Double-Take DR.

SMB disaster recovery with Double-Take DR


12/16/2019
1 min read
Small and medium businesses can inexpensively implement disaster recovery to the cloud by using a partner solution like
Double-Take DR.

back to top

Media

Instant Broadcasting on Serverless Architecture


12/16/2019
1 min read
Simplify one-to-many real-time communication and updates using serverless code
Live streaming digital media
12/16/2019
2 min read
A live streaming solution allows you to capture video in real time and broadcast it to consumers in real time, such as
streaming interviews, conferences, and sporting events online.

Video-on-demand digital media


12/16/2019
2 min read
A basic video-on-demand solution that gives you the capability to stream recorded video content such as movies, news
clips, sports segments, training videos, and customer support tutorials to any video-capable endpoint device, mobile
application, or desktop browser. Video files are uploaded to Azure Blob storage, encoded to a multi-bitrate standard format,
and then distributed via all major adaptive bit-rate streaming protocols (HLS, MPEG-DASH, Smooth) to the Azure Media
Player client.

back to top

Migration

Adding a mobile front-end to a legacy app


12/16/2019
1 min read
The solution demonstrates modernizing an existing application by consolidating data from multiple business systems into
one place and surfacing it through web and mobile frontends. This is targeted at improving employee productivity and
enabling faster decision making.

Lift and shift to containers with AKS


12/16/2019
1 min read
Lift and shift to containers with AKS

Migrate IBM mainframe applications to Azure with TmaxSoft OpenFrame


9/18/2020
5 min read
Find out how to migrate IBM zSeries mainframe applications to Azure. Learn how to use TmaxSoft OpenFrame for this task.
Understand the lift and shift approach.
Modern Customer Support Portal
12/16/2019
1 min read
This architecture shows how to easily connect multiple business systems to enable customer support.

Serverless Computing Solution for LOB Apps


12/16/2019
1 min read
The solution demonstrates a business process for customer onboarding. This serverless architecture enables you to build
and run applications without having to worry about the underlying infrastructure and the associated management and
maintenance. By using it, you can dramatically improve developer productivity.

Unlock Legacy Data with Azure Stack


12/16/2019
1 min read
Follow a step-by-step flowchart to unlock and preserve legacy data from mainframe applications using Azure Stack.

Using Service Fabric to decompose applications


9/20/2018
6 min read
Learn about Azure Service Fabric as a platform for decomposing an unwieldy monolithic application into multiple,
manageable microservices.

back to top

Mixed Reality

Design Review Powered by Mixed Reality


12/16/2019
2 min read
Too often, product designers waste time and money with inefficient design review: 2D images lose essential detail and
context, and physical prototypes are extremely expensive. With this mixed reality scenario, clients, designers, and on-site
engineers can easily share and review designs as 3D holograms in the context of their environment, accelerating design
decisions and reducing time to market.

Facilities management powered by mixed reality and IoT


12/16/2019
2 min read
Improve uptime and operations in hospitality, manufacturing, retail, and more with mixed reality and IoT. This scenario
shows how you can visualize a virtual replica of your physical space with real-time data in the context of your environment.
It is built on Azure Spatial Anchors and Azure Digital Twins. After reviewing this solution architecture, dive right in and
explore our reference sample.

Training and procedural guidance powered by mixed reality


12/16/2019
3 min read
Enable your team and employees to learn new processes and materials faster, with fewer errors, and greater confidence by
providing persistent holographic instructions mapped to precise locations in their physical workspace.

back to top

Mobile

Custom Business Processes


8/24/2020
2 min read
This example demonstrates how you can deploy portals that automate manual or paper-based processes and surface a rich
user experience. Use Azure API Management and Azure Functions to connect custom APIs that tap into your legacy
systems.

Custom Mobile Workforce App


12/16/2019
2 min read
Learn how the custom mobile workforce management app architecture is built and implemented with a step-by-step
diagram that illustrates the integration of Active Directory, SAP, and Azure App Service.

Line of Business Extension


8/24/2020
2 min read
This example shows how you can modernize legacy systems that cannot support new processes, and provide a better
user experience.
Scalable web and mobile applications using Azure Database for MySQL
12/16/2019
1 min read
Use Azure Database for MySQL to rapidly build engaging, performant, and scalable cross-platform and native apps for iOS,
Android, Windows, or Mac.

Scalable web and mobile applications using Azure Database for PostgreSQL
12/16/2019
1 min read
Use Azure Database for PostgreSQL to rapidly build engaging, performant, and scalable cross-platform and native apps for
iOS, Android, Windows, or Mac.

Social App for Mobile and Web with Authentication


12/16/2019
3 min read
View a detailed, step-by-step diagram depicting the build process and implementation of the mobile client app architecture
that offers social image sharing with a companion web app and authentication abilities, even while offline.

Task-Based Consumer Mobile App


12/16/2019
3 min read
Learn how the task-based consumer mobile app architecture is created with a step-by-step flow chart that shows the
integration with Azure App Service Mobile Apps, Visual Studio, and Xamarin to simplify the build process.

Web and Mobile Front Ends


8/24/2020
2 min read
This example scenario demonstrates how you can accelerate frontend and business process development with a
drag-and-drop visual designer.

back to top

Networking

Azure Firewall Architecture Guide


5/15/2020
7 min read
Use proven customer engagement practices to design firewalls and proxies in Azure for traditional infrastructure
workloads.

Choose between virtual network peering and VPN gateways


12/13/2019
3 min read
Compares two ways to connect virtual networks in Azure.

Connect an on-premises network to Azure


7/02/2018
2 min read
Compare reference architectures for connecting an on-premises network to Azure.

Connect an on-premises network to Azure using ExpressRoute


10/22/2017
5 min read
Implement a highly available and secure site-to-site network architecture that spans an Azure virtual network and an on-
premises network connected using ExpressRoute with VPN gateway failover.

Connect standalone servers by using Azure Network Adapter


7/23/2020
12 min read
Connect standalone servers on-premises and in other cloud providers networks to Azure networks

Deploy highly available NVAs


12/08/2018
7 min read
Learn how to deploy network virtual appliances for high availability in Azure. This article includes example architectures for
ingress, egress, and both.

Design a hybrid Domain Name System solution with Azure


7/23/2020
8 min read
Hybrid Domain Name System (DNS) solution to resolve names for workloads that are hosted on-premises and in Microsoft
Azure

Extend an on-premises network using ExpressRoute


7/23/2019
13 min read
Implement a secure site-to-site network architecture that spans an Azure virtual network and an on-premises network
connected using Azure ExpressRoute.

Extend an on-premises network using VPN


1/24/2020
11 min read
A secure site-to-site network architecture that spans an Azure virtual network and an on-premises network connected using
a VPN.

Firewall and Application Gateway for virtual networks


6/26/2020
14 min read
Learn about options and best practices for using Azure Firewall and Azure Application Gateway security in virtual networks.

Hub-spoke network topology in Azure


9/30/2020
7 min read
Learn how to implement a hub-spoke topology in Azure, where the hub is a virtual network and the spokes are virtual
networks that peer with the hub.

Hybrid Connection
9/01/2020
2 min read
Azure's Hybrid Connection is a foundational blueprint that is applicable to most Azure Stack Hub solutions, allowing you to
establish connectivity for any application that involves communications between the Azure public cloud and on-premises
Azure Stack Hub components.

Implement a secure hybrid network


1/07/2020
9 min read
See a secure hybrid network that extends an on-premises network to Azure with a perimeter network between the on-
premises network and an Azure virtual network.

Secure and govern workloads with network level segmentation


4/30/2020
7 min read
Using network level segmentation to secure virtual networks.

Troubleshoot a hybrid VPN connection


10/08/2018
8 min read
Troubleshoot a VPN gateway connection between an on-premises network and Azure.

back to top

Security

Azure Active Directory IDaaS in Security Operations


5/20/2020
9 min read
Learn how security operations teams can use Azure Active Directory identity and access features in an adaptive, integrated
zero-trust security architecture.

Centralized app configuration and security


5/13/2020
3 min read
Use Azure App Configuration and Azure Key Vault to create a centralized and secured configuration service for apps.

Computer forensics Chain of Custody in Azure


10/23/2020
10 min read
Create an infrastructure and workflow to ensure a valid digital evidence Chain of Custody (CoC) for computer forensics in
Azure.
Homomorphic encryption with SEAL
7/13/2020
4 min read
Learn about homomorphic encryption, and get an overview and example of how to use the Microsoft SEAL encryption
library.

Hybrid Security Monitoring using Azure Security Center and Azure Sentinel
8/19/2020
15 min read
Integrate monitoring of the security configuration and telemetry of on-premises and Azure operating system workloads.

Real-time fraud detection


7/05/2018
4 min read
Detect fraudulent activity in real-time using Azure Event Hubs and Stream Analytics.

Secure OAuth 2.0 On-Behalf-Of refresh tokens for web services


7/15/2020
5 min read
Store OAuth 2.0 On-Behalf-Of (OBO) refresh tokens securely using Azure Key Vault and Azure Functions managed identity
for key rotation and token refresh.

Securely managed web applications


5/09/2019
8 min read
Learn about deploying secure applications using the Azure App Service Environment, the Azure Application Gateway
service, and Web Application Firewall.

Security considerations for highly sensitive IaaS apps in Azure


4/15/2020
14 min read
Learn about VM security, encryption, NSGs, DMZs, access control, and other security considerations for highly sensitive IaaS
and hybrid apps.
Threat indicators for cyber threat intelligence in Azure Sentinel
4/13/2020
13 min read
Import threat indicators, view logs, create rules to generate security alerts and incidents, and visualize threat intelligence
data with Azure Sentinel.

back to top

Storage

Archive on-premises data to cloud


12/16/2019
1 min read
Archive your on-premises data to Azure Blob storage.

Back up on-premises applications and data to cloud


12/16/2019
2 min read
Back up on-premises applications and data with Azure Backup and Blob storage applications. Read documentation on
implementing these archiving solutions today.

HIPAA and HITRUST compliant health data AI


12/16/2019
2 min read
Manage HIPAA and HITRUST compliant health data and medical records with the highest level of built-in security.

HPC Media Rendering


11/04/2020
2 min read
Optimize the media rendering process with a step-by-step HPC solution architecture from Azure that combines Azure
CycleCloud and HPC Cache.

Hybrid file services


7/28/2020
10 min read
Use Azure File Sync and Azure Files to extend file services hosting capabilities across cloud and on-premises file share
resources.

Medical Data Storage Solutions


12/16/2019
2 min read
Store healthcare data effectively and affordably with cloud-based solutions from Azure. Manage medical records with the
highest level of built-in security.

Using Azure file shares in a hybrid environment


7/26/2020
10 min read
Azure file shares can replace traditional file servers. With identity-based authentication you can control access to Azure file
shares by using AD DS users and groups.

back to top

Web

Architect scalable e-commerce web app


12/16/2019
1 min read
The e-commerce website includes simple order processing workflows with the help of Azure services. Using Azure
Functions and Web Apps, developers can focus on building personalized experiences and let Azure take care of the
infrastructure.

Azure Functions in a hybrid environment


6/29/2020
5 min read
Azure Functions being utilized from on-premises virtual machines

Basic web application


12/12/2017
11 min read
Learn about proven practices for a web application that uses Azure App Service and Azure SQL Database by using this
reference architecture.

Cross Cloud Scaling Architecture


12/16/2019
1 min read
Learn how to improve cross-cloud scalability with a solution architecture that includes Azure Stack. A step-by-step flowchart
details instructions for implementation.

Cross-platform Chat
12/16/2019
1 min read
Accelerate development of reliable, high-performing chat applications

E-commerce front end


7/13/2018
6 min read
This example scenario implements an e-commerce front end using Azure platform as a service tools, which allow you to
handle increases in transactions.

E-commerce website running in secured App Service Environment


12/16/2019
2 min read
Secure your e-commerce website's sensitive user and payment data using the Microsoft Azure App Service Environment.

Highly available SharePoint farm


12/16/2019
2 min read
Learn how to deploy a highly available SharePoint farm for intranet capabilities with a step-by-step solution architecture
template from Azure.

Highly available multi-region web application


8/14/2019
8 min read
Recommended architecture for a highly available web application running in multiple regions in Azure.
Hybrid SharePoint Farm with Microsoft 365
12/16/2019
2 min read
Deliver highly available intranet capability by deploying SharePoint and sharing hybrid workloads with Microsoft 365.
Set up this solution with step-by-step instructions.

Intelligent product search engine for e-commerce


9/14/2018
6 min read
This example scenario shows how using Azure Search can dramatically increase the relevance of search results for your e-
commerce customers.

Migrate a web app using Azure APIM


9/13/2018
5 min read
In this scenario, an e-commerce company in the travel industry migrates a legacy web application by using Azure API
Management.

Multi-region N-tier application


6/18/2019
10 min read
Deploy an application on Azure virtual machines in multiple regions for high availability and resiliency.

Multi-tier web application built for HA/DR


11/16/2018
5 min read
Create a multitier web application built for high availability and disaster recovery on Azure using Azure virtual machines,
availability sets, availability zones, Azure Site Recovery, and Azure Traffic Manager.

SAP S/4 HANA for Large Instances


12/16/2019
3 min read
Learn more about SAP HANA on Azure (Large Instances), which includes high reliability and disaster recovery. Find out how
NFS storage is used for large instances of SAP HANA.

Scalable Episerver marketing website


12/16/2019
2 min read
Let your business run multi-channel digital marketing websites on one platform and spin up and spin down campaigns on
demand. Take advantage of the comprehensive capabilities of Episerver to manage every aspect of your site and campaign
performance.

Scalable Sitecore marketing website


12/16/2019
2 min read
With the Sitecore Experience Platform (xP), you have at your fingertips the complete data, integrated tools, and automation
capabilities to engage your customers throughout an iterative life cycle: the technology foundation necessary to win
customers for life.

Scalable Umbraco CMS web app


12/16/2019
2 min read
A medium-sized Umbraco CMS web app configured to scale, optimal for high-traffic sites. It uses two web apps, one for your
front-end app and the other for your back-office app, deployed in a single region with autoscaling enabled.

Scalable Web Apps


12/16/2019
1 min read
Build scalable web apps that use Azure Cache for Redis to cache session data and user cookies.

Scalable and secure WordPress on Azure


9/18/2018
5 min read
This example shows a highly scalable and secure installation of WordPress. The scenario was used for a large convention
and scaled to meet spike traffic.

Scalable order processing


7/10/2018
6 min read
This example demonstrates a highly scalable, resilient architecture for online order processing, using managed Azure
services such as Cosmos DB and HDInsight.

Scalable web application


10/03/2019
7 min read
Use the proven practices in this reference architecture to improve scalability and performance in an Azure App Service web
application.

Serverless web application


5/28/2019
16 min read
This reference architecture shows a serverless web application, which serves static content from Azure Blob Storage and
implements an API using Azure Functions.

Simple branded website


12/16/2019
2 min read
Quickly build and launch digital campaigns that automatically scale based on customer demand.

Simple digital marketing website


12/16/2019
2 min read
Start simple with the content management system that enables you to easily maintain the messaging on your website in
real-time, from a browser, with no coding skills.

Web application monitoring on Azure


12/12/2018
7 min read
Learn about the monitoring services and a dataflow model for use with multiple data sources. Many monitoring tools and
services work with Azure deployments.

back to top
Windows Virtual Desktop

Windows Virtual Desktop for the enterprise


7/16/2020
8 min read
Explore Windows Virtual Desktop, and learn to build virtual desktop infrastructure solutions at enterprise scale.

back to top
Architecture diagrams, reference architectures, example scenarios, and solutions for common hybrid cloud workloads.
Apps | Data | Identity | Management | Networking | Security

Apps

AI at the Edge with Azure Stack Hub


9/01/2020
2 min read
Move AI models to the edge with a solution architecture that includes Azure Stack Hub.

AI at the Edge with Azure Stack Hub - disconnected


9/01/2020
2 min read
Move AI models to the edge with a solution architecture that includes Azure Stack. A step-by-step workflow will help you
harness the power of edge AI when disconnected from the internet.

Azure Functions in a hybrid environment


6/29/2020
5 min read
Azure Functions being utilized from on-premises virtual machines

Azure Machine Learning decision guide for optimal tool selection


7/07/2020
6 min read
How to choose the best services for building an end-to-end machine learning pipeline from experimentation to deployment.

Cross Cloud Scaling Architecture


12/16/2019
1 min read
Learn how to improve cross-cloud scalability with a solution architecture that includes Azure Stack. A step-by-step flowchart
details instructions for implementation.

Cross-platform Chat
12/16/2019
1 min read
Accelerate development of reliable, high-performing chat applications

Deploy AI and machine learning at the edge by using Azure Stack Edge
7/16/2020
5 min read
Run intelligence workloads directly on IoT devices by using Azure Stack Edge

DevOps in a hybrid environment


12/16/2019
3 min read
The tools provided in Azure allow for the implementation of a DevOps strategy that capably manages both cloud and on-
premises environments in tandem.

DevSecOps in Azure
12/16/2019
4 min read
DevSecOps applies security best practices from the beginning of development, shifting security left from end-of-cycle auditing
into the development process itself.

HPC Risk Analysis Template - Solution Architecture


12/16/2019
3 min read
Build an HPC risk analysis solution architecture with a step-by-step flowchart from Microsoft Azure that combines
CycleCloud, Avere vFXT and TIBCO GridServer.

Hybrid HPC in Azure with HPC Pack


12/16/2019
2 min read
Get a hybrid high performance computing solution built with Windows Server technology. Use Azure HPC Pack to create a
hybrid HPC environment.

Hybrid SharePoint Farm with Microsoft 365


12/16/2019
2 min read
Deliver highly available intranet capability by deploying SharePoint and sharing hybrid workloads with Microsoft 365. Set up
this solution with step-by-step instructions.

Machine Learning Operations maturity model


7/07/2020
2 min read
Detailed explanation of the MLOps maturity model stages and defining characteristics of each stage.

Machine learning operations (MLOps) framework to upscale machine learning Lifecycle with Azure Machine Learning
6/01/2020
26 min read
How the machine learning operations (MLOps) maturity model was applied to implement a machine learning solution for
predicting product shipping levels.

Modern Customer Support Portal


12/16/2019
1 min read
This architecture shows how to easily connect multiple business systems to enable customer support.

Multicloud solutions with the Serverless Framework


3/30/2020
7 min read
Develop and deploy serverless microservices across cloud providers with the open-source Serverless Framework and
Serverless Multicloud Library.

Run containers in a hybrid environment


6/29/2020
6 min read
Run containers from an enterprise container registry on-premises and in Azure

Sharing location in real time using low-cost serverless Azure services


12/09/2019
6 min read
SignalR configured in serverless mode works with an Azure Function that is triggered by Service Bus, all built on .NET Core. This
scenario is best suited for real-time messaging applications where users require a low-cost but robust messaging service.

back to top

Data

Archive on-premises data to cloud


12/16/2019
1 min read
Archive your on-premises data to Azure Blob storage.

Back up on-premises applications and data to cloud


12/16/2019
2 min read
Back up on-premises applications and data with Azure Backup and Blob storage applications. Read documentation on
implementing these archiving solutions today.

Extend your on-premises big data investments with HDInsight


12/16/2019
1 min read
Extend your on-premises big data investments to the cloud and transform your business using the advanced analytics
capabilities of HDInsight.

Hybrid ETL with Azure Data Factory


11/20/2019
5 min read
Hybrid ETL with existing on-premises SQL Server Integration Services (SSIS) deployments and Azure Data Factory.

Hybrid file services


7/28/2020
10 min read
Use Azure File Sync and Azure Files to extend file services hosting capabilities across cloud and on-premises file share
resources.
On-premises data gateway for Azure Logic Apps
7/16/2020
3 min read
Using a data gateway to connect on-premises data sources to Azure Logic Apps

Partitioning in Event Hubs and Kafka


10/02/2020
12 min read
Learn about partitioning in Kafka and Event Hubs with Kafka. See how many partitions to use in ingestion pipelines and
how to assign events to partitions.

Publishing internal APIs to external users


3/12/2019
8 min read
In this scenario, an organization consolidates multiple APIs internally using Azure API Management deployed inside a
Virtual Network.

Using Azure file shares in a hybrid environment


7/26/2020
10 min read
Azure file shares can replace traditional file servers. With identity-based authentication you can control access to Azure file
shares by using AD DS users and groups.

back to top

Identity

Create an AD DS resource forest in Azure


5/02/2018
6 min read
How to create a trusted Active Directory domain in Azure.

Extend on-premises AD FS to Azure


12/18/2018
13 min read
Implement a secure hybrid network architecture with Active Directory Federation Service authorization in Azure.

Hybrid Identity
9/01/2020
1 min read
The Hybrid Identity blueprint enables teams to manage applications and user identity consistently across clouds by using
Azure Stack Hub.

Integrate on-premises AD domains with Azure AD


8/28/2019
13 min read
Implement a secure hybrid network architecture using Azure Active Directory (Azure AD).

Integrate on-premises AD with Azure


7/30/2019
4 min read
Compare reference architectures for integrating on-premises Active Directory with Azure.

back to top

Management

Azure Arc hybrid management and deployment for Kubernetes clusters


7/23/2020
8 min read
Using Azure Arc to register Kubernetes clusters hosted outside of Azure, and using Azure tools to manage these clusters
along with clusters hosted in Azure Kubernetes Service.

Azure Automation Update Management


7/16/2020
14 min read
Manage operating system updates with Update Management in Azure Automation

Azure Automation in a hybrid environment


7/23/2020
13 min read
Azure Automation leveraged for both on-premises and cloud environments

Back up files and applications on Azure Stack Hub


8/06/2020
26 min read
Backup and restoration of files and applications of VM-based workloads

Build high availability into your BCDR strategy


12/16/2019
1 min read
Virtual machines (VMs) are physically separated across zones, and a virtual network is created using load balancers at each
site. These locations are close enough for high availability replication, so your applications stay running, despite any issues
at the physical locations.

Disaster Recovery for Azure Stack Hub virtual machines


8/06/2020
29 min read
Optimized approach to disaster recovery of virtual machine (VM)-based user workloads hosted on Azure Stack Hub

Enterprise-scale disaster recovery


12/16/2019
2 min read
A large enterprise architecture for SharePoint, Dynamics CRM, and Linux web servers hosted on an on-premises datacenter
with failover to Azure infrastructure.

Hybrid availability and performance monitoring


7/27/2020
11 min read
Using Azure Monitor to monitor performance and availability for operating system workloads running on-premises, in
third-party cloud providers, and in Microsoft Azure

Manage configurations for Azure Arc enabled servers


7/28/2020
7 min read
Use Azure Arc to extend Azure Resource Manager capabilities to Windows and Linux machines on any infrastructure.

Manage hybrid Azure workloads using Windows Admin Center


7/29/2020
11 min read
Deploy Windows Admin Center to manage environments on-premises and in Azure.

SMB disaster recovery with Azure Site Recovery


12/16/2019
1 min read
Small and medium businesses can inexpensively implement disaster recovery to the cloud by using Azure Site Recovery or
a partner solution like Double-Take DR.

back to top

Networking

Connect an on-premises network to Azure


7/02/2018
2 min read
Compare reference architectures for connecting an on-premises network to Azure.

Connect an on-premises network to Azure using ExpressRoute


10/22/2017
5 min read
Implement a highly available and secure site-to-site network architecture that spans an Azure virtual network and an on-
premises network connected using ExpressRoute with VPN gateway failover.

Connect standalone servers by using Azure Network Adapter


7/23/2020
12 min read
Connect standalone servers on-premises and in other cloud providers networks to Azure networks
Design a hybrid Domain Name System solution with Azure
7/23/2020
8 min read
Hybrid Domain Name System (DNS) solution to resolve names for workloads that are hosted on-premises and in Microsoft
Azure

Extend an on-premises network using ExpressRoute


7/23/2019
13 min read
Implement a secure site-to-site network architecture that spans an Azure virtual network and an on-premises network
connected using Azure ExpressRoute.

Extend an on-premises network using VPN


1/24/2020
11 min read
A secure site-to-site network architecture that spans an Azure virtual network and an on-premises network connected using
a VPN.

Hub-spoke network topology in Azure


9/30/2020
7 min read
Learn how to implement a hub-spoke topology in Azure, where the hub is a virtual network and the spokes are virtual
networks that peer with the hub.

Hybrid Connection
9/01/2020
2 min read
Azure's Hybrid Connection is a foundational blueprint that is applicable to most Azure Stack Hub solutions, allowing you to
establish connectivity for any application that involves communications between the Azure public cloud and on-premises
Azure Stack Hub components.

Implement a secure hybrid network


1/07/2020
9 min read
See a secure hybrid network that extends an on-premises network to Azure with a perimeter network between the on-
premises network and an Azure virtual network.
Secure and govern workloads with network level segmentation
4/30/2020
7 min read
Using network level segmentation to secure virtual networks.

Troubleshoot a hybrid VPN connection


10/08/2018
8 min read
Troubleshoot a VPN gateway connection between an on-premises network and Azure.

back to top

Security

Hybrid Security Monitoring using Azure Security Center and Azure Sentinel
8/19/2020
15 min read
Integrate monitoring of the security configuration and telemetry of on-premises and Azure operating system workloads.

back to top
What's new in the Azure Architecture Center
11/13/2020 • 4 minutes to read

New and updated articles in the Azure Architecture Center

October 2020
New Articles
SQL Server 2008 R2 failover cluster in Azure
Kafka on Azure
Secure application's configuration and dependencies
Application classification for security
Application threat analysis
AKS triage - cluster health
AKS triage - container registry connectivity
AKS triage - admission controllers
AKS triage - workload deployments
AKS triage - node health
Azure Kubernetes Service (AKS) operations triage
Alerts in IoT Edge Vision
Camera selection for IoT Edge Vision
Hardware for IoT Edge Vision
Image storage in IoT Edge Vision
Azure IoT Edge Vision
Machine learning in IoT Edge Vision
User interface in IoT Edge Vision
Configure infrastructure
Repeatable Infrastructure
Automation overview of goals, best practices, and types in Azure
Automated Tasks
Partitioning in Event Hubs and Kafka
Retail - Buy online, pickup in store (BOPIS)
Magento e-commerce platform in Azure Kubernetes Service (AKS)
Web app private connectivity to Azure SQL database
Updated Articles
Security with identity and access management (IAM) in Azure (#2a1154709)
Regulatory compliance (#2a1154709)
Computer forensics Chain of Custody in Azure (#909b776f4)
Overview of the performance efficiency pillar (#ed89cf6ab)
Azure Kubernetes Service (AKS) solution journey (#b63ab6a9f)
GCP to Azure Services Comparison (#a45091c00)
Resiliency patterns (#b5201626c)
Choosing an Azure compute service (#a64329288)
Security patterns (#13add8a06)
September 2020
New Articles
Administrative account security
Enforce governance to reduce risks
Security management groups
Regulatory compliance
Team roles and responsibilities
Segmentation strategies
Tenancy model for SaaS applications
Migrate IBM mainframe applications to Azure with TmaxSoft OpenFrame
Monitor tools in Azure
Applications and services
Security storage in Azure
Azure Active Directory IDaaS in Security Operations
Capture cost requirements for an Azure
Tradeoffs for costs
Microsoft Azure Well-Architected Framework
Overview of the security pillar
Storage, data, and encryption in Azure
Move Azure resources across regions
Monitor identity risk
FSLogix for the enterprise
Network monitoring
Stromasys Charon-SSP Solaris emulator on Azure VMs
Azure Kubernetes Service (AKS) solution journey
IoT analyze and optimize loops
IoT measure and control loops
IoT monitor and manage loops
Hybrid Architectures
Azure Arc hybrid management and deployment for Kubernetes clusters
Manage configurations for Azure Arc enabled servers
Azure Automation in a hybrid environment
Using Azure file shares in a hybrid environment
Azure Functions in a hybrid environment
Connect standalone servers by using Azure Network Adapter
Back up files and applications on Azure Stack Hub
Disaster Recovery for Azure Stack Hub virtual machines
Azure Automation Update Management
Deploy AI and machine learning at the edge by using Azure Stack Edge
On-premises data gateway for Azure Logic Apps
Run containers in a hybrid environment
Design a hybrid Domain Name System solution with Azure
Hybrid file services
Hybrid availability and performance monitoring
Hybrid Security Monitoring using Azure Security Center and Azure Sentinel
Manage hybrid Azure workloads using Windows Admin Center
DevSecOps in GitHub
Network security review
Network security strategies
Security with identity and access management (IAM) in Azure
Azure Messaging cost estimates
Web application cost estimates
Updated Articles
DevTest and DevOps for IaaS solutions (#a2a167058)
DevTest and DevOps for microservice solutions (#a2a167058)
DevTest and DevOps for PaaS solutions (#a2a167058)
Baseline architecture for an Azure Kubernetes Service (AKS) cluster (#9b20a025d)
DevSecOps in Azure (#511e6ee92)

August 2020
New Articles
Multiple forests with AD DS, Azure AD, and Azure AD DS
Multiple forests with AD DS and Azure AD
Virtual network integrated serverless microservices
Big data analytics with Azure Data Explorer
Content Delivery Network analytics
Azure Data Explorer interactive analytics
IoT analytics with Azure Data Explorer
Azure Data Explorer monitoring
Custom Business Processes
Web and Mobile Front Ends
Line of Business Extension
Attestation, authentication, and provisioning
Field and cloud edge gateways
Condition Monitoring for Industrial IoT
Predictive Maintenance for Industrial IoT
Banking system cloud transformation on Azure
JMeter implementation reference for load testing pipeline solution
Patterns and implementations
IoT connected light, power, and internet for emerging markets
Compute
Scale IoT solutions with application stamps
Builders, developers, and operators
IoT application-to-device commands
IoT solution architecture
IoT solutions conceptual overview
Use cached data
Criteria for choosing a data store
Data store decision tree
Updated Articles
Network security and containment in Azure (#75626e3a7)
Data store cost estimates (#2b0e692f9)
Retry guidance for Azure services (#6c8a169c9)
Chaos engineering (#f742721a9)
Understand data store models (#4fbdd828a)
Cost governance for an Azure workload (#a3452805a)
Event-based cloud automation (#d2cca2011)
Serverless event processing (#d2cca2011)

July 2020
New Articles
Predictive maintenance with the intelligent IoT Edge
Multitenant SaaS on Azure
Saga distributed transactions
Secure OAuth 2.0 On-Behalf-Of refresh tokens for web services
Baseline architecture for an Azure Kubernetes Service (AKS) cluster
Azure Icons
Windows Virtual Desktop for the enterprise
UVEN smart and secure disinfection and lighting
Azure Industrial IoT guidance
Considerations in Azure Industrial IoT solution
Azure Industrial IoT - data analysis
Services in an Azure Industrial IoT solution
DataOps for the modern data warehouse
Unified logging for microservices applications
Homomorphic encryption with SEAL
Azure Machine Learning decision guide for optimal tool selection
Machine Learning Operations maturity model
Machine learning operations (MLOps) framework to upscale machine learning Lifecycle with Azure Machine Learning
Cognizant Safe Buildings with IoT and Azure
Machine teaching with the Microsoft Autonomous Systems platform
Machine teaching
Updated Articles
Compare machine learning products from Microsoft (#84a5aab4e)
Autoscaling guidance (#16d331c41)
Storage, data, and encryption in Azure (#bf81a3ea4)
Monitoring application health for reliability in Azure (#f7d496352)
Failure mode analysis (#f7d496352)
Azure IoT reference architecture (#682844123)
Run a highly available SharePoint Server 2016 farm in Azure (#fefe37ee4)
Azure Application Architecture Guide
11/2/2020 • 3 minutes to read

This guide presents a structured approach for designing applications on Azure that are scalable, secure, resilient,
and highly available. It is based on proven practices that we have learned from customer engagements.

Introduction
The cloud is changing how applications are designed and secured. Instead of monoliths, applications are
decomposed into smaller, decentralized services. These services communicate through APIs or by using
asynchronous messaging or eventing. Applications scale horizontally, adding new instances as demand requires.
These trends bring new challenges. Application state is distributed. Operations are done in parallel and
asynchronously. Applications must be resilient when failures occur. Malicious actors continuously target
applications. Deployments must be automated and predictable. Monitoring and telemetry are critical for gaining
insight into the system. This guide is designed to help you navigate these changes.
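
To make the shift from direct synchronous calls to asynchronous messaging more concrete, the following minimal sketch shows one service enqueueing work and another consuming it at its own pace. It is illustrative only: it assumes the azure-servicebus Python SDK (version 7 or later), and the connection string, queue name, and order payload are placeholders rather than anything prescribed by this guide.

```python
# Minimal sketch of two decomposed services communicating asynchronously
# through a queue. Assumes the azure-servicebus Python SDK (v7+); the
# connection string, queue name, and payload below are placeholders.
import json

from azure.servicebus import ServiceBusClient, ServiceBusMessage

CONNECTION_STRING = "<service-bus-connection-string>"
QUEUE_NAME = "orders"


def submit_order(order: dict) -> None:
    # Producer service: enqueue the order instead of calling the consumer directly.
    with ServiceBusClient.from_connection_string(CONNECTION_STRING) as client:
        with client.get_queue_sender(queue_name=QUEUE_NAME) as sender:
            sender.send_messages(ServiceBusMessage(json.dumps(order)))


def process_orders() -> None:
    # Consumer service: pull messages at its own pace; run more instances to scale out.
    with ServiceBusClient.from_connection_string(CONNECTION_STRING) as client:
        with client.get_queue_receiver(queue_name=QUEUE_NAME, max_wait_time=5) as receiver:
            for message in receiver:
                order = json.loads(str(message))
                print("processing order", order.get("id"))
                receiver.complete_message(message)  # settle only after successful processing


if __name__ == "__main__":
    submit_order({"id": 1, "sku": "widget", "quantity": 3})
    process_orders()
```

Because the producer and consumer share only the queue and the message format, either side can be scaled, redeployed, or temporarily unavailable without breaking the other, which illustrates why asynchronous messaging features so prominently in cloud designs.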

TRADITIONAL ON-PREMISES                | MODERN CLOUD
Monolithic                             | Decomposed
Designed for predictable scalability   | Designed for elastic scale
Relational database                    | Polyglot persistence (mix of storage technologies)
Synchronized processing                | Asynchronous processing
Design to avoid failures (MTBF)        | Design for failure (MTTR)
Occasional large updates               | Frequent small updates
Manual management                      | Automated self-management
Snowflake servers                      | Immutable infrastructure

How this guide is structured


The Azure Application Architecture Guide is organized as a series of steps, from the architecture and design to
implementation. For each step, there is supporting guidance that will help you with the design of your application
architecture.
[Diagram: the guide's flow, from choosing an architecture style (N-tier, microservices, web-queue-worker, event-driven, big compute, big data), to technology choices (compute, data stores, messaging), to application architecture (reference architectures, design principles, design patterns, best practices), assessed against the Microsoft Azure Well-Architected Framework pillars (cost optimization, operational excellence, performance efficiency, reliability, security).]
Architecture styles
The first decision point is the most fundamental. What kind of architecture are you building? It might be a
microservices architecture, a more traditional N-tier application, or a big data solution. We have identified several
distinct architecture styles. There are benefits and challenges to each.
Learn more: Architecture styles

Technology choices
Knowing the type of architecture you are building, now you can start to choose the main technology pieces for the
architecture. The following technology choices are critical:
Compute refers to the hosting model for the computing resources that your applications run on. For more
information, see Choose a compute service.
Data stores include databases but also storage for message queues, caches, logs, and anything else that an
application might persist to storage. For more information, see Choose a data store.
Messaging technologies enable asynchronous messages between components of the system. For more
information, see Choose a messaging service.
You will probably have to make additional technology choices along the way, but these three elements (compute,
data, and messaging) are central to most cloud applications and will determine many aspects of your design.

Design the architecture


Once you have chosen the architecture style and the major technology components, you are ready to tackle the
specific design of your application. Every application is different, but the following resources can help you along the
way:
Reference architectures
Depending on your scenario, one of our reference architectures may be a good starting point. Each reference
architecture includes recommended practices, along with considerations for scalability, availability, security,
resilience, and other aspects of the design. Most also include a deployable solution or reference implementation.
Design principles
We have identified 10 high-level design principles that will make your application more scalable, resilient, and
manageable. These design principles apply to any architecture style. Throughout the design process, keep these 10
high-level design principles in mind. For more information, see Design principles.
Design patterns
Software design patterns are repeatable patterns that are proven to solve specific problems. Our catalog of Cloud
design patterns addresses specific challenges in distributed systems. They address aspects such as availability,
resiliency, performance, and security. You can find our catalog of design patterns here.
Best practices
Our best practices articles cover various design considerations including API design, autoscaling, data partitioning,
caching, and so forth. Review these and apply the best practices that are appropriate for your application.
Security best practices
Our security best practices describe how to ensure that the confidentiality, integrity, and availability of your
application aren't compromised by malicious actors.

Quality pillars
A successful cloud application will focus on five pillars of software quality: Cost optimization, Operational
excellence, Performance efficiency, Reliability, and Security.
Leverage the Microsoft Azure Well-Architected Framework to assess your architecture across these five pillars.

Next steps
Architecture styles
Architecture styles
11/2/2020 • 5 minutes to read

An architecture style is a family of architectures that share certain characteristics. For example, N-tier is a common
architecture style. More recently, microservice architectures have started to gain favor. Architecture styles don't
require the use of particular technologies, but some technologies are well-suited for certain architectures. For
example, containers are a natural fit for microservices.
We have identified a set of architecture styles that are commonly found in cloud applications. The article for each
style includes:
A description and logical diagram of the style.
Recommendations for when to choose this style.
Benefits, challenges, and best practices.
A recommended deployment using relevant Azure services.

A quick tour of the styles


This section gives a quick tour of the architecture styles that we've identified, along with some high-level
considerations for their use. Read more details in the linked topics.
N -tier
N-tier is a traditional architecture for enterprise applications. Dependencies are managed by dividing the
application into layers that perform logical functions, such as presentation, business logic, and data access. A layer
can only call into layers that sit below it. However, this horizontal layering can be a liability. It can be hard to
introduce changes in one part of the application without touching the rest of the application. That makes frequent
updates a challenge, limiting how quickly new features can be added.
N-tier is a natural fit for migrating existing applications that already use a layered architecture. For that reason, N-
tier is most often seen in infrastructure as a service (IaaS) solutions, or applications that use a mix of IaaS and
managed services.
Web-Queue -Worker
For a purely PaaS solution, consider a Web-Queue-Worker architecture. In this style, the application has a web
front end that handles HTTP requests and a back-end worker that performs CPU-intensive tasks or long-running
operations. The front end communicates to the worker through an asynchronous message queue.
Web-queue-worker is suitable for relatively simple domains with some resource-intensive tasks. Like N-tier, the
architecture is easy to understand. The use of managed services simplifies deployment and operations. But with
complex domains, it can be hard to manage dependencies. The front end and the worker can easily become large,
monolithic components that are hard to maintain and update. As with N-tier, this can reduce the frequency of
updates and limit innovation.
Microservices
If your application has a more complex domain, consider moving to a Microser vices architecture. A
microservices application is composed of many small, independent services. Each service implements a single
business capability. Services are loosely coupled, communicating through API contracts.
Each service can be built by a small, focused development team. Individual services can be deployed without a lot
of coordination between teams, which encourages frequent updates. A microservice architecture is more complex
to build and manage than either N-tier or web-queue-worker. It requires a mature development and DevOps
culture. But done right, this style can lead to higher release velocity, faster innovation, and a more resilient
architecture.
Event-driven architecture
Event-Driven Architectures use a publish-subscribe (pub-sub) model, where producers publish events, and
consumers subscribe to them. The producers are independent from the consumers, and consumers are
independent from each other.
Consider an event-driven architecture for applications that ingest and process a large volume of data with very low
latency, such as IoT solutions. The style is also useful when different subsystems must perform different types of
processing on the same event data.
Big Data, Big Compute
Big Data and Big Compute are specialized architecture styles for workloads that fit certain specific profiles. Big
data divides a very large dataset into chunks, performing parallel processing across the entire set, for analysis and
reporting. Big compute, also called high-performance computing (HPC), makes parallel computations across a
large number (thousands) of cores. Domains include simulations, modeling, and 3-D rendering.

Architecture styles as constraints


An architecture style places constraints on the design, including the set of elements that can appear and the
allowed relationships between those elements. Constraints guide the "shape" of an architecture by restricting the
universe of choices. When an architecture conforms to the constraints of a particular style, certain desirable
properties emerge.
For example, the constraints in microservices include:
A service represents a single responsibility.
Every service is independent of the others.
Data is private to the service that owns it. Services do not share data.
By adhering to these constraints, what emerges is a system where services can be deployed independently, faults
are isolated, frequent updates are possible, and it's easy to introduce new technologies into the application.
Before choosing an architecture style, make sure that you understand the underlying principles and constraints of
that style. Otherwise, you can end up with a design that conforms to the style at a superficial level, but does not
achieve the full potential of that style. It's also important to be pragmatic. Sometimes it's better to relax a
constraint, rather than insist on architectural purity.
The following table summarizes how each style manages dependencies, and the types of domain that are best
suited for each.

ARCHITECTURE STYLE        | DEPENDENCY MANAGEMENT                                                             | DOMAIN TYPE

N-tier                    | Horizontal tiers divided by subnet.                                               | Traditional business domain. Frequency of updates is low.
Web-Queue-Worker          | Front-end and back-end jobs, decoupled by async messaging.                       | Relatively simple domain with some resource-intensive tasks.
Microservices             | Vertically (functionally) decomposed services that call each other through APIs. | Complicated domain. Frequent updates.
Event-driven architecture | Producer/consumer. Independent view per subsystem.                               | IoT and real-time systems.
Big data                  | Divide a huge dataset into small chunks. Parallel processing on local datasets.  | Batch and real-time data analysis. Predictive analysis using ML.
Big compute               | Data allocation to thousands of cores.                                            | Compute-intensive domains such as simulation.

Consider challenges and benefits


Constraints also create challenges, so it's important to understand the trade-offs when adopting any of these
styles. Do the benefits of the architecture style outweigh the challenges for this subdomain and bounded context?
Here are some of the types of challenges to consider when selecting an architecture style:
Complexity . Is the complexity of the architecture justified for your domain? Conversely, is the style too
simplistic for your domain? In that case, you risk ending up with a "big ball of mud", because the
architecture does not help you to manage dependencies cleanly.
Asynchronous messaging and eventual consistency . Asynchronous messaging can be used to
decouple services, and increase reliability (because messages can be retried) and scalability. However, this
also creates challenges in handling eventual consistency, as well as the possibility of duplicate messages.
Inter-ser vice communication . As you decompose an application into separate services, there is a risk
that communication between services will cause unacceptable latency or create network congestion (for
example, in a microservices architecture).
Manageability . How hard is it to manage the application, monitor, deploy updates, and so on?
Big compute architecture style
11/2/2020 • 3 minutes to read

The term big compute describes large-scale workloads that require a large number of cores, often numbering in
the hundreds or thousands. Scenarios include image rendering, fluid dynamics, financial risk modeling, oil
exploration, drug design, and engineering stress analysis, among others.

Here are some typical characteristics of big compute applications:


The work can be split into discrete tasks, which can be run across many cores simultaneously.
Each task is finite. It takes some input, does some processing, and produces output. The entire application runs
for a finite amount of time (minutes to days). A common pattern is to provision a large number of cores in a
burst, and then spin down to zero once the application completes.
The application does not need to stay up 24/7. However, the system must handle node failures or application
crashes.
For some applications, tasks are independent and can run in parallel. In other cases, tasks are tightly coupled,
meaning they must interact or exchange intermediate results. In that case, consider using high-speed
networking technologies such as InfiniBand and remote direct memory access (RDMA).
Depending on your workload, you might use compute-intensive VM sizes (H16r, H16mr, and A9).

When to use this architecture


Computationally intensive operations such as simulation and number crunching.
Simulations that are computationally intensive and must be split across CPUs in multiple computers (10-
1000s).
Simulations that require too much memory for one computer, and must be split across multiple computers.
Long-running computations that would take too long to complete on a single computer.
Smaller computations that must be run 100s or 1000s of times, such as Monte Carlo simulations.

Benefits
High performance with "embarrassingly parallel" processing.
Can harness hundreds or thousands of computer cores to solve large problems faster.
Access to specialized high-performance hardware, with dedicated high-speed InfiniBand networks.
You can provision VMs as needed to do work, and then tear them down.

Challenges
Managing the VM infrastructure.
Managing the volume of number crunching.
Provisioning thousands of cores in a timely manner.
For tightly coupled tasks, adding more cores can have diminishing returns. You may need to experiment to find
the optimum number of cores.

Big compute using Azure Batch


Azure Batch is a managed service for running large-scale high-performance computing (HPC) applications.
Using Azure Batch, you configure a VM pool, and upload the applications and data files. Then the Batch service
provisions the VMs, assigns tasks to the VMs, runs the tasks, and monitors the progress. Batch can automatically
scale out the VMs in response to the workload. Batch also provides job scheduling.
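As a rough sketch of that flow, the example below uses the Azure Batch Python SDK (azure-batch) to configure a pool, create a job, and submit tasks. The account name, URL, VM image, pool size, and the render.py command line are hypothetical placeholders, and parameter names can differ slightly between SDK versions.

```python
# A minimal sketch of the Azure Batch flow described above (azure-batch Python SDK).
# Account name, key, region, image, and the render.py command are hypothetical placeholders.
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.<region>.batch.azure.com")

# 1. Configure a VM pool.
client.pool.add(batchmodels.PoolAddParameter(
    id="hpc-pool",
    vm_size="STANDARD_D2_V3",
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="canonical",
            offer="0001-com-ubuntu-server-focal",
            sku="20_04-lts",
            version="latest"),
        node_agent_sku_id="batch.node.ubuntu 20.04"),
    target_dedicated_nodes=4))

# 2. Create a job bound to the pool.
client.job.add(batchmodels.JobAddParameter(
    id="render-job",
    pool_info=batchmodels.PoolInformation(pool_id="hpc-pool")))

# 3. Submit the discrete tasks; Batch schedules them across the pool and monitors progress.
tasks = [
    batchmodels.TaskAddParameter(
        id=f"frame-{i}",
        command_line=f"/bin/bash -c 'python3 render.py --frame {i}'")
    for i in range(100)
]
client.task.add_collection("render-job", tasks)
```

The pool can also be configured with an autoscale formula instead of a fixed node count, so that nodes spin up for the burst and scale back to zero when the job completes.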

Big compute running on Virtual Machines


You can use Microsoft HPC Pack to administer a cluster of VMs, and schedule and monitor HPC jobs. With this
approach, you must provision and manage the VMs and network infrastructure. Consider this approach if you
have existing HPC workloads and want to move some or all of them to Azure. You can move the entire HPC cluster to
Azure, or you can keep your HPC cluster on-premises but use Azure for burst capacity. For more information, see
Batch and HPC solutions for large-scale computing workloads.
HPC Pack deployed to Azure
In this scenario, the HPC cluster is created entirely within Azure.
The head node provides management and job scheduling services to the cluster. For tightly coupled tasks, use an
RDMA network that provides very high bandwidth, low latency communication between VMs. For more
information, see Deploy an HPC Pack 2016 cluster in Azure.
Burst an HPC cluster to Azure
In this scenario, an organization is running HPC Pack on-premises, and uses Azure VMs for burst capacity. The
cluster head node is on-premises. ExpressRoute or VPN Gateway connects the on-premises network to the Azure
VNet.
Big data architecture style
11/2/2020 • 10 minutes to read

A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or
complex for traditional database systems.

(Diagram: data sources flow into two paths. A batch path stores data and runs batch processing; a real-time path ingests messages and runs stream processing. Both paths feed an analytical data store used for analytics and reporting, with orchestration coordinating the workflow.)

Big data solutions typically involve one or more of the following types of workload:
Batch processing of big data sources at rest.
Real-time processing of big data in motion.
Interactive exploration of big data.
Predictive analytics and machine learning.
Most big data architectures include some or all of the following components:
Data sources : All big data solutions start with one or more data sources. Examples include:
Application data stores, such as relational databases.
Static files produced by applications, such as web server log files.
Real-time data sources, such as IoT devices.
Data storage : Data for batch processing operations is typically stored in a distributed file store that can
hold high volumes of large files in various formats. This kind of store is often called a data lake. Options for
implementing this storage include Azure Data Lake Store or blob containers in Azure Storage.
Batch processing : Because the data sets are so large, often a big data solution must process data files
using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually these
jobs involve reading source files, processing them, and writing the output to new files. Options include
running U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom Map/Reduce jobs in an
HDInsight Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight Spark cluster.
Real-time message ingestion : If the solution includes real-time sources, the architecture must include a
way to capture and store real-time messages for stream processing. This might be a simple data store,
where incoming messages are dropped into a folder for processing. However, many solutions need a
message ingestion store to act as a buffer for messages, and to support scale-out processing, reliable
delivery, and other message queuing semantics. Options include Azure Event Hubs, Azure IoT Hubs, and
Kafka.
Stream processing : After capturing real-time messages, the solution must process them by filtering,
aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an
output sink. Azure Stream Analytics provides a managed stream processing service based on perpetually
running SQL queries that operate on unbounded streams. You can also use open source Apache streaming
technologies like Storm and Spark Streaming in an HDInsight cluster.
Analytical data store : Many big data solutions prepare data for analysis and then serve the processed
data in a structured format that can be queried using analytical tools. The analytical data store used to serve
these queries can be a Kimball-style relational data warehouse, as seen in most traditional business
intelligence (BI) solutions. Alternatively, the data could be presented through a low-latency NoSQL
technology such as HBase, or an interactive Hive database that provides a metadata abstraction over data
files in the distributed data store. Azure Synapse Analytics provides a managed service for large-scale,
cloud-based data warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL, which can also
be used to serve data for analysis.
Analysis and repor ting : The goal of most big data solutions is to provide insights into the data through
analysis and reporting. To empower users to analyze the data, the architecture may include a data modeling
layer, such as a multidimensional OLAP cube or tabular data model in Azure Analysis Services. It might also
support self-service BI, using the modeling and visualization technologies in Microsoft Power BI or Microsoft
Excel. Analysis and reporting can also take the form of interactive data exploration by data scientists or data
analysts. For these scenarios, many Azure services support analytical notebooks, such as Jupyter, enabling
these users to leverage their existing skills with Python or R. For large-scale data exploration, you can use
Microsoft R Server, either standalone or with Spark.
Orchestration : Most big data solutions consist of repeated data processing operations, encapsulated in
workflows, that transform source data, move data between multiple sources and sinks, load the processed
data into an analytical data store, or push the results straight to a report or dashboard. To automate these
workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie and Sqoop.
Azure includes many services that can be used in a big data architecture. They fall roughly into two categories:
Managed services, including Azure Data Lake Store, Azure Data Lake Analytics, Azure Synapse Analytics, Azure
Stream Analytics, Azure Event Hub, Azure IoT Hub, and Azure Data Factory.
Open source technologies based on the Apache Hadoop platform, including HDFS, HBase, Hive, Pig, Spark,
Storm, Oozie, Sqoop, and Kafka. These technologies are available on Azure in the Azure HDInsight service.
These options are not mutually exclusive, and many solutions combine open source technologies with Azure
services.
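To make the ingestion path concrete, the hypothetical sketch below publishes telemetry events to Azure Event Hubs with the azure-eventhub Python SDK; the connection string, hub name, and payload fields are placeholders. A stream processor (for example, Stream Analytics or Spark Streaming) would then read from the same hub.

```python
# A minimal sketch of real-time message ingestion into Azure Event Hubs.
# The connection string, event hub name, and payload fields are hypothetical placeholders.
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-connection-string>",
    eventhub_name="telemetry")

# Batch events so they are delivered to the ingestion store efficiently and reliably.
batch = producer.create_batch()
batch.add(EventData(json.dumps({"deviceId": "sensor-01", "temperature": 21.7})))
batch.add(EventData(json.dumps({"deviceId": "sensor-02", "temperature": 19.4})))
producer.send_batch(batch)
producer.close()
```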

When to use this architecture


Consider this architecture style when you need to:
Store and process data in volumes too large for a traditional database.
Transform unstructured data for analysis and reporting.
Capture, process, and analyze unbounded streams of data in real time, or with low latency.
Use Azure Machine Learning or Microsoft Cognitive Services.

Benefits
Technology choices . You can mix and match Azure managed services and Apache technologies in HDInsight
clusters, to capitalize on existing skills or technology investments.
Performance through parallelism . Big data solutions take advantage of parallelism, enabling high-
performance solutions that scale to large volumes of data.
Elastic scale . All of the components in the big data architecture support scale-out provisioning, so that you can
adjust your solution to small or large workloads, and pay only for the resources that you use.
Interoperability with existing solutions . The components of the big data architecture are also used for IoT
processing and enterprise BI solutions, enabling you to create an integrated solution across data workloads.

Challenges
Complexity . Big data solutions can be extremely complex, with numerous components to handle data
ingestion from multiple data sources. It can be challenging to build, test, and troubleshoot big data processes.
Moreover, there may be a large number of configuration settings across multiple systems that must be used in
order to optimize performance.
Skillset . Many big data technologies are highly specialized, and use frameworks and languages that are not
typical of more general application architectures. On the other hand, big data technologies are evolving new
APIs that build on more established languages. For example, the U-SQL language in Azure Data Lake Analytics is
based on a combination of Transact-SQL and C#. Similarly, SQL-based APIs are available for Hive, HBase, and
Spark.
Technology maturity . Many of the technologies used in big data are evolving. While core Hadoop
technologies such as Hive and Pig have stabilized, emerging technologies such as Spark introduce extensive
changes and enhancements with each new release. Managed services such as Azure Data Lake Analytics and
Azure Data Factory are relatively young, compared with other Azure services, and will likely evolve over time.
Security . Big data solutions usually rely on storing all static data in a centralized data lake. Securing access to
this data can be challenging, especially when the data must be ingested and consumed by multiple applications
and platforms.

Best practices
Leverage parallelism . Most big data processing technologies distribute the workload across multiple
processing units. This requires that static data files are created and stored in a splittable format. Distributed
file systems such as HDFS can optimize read and write performance, and the actual processing is performed
by multiple cluster nodes in parallel, which reduces overall job times.
Par tition data . Batch processing usually happens on a recurring schedule — for example, weekly or
monthly. Partition data files, and data structures such as tables, based on temporal periods that match the
processing schedule. That simplifies data ingestion and job scheduling, and makes it easier to troubleshoot
failures. Also, partitioning tables that are used in Hive, U-SQL, or SQL queries can significantly improve
query performance.
Apply schema-on-read semantics . Using a data lake lets you combine storage for files in multiple
formats, whether structured, semi-structured, or unstructured. Use schema-on-read semantics, which
project a schema onto the data when the data is processed, not when the data is stored. This builds
flexibility into the solution, and prevents bottlenecks during data ingestion caused by data validation and
type checking. (A short sketch of this approach follows this list.)
Process data in-place . Traditional BI solutions often use an extract, transform, and load (ETL) process to
move data into a data warehouse. With larger volumes of data, and a greater variety of formats, big data
solutions generally use variations of ETL, such as transform, extract, and load (TEL). With this approach, the
data is processed within the distributed data store, transforming it to the required structure, before moving
the transformed data into an analytical data store.
Balance utilization and time costs . For batch processing jobs, it's important to consider two factors: The
per-unit cost of the compute nodes, and the per-minute cost of using those nodes to complete the job. For
example, a batch job may take eight hours with four cluster nodes. However, it might turn out that the job
uses all four nodes only during the first two hours, and after that, only two nodes are required. In that case,
running the entire job on two nodes would increase the total job time, but would not double it, so the total
cost would be less. In some business scenarios, a longer processing time may be preferable to the higher
cost of using underutilized cluster resources.
Separate cluster resources . When deploying HDInsight clusters, you will normally achieve better
performance by provisioning separate cluster resources for each type of workload. For example, although
Spark clusters include Hive, if you need to perform extensive processing with both Hive and Spark, you
should consider deploying separate dedicated Spark and Hadoop clusters. Similarly, if you are using HBase
and Storm for low latency stream processing and Hive for batch processing, consider separate clusters for
Storm, HBase, and Hadoop.
Orchestrate data ingestion . In some cases, existing business applications may write data files for batch
processing directly into Azure storage blob containers, where they can be consumed by HDInsight or Azure
Data Lake Analytics. However, you will often need to orchestrate the ingestion of data from on-premises or
external data sources into the data lake. Use an orchestration workflow or pipeline, such as those supported
by Azure Data Factory or Oozie, to achieve this in a predictable and centrally manageable fashion.
Scrub sensitive data early . The data ingestion workflow should scrub sensitive data early in the process,
to avoid storing it in the data lake.
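As one possible way to apply the schema-on-read practice above, the hypothetical PySpark sketch below projects a schema onto raw JSON files in the data lake at read time; the storage path and field names are placeholders.

```python
# A sketch of schema-on-read: the schema is projected onto raw data-lake files when they are
# read, not enforced when the data is stored. Paths and field names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

telemetry_schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType()),
    StructField("eventTime", TimestampType()),
])

raw = (spark.read
       .schema(telemetry_schema)  # applied at read time, so ingestion stays a fast file copy
       .json("abfss://datalake@<storageaccount>.dfs.core.windows.net/telemetry/2020/*/*.json"))

raw.createOrReplaceTempView("telemetry")
daily_avg = spark.sql(
    "SELECT deviceId, to_date(eventTime) AS day, AVG(temperature) AS avg_temp "
    "FROM telemetry GROUP BY deviceId, to_date(eventTime)")
```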

IoT architecture
Internet of Things (IoT) is a specialized subset of big data solutions. The following diagram shows a possible logical
architecture for IoT. The diagram emphasizes the event-streaming components of the architecture.

The cloud gateway ingests device events at the cloud boundary, using a reliable, low latency messaging system.
Devices might send events directly to the cloud gateway, or through a field gateway . A field gateway is a
specialized device or software, usually colocated with the devices, that receives events and forwards them to the
cloud gateway. The field gateway might also preprocess the raw device events, performing functions such as
filtering, aggregation, or protocol transformation.
After ingestion, events go through one or more stream processors that can route the data (for example, to
storage) or perform analytics and other processing.
The following are some common types of processing. (This list is certainly not exhaustive.)
Writing event data to cold storage, for archiving or batch analytics.
Hot path analytics, analyzing the event stream in (near) real time, to detect anomalies, recognize patterns
over rolling time windows, or trigger alerts when a specific condition occurs in the stream.
Handling special types of non-telemetry messages from devices, such as notifications and alarms.
Machine learning.
The boxes that are shaded gray show components of an IoT system that are not directly related to event streaming,
but are included here for completeness.
The device registr y is a database of the provisioned devices, including the device IDs and usually device
metadata, such as location.
The provisioning API is a common external interface for provisioning and registering new devices.
Some IoT solutions allow command and control messages to be sent to devices.

This section has presented a very high-level view of IoT, and there are many subtleties and challenges to
consider. For a more detailed reference architecture and discussion, see the Microsoft Azure IoT Reference
Architecture (PDF download).

Next steps
Learn more about big data architectures.
Event-driven architecture style
11/13/2020 • 3 minutes to read

An event-driven architecture consists of event producers that generate a stream of events, and event
consumers that listen for the events.

(Diagram: event producers send events to an event ingestion service, which delivers them to multiple event consumers.)

Events are delivered in near real time, so consumers can respond immediately to events as they occur. Producers
are decoupled from consumers — a producer doesn't know which consumers are listening. Consumers are also
decoupled from each other, and every consumer sees all of the events. This differs from a Competing Consumers
pattern, where consumers pull messages from a queue and a message is processed just once (assuming no
errors). In some systems, such as IoT, events must be ingested at very high volumes.
An event driven architecture can use a pub/sub model or an event stream model.
Pub/sub : The messaging infrastructure keeps track of subscriptions. When an event is published, it sends
the event to each subscriber. After an event is received, it cannot be replayed, and new subscribers do not
see the event.
Event streaming : Events are written to a log. Events are strictly ordered (within a partition) and durable.
Clients don't subscribe to the stream, instead a client can read from any part of the stream. The client is
responsible for advancing its position in the stream. That means a client can join at any time, and can
replay events.
On the consumer side, there are some common variations:
Simple event processing . An event immediately triggers an action in the consumer. For example, you
could use Azure Functions with a Service Bus trigger, so that a function executes whenever a message is
published to a Service Bus topic.
Complex event processing . A consumer processes a series of events, looking for patterns in the event
data, using a technology such as Azure Stream Analytics or Apache Storm. For example, you could
aggregate readings from an embedded device over a time window, and generate a notification if the
moving average crosses a certain threshold.
Event stream processing . Use a data streaming platform, such as Azure IoT Hub or Apache Kafka, as a
pipeline to ingest events and feed them to stream processors. The stream processors act to process or
transform the stream. There may be multiple stream processors for different subsystems of the
application. This approach is a good fit for IoT workloads.
The source of the events may be external to the system, such as physical devices in an IoT solution. In that case,
the system must be able to ingest the data at the volume and throughput that is required by the data source.
In the logical diagram above, each type of consumer is shown as a single box. In practice, it's common to have
multiple instances of a consumer, to avoid having the consumer become a single point of failure in the system.
Multiple instances might also be necessary to handle the volume and frequency of events. Also, a single
consumer might process events on multiple threads. This can create challenges if events must be processed in
order or require exactly-once semantics. See Minimize Coordination.
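As a concrete illustration of the simple event processing variation described above, the sketch below uses an Azure Function with a Service Bus topic trigger (Python v2 programming model). The topic, subscription, connection setting, and payload fields are hypothetical placeholders.

```python
# A sketch of simple event processing: the function runs whenever an event is published to a
# Service Bus topic. Topic, subscription, connection setting, and fields are hypothetical.
import json
import logging

import azure.functions as func

app = func.FunctionApp()

@app.service_bus_topic_trigger(
    arg_name="msg",
    topic_name="device-events",
    subscription_name="alerting",
    connection="SERVICEBUS_CONNECTION")
def handle_device_event(msg: func.ServiceBusMessage):
    event = json.loads(msg.get_body().decode("utf-8"))
    # React immediately; keep the handler idempotent in case the message is redelivered.
    if event.get("temperature", 0) > 75:
        logging.warning("High temperature reported by device %s", event.get("deviceId"))
```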

When to use this architecture


Multiple subsystems must process the same events.
Real-time processing with minimum time lag.
Complex event processing, such as pattern matching or aggregation over time windows.
High volume and high velocity of data, such as IoT.

Benefits
Producers and consumers are decoupled.
No point-to-point integrations. It's easy to add new consumers to the system.
Consumers can respond to events immediately as they arrive.
Highly scalable and distributed.
Subsystems have independent views of the event stream.

Challenges
Guaranteed delivery. In some systems, especially in IoT scenarios, it's crucial to guarantee that events are
delivered.
Processing events in order or exactly once. Each consumer type typically runs in multiple instances, for
resiliency and scalability. This can create a challenge if the events must be processed in order (within a
consumer type), or if the processing logic is not idempotent.
Additional considerations
The amount of data to include in an event can be a significant consideration that affects both performance and
cost. Putting all the relevant information needed for processing in the event itself can simplify the processing
code and save additional lookups. Putting the minimal amount of information in an event, like just a couple of
identifiers, will reduce transport time and cost, but requires the processing code to look up any additional
information it needs.
Microservices architecture style
11/2/2020 • 5 minutes to read

A microservices architecture consists of a collection of small, autonomous services. Each service is self-contained
and should implement a single business capability.

What are microservices?


Microservices are small, independent, and loosely coupled. A single small team of developers can write and
maintain a service.
Each service is a separate codebase, which can be managed by a small development team.
Services can be deployed independently. A team can update an existing service without rebuilding and
redeploying the entire application.
Services are responsible for persisting their own data or external state. This differs from the traditional
model, where a separate data layer handles data persistence.
Services communicate with each other by using well-defined APIs. Internal implementation details of each
service are hidden from other services.
Services don't need to share the same technology stack, libraries, or frameworks.
Besides the services themselves, some other components appear in a typical microservices architecture:
Management/orchestration . This component is responsible for placing services on nodes, identifying failures,
rebalancing services across nodes, and so forth. Typically this component is an off-the-shelf technology such as
Kubernetes, rather than something custom built.
API Gateway . The API gateway is the entry point for clients. Instead of calling services directly, clients call the API
gateway, which forwards the call to the appropriate services on the back end.
Advantages of using an API gateway include:
It decouples clients from services. Services can be versioned or refactored without needing to update all of
the clients.
Services can use messaging protocols that are not web friendly, such as AMQP.
The API Gateway can perform other cross-cutting functions such as authentication, logging, SSL
termination, and load balancing.

Benefits
Agility. Because microservices are deployed independently, it's easier to manage bug fixes and feature
releases. You can update a service without redeploying the entire application, and roll back an update if
something goes wrong. In many traditional applications, if a bug is found in one part of the application, it
can block the entire release process. New features may be held up waiting for a bug fix to be integrated,
tested, and published.
Small, focused teams . A microservice should be small enough that a single feature team can build, test,
and deploy it. Small team sizes promote greater agility. Large teams tend to be less productive, because
communication is slower, management overhead goes up, and agility diminishes.
Small code base . In a monolithic application, there is a tendency over time for code dependencies to
become tangled. Adding a new feature requires touching code in a lot of places. By not sharing code or data
stores, a microservices architecture minimizes dependencies, and that makes it easier to add new features.
Mix of technologies . Teams can pick the technology that best fits their service, using a mix of technology
stacks as appropriate.
Fault isolation . If an individual microservice becomes unavailable, it won't disrupt the entire application,
as long as any upstream microservices are designed to handle faults correctly (for example, by
implementing circuit breaking).
Scalability . Services can be scaled independently, letting you scale out subsystems that require more
resources, without scaling out the entire application. Using an orchestrator such as Kubernetes or Service
Fabric, you can pack a higher density of services onto a single host, which allows for more efficient
utilization of resources.
Data isolation . It is much easier to perform schema updates, because only a single microservice is
affected. In a monolithic application, schema updates can become very challenging, because different parts
of the application may all touch the same data, making any alterations to the schema risky.

Challenges
The benefits of microservices don't come for free. Here are some of the challenges to consider before embarking
on a microservices architecture.
Complexity . A microservices application has more moving parts than the equivalent monolithic
application. Each service is simpler, but the entire system as a whole is more complex.
Development and testing . Writing a small service that relies on other dependent services requires a
different approach than writing a traditional monolithic or layered application. Existing tools are not
always designed to work with service dependencies. Refactoring across service boundaries can be difficult.
It is also challenging to test service dependencies, especially when the application is evolving quickly.
Lack of governance . The decentralized approach to building microservices has advantages, but it can
also lead to problems. You may end up with so many different languages and frameworks that the
application becomes hard to maintain. It may be useful to put some project-wide standards in place,
without overly restricting teams' flexibility. This especially applies to cross-cutting functionality such as
logging.
Network congestion and latency . The use of many small, granular services can result in more
interservice communication. Also, if the chain of service dependencies gets too long (service A calls B,
which calls C...), the additional latency can become a problem. You will need to design APIs carefully. Avoid
overly chatty APIs, think about serialization formats, and look for places to use asynchronous
communication patterns.
Data integrity . Each microservice is responsible for its own data persistence. As a result, data
consistency can be a challenge. Embrace eventual consistency where possible.
Management . Being successful with microservices requires a mature DevOps culture. Correlated logging
across services can be challenging. Typically, logging must correlate multiple service calls for a single user
operation.
Versioning . Updates to a service must not break services that depend on it. Multiple services could be
updated at any given time, so without careful design, you might have problems with backward or forward
compatibility.
Skillset . Microservices are highly distributed systems. Carefully evaluate whether the team has the skills
and experience to be successful.

Best practices
Model services around the business domain.
Decentralize everything. Individual teams are responsible for designing and building services. Avoid
sharing code or data schemas.
Data storage should be private to the service that owns the data. Use the best storage for each service and
data type.
Services communicate through well-designed APIs. Avoid leaking implementation details. APIs should
model the domain, not the internal implementation of the service.
Avoid coupling between services. Causes of coupling include shared database schemas and rigid
communication protocols.
Offload cross-cutting concerns, such as authentication and SSL termination, to the gateway.
Keep domain knowledge out of the gateway. The gateway should handle and route client requests without
any knowledge of the business rules or domain logic. Otherwise, the gateway becomes a dependency and
can cause coupling between services.
Services should have loose coupling and high functional cohesion. Functions that are likely to change
together should be packaged and deployed together. If they reside in separate services, those services end
up being tightly coupled, because a change in one service will require updating the other service. Overly
chatty communication between two services may be a symptom of tight coupling and low cohesion.
Isolate failures. Use resiliency strategies to prevent failures within a service from cascading. See Resiliency
patterns and Designing reliable applications.

Next steps
For detailed guidance about building a microservices architecture on Azure, see Designing, building, and
operating microservices on Azure.
N-tier architecture style
11/2/2020 • 5 minutes to read

An N-tier architecture divides an application into logical layers and physical tiers .

(Diagram: a client connects through a WAF to the web tier, which uses a cache and calls two middle tiers; the middle tiers communicate through messaging, call a remote service, and access the data tier.)

Layers are a way to separate responsibilities and manage dependencies. Each layer has a specific responsibility. A
higher layer can use services in a lower layer, but not the other way around.
Tiers are physically separated, running on separate machines. A tier can call to another tier directly, or use
asynchronous messaging (message queue). Although each layer might be hosted in its own tier, that's not
required. Several layers might be hosted on the same tier. Physically separating the tiers improves scalability and
resiliency, but also adds latency from the additional network communication.
A traditional three-tier application has a presentation tier, a middle tier, and a database tier. The middle tier is
optional. More complex applications can have more than three tiers. The diagram above shows an application with
two middle tiers, encapsulating different areas of functionality.
An N-tier application can have a closed layer architecture or an open layer architecture :
In a closed layer architecture, a layer can only call the next layer immediately down.
In an open layer architecture, a layer can call any of the layers below it.
A closed layer architecture limits the dependencies between layers. However, it might create unnecessary network
traffic, if one layer simply passes requests along to the next layer.

When to use this architecture


N-tier architectures are typically implemented as infrastructure as a service (IaaS) applications, with each tier
running on a separate set of VMs. However, an N-tier application doesn't need to be pure IaaS. Often, it's
advantageous to use managed services for some parts of the architecture, particularly caching, messaging, and
data storage.
Consider an N-tier architecture for:
Simple web applications.
Migrating an on-premises application to Azure with minimal refactoring.
Unified development of on-premises and cloud applications.
N-tier architectures are very common in traditional on-premises applications, so it's a natural fit for migrating
existing workloads to Azure.

Benefits
Portability between cloud and on-premises, and between cloud platforms.
Less learning curve for most developers.
Natural evolution from the traditional application model.
Open to heterogeneous environments (Windows/Linux).

Challenges
It's easy to end up with a middle tier that just does CRUD operations on the database, adding extra latency
without doing any useful work.
Monolithic design prevents independent deployment of features.
Managing an IaaS application is more work than an application that uses only managed services.
It can be difficult to manage network security in a large system.

Best practices
Use autoscaling to handle changes in load. See Autoscaling best practices.
Use asynchronous messaging to decouple tiers.
Cache semistatic data. See Caching best practices.
Configure the database tier for high availability, using a solution such as SQL Server Always On availability
groups.
Place a web application firewall (WAF) between the front end and the Internet.
Place each tier in its own subnet, and use subnets as a security boundary.
Restrict access to the data tier, by allowing requests only from the middle tier(s).

N-tier architecture on virtual machines


This section describes a recommended N-tier architecture running on VMs.

Each tier consists of two or more VMs, placed in an availability set or virtual machine scale set. Multiple VMs
provide resiliency in case one VM fails. Load balancers are used to distribute requests across the VMs in a tier. A
tier can be scaled horizontally by adding more VMs to the pool.
Each tier is also placed inside its own subnet, meaning their internal IP addresses fall within the same address
range. That makes it easy to apply network security group rules and route tables to individual tiers.
The web and business tiers are stateless. Any VM can handle any request for that tier. The data tier should consist
of a replicated database. For Windows, we recommend SQL Server, using Always On availability groups for high
availability. For Linux, choose a database that supports replication, such as Apache Cassandra.
Network security groups restrict access to each tier. For example, the database tier only allows access from the
business tier.

NOTE
The layer labeled "Business Tier" in our reference diagram is a moniker for the business logic tier. Likewise, we also call the
presentation tier the "Web Tier." In our example, this is a web application, though multi-tier architectures can be used for
other topologies as well (like desktop apps). Name your tiers what works best for your team to communicate the intent of
that logical and/or physical tier in your application - you could even express that naming in resources you choose to
represent that tier (e.g. vmss-appName-business-layer).

For more information about running N-tier applications on Azure:


Run Windows VMs for an N-tier application
Windows N-tier application on Azure with SQL Server
Microsoft Learn module: Tour the N-tier architecture style
Azure Bastion
Additional considerations
N-tier architectures are not restricted to three tiers. For more complex applications, it is common to have
more tiers. In that case, consider using layer-7 routing to route requests to a particular tier.
Tiers are the boundary of scalability, reliability, and security. Consider having separate tiers for services with
different requirements in those areas.
Use virtual machine scale sets for autoscaling.
Look for places in the architecture where you can use a managed service without significant refactoring. In
particular, look at caching, messaging, storage, and databases.
For higher security, place a network DMZ in front of the application. The DMZ includes network virtual
appliances (NVAs) that implement security functionality such as firewalls and packet inspection. For more
information, see Network DMZ reference architecture.
For high availability, place two or more NVAs in an availability set, with an external load balancer to
distribute Internet requests across the instances. For more information, see Deploy highly available
network virtual appliances.
Do not allow direct RDP or SSH access to VMs that are running application code. Instead, operators should
log into a jumpbox, also called a bastion host. This is a VM on the network that administrators use to
connect to the other VMs. The jumpbox has a network security group that allows RDP or SSH only from
approved public IP addresses.
You can extend the Azure virtual network to your on-premises network using a site-to-site virtual private
network (VPN) or Azure ExpressRoute. For more information, see Hybrid network reference architecture.
If your organization uses Active Directory to manage identity, you may want to extend your Active
Directory environment to the Azure VNet. For more information, see Identity management reference
architecture.
If you need higher availability than the Azure SLA for VMs provides, replicate the application across two
regions and use Azure Traffic Manager for failover. For more information, see Run Windows VMs in
multiple regions or Run Linux VMs in multiple regions.
Web-Queue-Worker architecture style
11/2/2020 • 3 minutes to read

The core components of this architecture are a web front end that serves client requests, and a worker that
performs resource-intensive tasks, long-running workflows, or batch jobs. The web front end communicates with
the worker through a message queue .
(Diagram: a client calls the web front end, with static content served through a CDN; the front end uses a cache and an identity provider and puts messages on a queue; the worker reads the queue, calls remote services, and reads and writes the database.)

Other components that are commonly incorporated into this architecture include:
One or more databases.
A cache to store values from the database for quick reads.
A CDN to serve static content.
Remote services, such as email or SMS service. Often these are provided by third parties.
Identity provider for authentication.
The web and worker are both stateless. Session state can be stored in a distributed cache. Any long-running work
is done asynchronously by the worker. The worker can be triggered by messages on the queue, or run on a
schedule for batch processing. The worker is an optional component. If there are no long-running operations, the
worker can be omitted.
The front end might consist of a web API. On the client side, the web API can be consumed by a single-page
application that makes AJAX calls, or by a native client application.

When to use this architecture


The Web-Queue-Worker architecture is typically implemented using managed compute services, either Azure App
Service or Azure Cloud Services.
Consider this architecture style for:
Applications with a relatively simple domain.
Applications with some long-running workflows or batch operations.
When you want to use managed services, rather than infrastructure as a service (IaaS).

Benefits
Relatively simple architecture that is easy to understand.
Easy to deploy and manage.
Clear separation of concerns.
The front end is decoupled from the worker using asynchronous messaging.
The front end and the worker can be scaled independently.

Challenges
Without careful design, the front end and the worker can become large, monolithic components that are
difficult to maintain and update.
There may be hidden dependencies, if the front end and worker share data schemas or code modules.

Best practices
Expose a well-designed API to the client. See API design best practices.
Autoscale to handle changes in load. See Autoscaling best practices.
Cache semi-static data. See Caching best practices.
Use a CDN to host static content. See CDN best practices.
Use polyglot persistence when appropriate. See Use the best data store for the job.
Partition data to improve scalability, reduce contention, and optimize performance. See Data partitioning best
practices.

Web-Queue-Worker on Azure App Service


This section describes a recommended Web-Queue-Worker architecture that uses Azure App Service.

The front end is implemented as an Azure App Service web app, and the worker is implemented as an Azure
Functions app. The web app and the function app are both associated with an App Service plan that
provides the VM instances.
You can use either Azure Service Bus or Azure Storage queues for the message queue. (The diagram shows
an Azure Storage queue.)
Azure Cache for Redis stores session state and other data that needs low latency access.
Azure CDN is used to cache static content such as images, CSS, or HTML.
For storage, choose the storage technologies that best fit the needs of the application. You might use
multiple storage technologies (polyglot persistence). To illustrate this idea, the diagram shows Azure SQL
Database and Azure Cosmos DB.
For more details, see App Service web application reference architecture.
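To show how the front end and worker might hand work off through the queue, here is a hypothetical sketch; the queue name, connection setting, and message fields are placeholders, and the Base64 encode policy is used so the message format matches the default expected by the Functions queue trigger.

```python
# Front end (App Service web app): enqueue a long-running work item instead of doing it inline.
# Queue name, connection setting, and message fields are hypothetical placeholders.
import json
import os

from azure.storage.queue import QueueClient, TextBase64EncodePolicy

queue = QueueClient.from_connection_string(
    os.environ["STORAGE_CONNECTION"],
    "work-items",
    message_encode_policy=TextBase64EncodePolicy())  # match the Functions trigger's default encoding
queue.send_message(json.dumps({"orderId": 123, "action": "generate-invoice"}))
```

```python
# Worker (Azure Functions app, Python v2 model): triggered by messages on the same queue.
import json
import logging

import azure.functions as func

app = func.FunctionApp()

@app.queue_trigger(arg_name="msg", queue_name="work-items", connection="STORAGE_CONNECTION")
def process_work_item(msg: func.QueueMessage):
    item = json.loads(msg.get_body().decode("utf-8"))
    logging.info("Generating invoice for order %s", item["orderId"])
```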
Additional considerations
Not every transaction has to go through the queue and worker to storage. The web front end can perform
simple read/write operations directly. Workers are designed for resource-intensive tasks or long-running
workflows. In some cases, you might not need a worker at all.
Use the built-in autoscale feature of App Service to scale out the number of VM instances. If the load on the
application follows predictable patterns, use schedule-based autoscale. If the load is unpredictable, use
metrics-based autoscaling rules.
Consider putting the web app and the function app into separate App Service plans. That way, they can be
scaled independently.
Use separate App Service plans for production and testing. Otherwise, if you use the same plan for
production and testing, it means your tests are running on your production VMs.
Use deployment slots to manage deployments. This lets you deploy an updated version to a staging slot,
then swap over to the new version. It also lets you swap back to the previous version, if there was a problem
with the update.
Ten design principles for Azure applications
11/2/2020 • 2 minutes to read

Follow these design principles to make your application more scalable, resilient, and manageable.
Design for self healing . In a distributed system, failures happen. Design your application to be self healing when
failures occur.
Make all things redundant . Build redundancy into your application, to avoid having single points of failure.
Minimize coordination . Minimize coordination between application services to achieve scalability.
Design to scale out . Design your application so that it can scale horizontally, adding or removing new instances
as demand requires.
Par tition around limits . Use partitioning to work around database, network, and compute limits.
Design for operations . Design your application so that the operations team has the tools they need.
Use managed ser vices . When possible, use platform as a service (PaaS) rather than infrastructure as a service
(IaaS).
Use the best data store for the job . Pick the storage technology that is the best fit for your data and how it will
be used.
Design for evolution . All successful applications change over time. An evolutionary design is key for continuous
innovation.
Build for the needs of business . Every design decision must be justified by a business requirement.
Design for self healing
11/2/2020 • 4 minutes to read

Design your application to be self healing when failures occur


In a distributed system, failures can happen. Hardware can fail. The network can have transient failures. Rarely, an
entire service or region may experience a disruption, but even those must be planned for.
Therefore, design an application to be self healing when failures occur. This requires a three-pronged approach:
Detect failures.
Respond to failures gracefully.
Log and monitor failures, to give operational insight.
How you respond to a particular type of failure may depend on your application's availability requirements. For
example, if you require very high availability, you might automatically fail over to a secondary region during a
regional outage. However, that will incur a higher cost than a single-region deployment.
Also, don't just consider big events like regional outages, which are generally rare. You should focus as much, if not
more, on handling local, short-lived failures, such as network connectivity failures or failed database connections.

Recommendations
Retr y failed operations . Transient failures may occur due to momentary loss of network connectivity, a dropped
database connection, or a timeout when a service is busy. Build retry logic into your application to handle transient
failures. For many Azure services, the client SDK implements automatic retries. For more information, see Transient
fault handling and the Retry pattern.
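Many Azure client SDKs retry transient failures for you; where you implement it yourself, a common shape is exponential backoff with jitter, sketched below. TransientError is a stand-in for whatever transient exception types your client library raises.

```python
# A sketch of retry with exponential backoff and jitter for transient failures.
# TransientError is a placeholder; catch the transient exception types of your client library.
import random
import time


class TransientError(Exception):
    """Stand-in for a transient failure (timeout, dropped connection, throttling)."""


def with_retries(operation, max_attempts=5, base_delay=0.5):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(delay)  # back off before retrying


# Usage (hypothetical call): result = with_retries(lambda: read_order_from_database(order_id))
```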
Protect failing remote ser vices (Circuit Breaker) . It's good to retry after a transient failure, but if the failure
persists, you can end up with too many callers hammering a failing service. This can lead to cascading failures, as
requests back up. Use the Circuit Breaker pattern to fail fast (without making the remote call) when an operation is
likely to fail.
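A minimal circuit breaker can be sketched in a few lines, as below; this is an assumption-laden sketch rather than a reference implementation, and production code (or an off-the-shelf library) would add thread safety, half-open probing limits, and telemetry.

```python
# A minimal circuit breaker sketch: fail fast while the circuit is open, then allow a probe call
# after a cool-down period. Not thread-safe; shown only to illustrate the pattern.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("Circuit open: failing fast without calling the remote service")
            self.opened_at = None  # half-open: let one probe call through
        try:
            result = operation()
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failure_count = 0  # success closes the circuit again
        return result
```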
Isolate critical resources (Bulkhead) . Failures in one subsystem can sometimes cascade. This can happen if a
failure causes some resources, such as threads or sockets, not to get freed in a timely manner, leading to resource
exhaustion. To avoid this, partition a system into isolated groups, so that a failure in one partition does not bring
down the entire system.
Perform load leveling . Applications may experience sudden spikes in traffic that can overwhelm services on the
backend. To avoid this, use the Queue-Based Load Leveling pattern to queue work items to run asynchronously. The
queue acts as a buffer that smooths out peaks in the load.
Fail over . If an instance can't be reached, fail over to another instance. For things that are stateless, like a web
server, put several instances behind a load balancer or traffic manager. For things that store state, like a database,
use replicas and fail over. Depending on the data store and how it replicates, this may require the application to
deal with eventual consistency.
Compensate failed transactions . In general, avoid distributed transactions, as they require coordination across
services and resources. Instead, compose an operation from smaller individual transactions. If the operation fails
midway through, use Compensating Transactions to undo any step that already completed.
Checkpoint long-running transactions . Checkpoints can provide resiliency if a long-running operation fails.
When the operation restarts (for example, it is picked up by another VM), it can be resumed from the last
checkpoint.
Degrade gracefully . Sometimes you can't work around a problem, but you can provide reduced functionality that
is still useful. Consider an application that shows a catalog of books. If the application can't retrieve the thumbnail
image for the cover, it might show a placeholder image. Entire subsystems might be noncritical for the application.
For example, in an e-commerce site, showing product recommendations is probably less critical than processing
orders.
Throttle clients. Sometimes a small number of users create excessive load, which can reduce your application's
availability for other users. In this situation, throttle the client for a certain period of time. See the Throttling pattern.
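As a sketch of the idea, the following Python snippet enforces a fixed-window quota per client. The quota and window length are illustrative assumptions; in practice this check usually lives in an API gateway or middleware and responds with HTTP 429.

    import time
    from collections import defaultdict

    class ClientThrottle:
        """Reject requests from a client that exceeds its quota within a fixed time window."""
        def __init__(self, max_requests=100, window_seconds=60):
            self.max_requests = max_requests
            self.window_seconds = window_seconds
            self.counters = defaultdict(lambda: [0.0, 0])  # client_id -> [window_start, count]

        def allow(self, client_id):
            window_start, count = self.counters[client_id]
            now = time.time()
            if now - window_start >= self.window_seconds:
                self.counters[client_id] = [now, 1]  # start a new window for this client
                return True
            if count >= self.max_requests:
                return False  # over quota: throttle this request
            self.counters[client_id][1] = count + 1
            return True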
Block bad actors. Just because you throttle a client, it doesn't mean the client was acting maliciously. It just means the
client exceeded their service quota. But if a client consistently exceeds their quota or otherwise behaves badly, you
might block them. Define an out-of-band process for users to request getting unblocked.
Use leader election . When you need to coordinate a task, use Leader Election to select a coordinator. That way,
the coordinator is not a single point of failure. If the coordinator fails, a new one is selected. Rather than implement
a leader election algorithm from scratch, consider an off-the-shelf solution such as Zookeeper.
Test with fault injection . All too often, the success path is well tested but not the failure path. A system could run
in production for a long time before a failure path is exercised. Use fault injection to test the resiliency of the
system to failures, either by triggering actual failures or by simulating them.
Embrace chaos engineering . Chaos engineering extends the notion of fault injection, by randomly injecting
failures or abnormal conditions into production instances.
For a structured approach to making your applications self-healing, see Design reliable applications for Azure.
Make all things redundant

Build redundancy into your application, to avoid having single points of failure
A resilient application routes around failure. Identify the critical paths in your application. Is there redundancy at
each point in the path? If a subsystem fails, will the application fail over to something else?

Recommendations
Consider business requirements . The amount of redundancy built into a system can affect both cost and
complexity. Your architecture should be informed by your business requirements, such as recovery time objective
(RTO). For example, a multi-region deployment is more expensive than a single-region deployment, and is more
complicated to manage. You will need operational procedures to handle failover and failback. The additional cost
and complexity might be justified for some business scenarios and not others.
Place VMs behind a load balancer . Don't use a single VM for mission-critical workloads. Instead, place multiple
VMs behind a load balancer. If any VM becomes unavailable, the load balancer distributes traffic to the remaining
healthy VMs. To learn how to deploy this configuration, see Multiple VMs for scalability and availability.


Replicate databases . Azure SQL Database and Cosmos DB automatically replicate the data within a region, and
you can enable geo-replication across regions. If you are using an IaaS database solution, choose one that supports
replication and failover, such as SQL Server Always On availability groups.
Enable geo-replication . Geo-replication for Azure SQL Database and Cosmos DB creates secondary readable
replicas of your data in one or more secondary regions. In the event of an outage, the database can fail over to the
secondary region for writes.
Partition for availability. Database partitioning is often used to improve scalability, but it can also improve
availability. If one shard goes down, the other shards can still be reached. A failure in one shard will only disrupt a
subset of the total transactions.
Deploy to more than one region . For the highest availability, deploy the application to more than one region.
That way, in the rare case when a problem affects an entire region, the application can fail over to another region.
The following diagram shows a multi-region application that uses Azure Traffic Manager to handle failover.
(Diagram: Azure Traffic Manager directing traffic to the application deployed in Region 1 and Region 2.)

Synchronize front and backend failover . Use Azure Traffic Manager to fail over the front end. If the front end
becomes unreachable in one region, Traffic Manager will route new requests to the secondary region. Depending
on your database solution, you may need to coordinate failing over the database.
Use automatic failover but manual failback . Use Traffic Manager for automatic failover, but not for automatic
failback. Automatic failback carries a risk that you might switch to the primary region before the region is
completely healthy. Instead, verify that all application subsystems are healthy before manually failing back. Also,
depending on the database, you might need to check data consistency before failing back.
Include redundancy for Traffic Manager . Traffic Manager is a possible failure point. Review the Traffic Manager
SLA, and determine whether using Traffic Manager alone meets your business requirements for high availability. If
not, consider adding another traffic management solution as a failback. If the Azure Traffic Manager service fails,
change your CNAME records in DNS to point to the other traffic management service.
Minimize coordination

Minimize coordination between application services to achieve scalability
Most cloud applications consist of multiple application services — web front ends, databases, business processes,
reporting and analysis, and so on. To achieve scalability and reliability, each of those services should run on
multiple instances.
What happens when two instances try to perform concurrent operations that affect some shared state? In some
cases, there must be coordination across nodes, for example to preserve ACID guarantees. In this diagram, Node2
is waiting for Node1 to release a database lock:

(Diagram: Node 1 updates the Orders table and holds a database lock; Node 2's update to OrderItems is blocked while it waits for the lock.)

Coordination limits the benefits of horizontal scale and creates bottlenecks. In this example, as you scale out the
application and add more instances, you'll see increased lock contention. In the worst case, the front-end instances
will spend most of their time waiting on locks.
"Exactly once" semantics are another frequent source of coordination. For example, an order must be processed
exactly once. Two workers are listening for new orders. Worker1 picks up an order for processing. The application
must ensure that Worker2 doesn't duplicate the work, but also if Worker1 crashes, the order isn't dropped.

(Diagram: order #123 is created and placed on the Orders queue; Worker 1 and Worker 2 both listen for orders to process.)

You can use a pattern such as Scheduler Agent Supervisor to coordinate between the workers, but in this case a
better approach might be to partition the work. Each worker is assigned a certain range of orders (say, by billing
region). If a worker crashes, a new instance picks up where the previous instance left off, but multiple instances
aren't contending.

Recommendations
Embrace eventual consistency. When data is distributed, it takes coordination to enforce strong consistency
guarantees. For example, suppose an operation updates two databases. Instead of putting it into a single
transaction scope, it's better if the system can accommodate eventual consistency, perhaps by using the
Compensating Transaction pattern to logically roll back after a failure.
Use domain events to synchronize state. A domain event is an event that records when something happens
that has significance within the domain. Interested services can listen for the event, rather than using a global
transaction to coordinate across multiple services. If this approach is used, the system must tolerate eventual
consistency (see previous item).
Consider patterns such as CQRS and event sourcing. These two patterns can help to reduce contention
between read workloads and write workloads.
The CQRS pattern separates read operations from write operations. In some implementations, the read data
is physically separated from the write data.
In the Event Sourcing pattern, state changes are recorded as a series of events to an append-only data store.
Appending an event to the stream is an atomic operation, requiring minimal locking.
These two patterns complement each other. If the write-only store in CQRS uses event sourcing, the read-only
store can listen for the same events to create a readable snapshot of the current state, optimized for queries. Before
adopting CQRS or event sourcing, however, be aware of the challenges of this approach.
Partition data. Avoid putting all of your data into one data schema that is shared across many application
services. A microservices architecture enforces this principle by making each service responsible for its own data
store. Within a single database, partitioning the data into shards can improve concurrency, because a service
writing to one shard does not affect a service writing to a different shard.
Design idempotent operations. When possible, design operations to be idempotent. That way, they can be
handled using at-least-once semantics. For example, you can put work items on a queue. If a worker crashes in the
middle of an operation, another worker simply picks up the work item.
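The following Python sketch illustrates an idempotent, at-least-once handler that skips duplicates by message ID. The message shape and the in-memory set are assumptions for the example; a real system would use a durable store shared by all workers.

    def process_order(order):
        # Placeholder for the real business logic.
        print(f"processing order {order}")

    processed_ids = set()  # assumption: in production this would be a durable, shared store

    def handle_order_message(message):
        """At-least-once handler: duplicates are detected by message ID and skipped."""
        if message["id"] in processed_ids:
            return  # already handled, by this worker or another; safe to drop
        process_order(message["order"])
        processed_ids.add(message["id"])

    handle_order_message({"id": "123", "order": {"sku": "abc", "qty": 1}})
    handle_order_message({"id": "123", "order": {"sku": "abc", "qty": 1}})  # duplicate, ignored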
Use asynchronous parallel processing. If an operation requires multiple steps that are performed
asynchronously (such as remote service calls), you might be able to call them in parallel, and then aggregate the
results. This approach assumes that each step does not depend on the results of the previous step.
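A minimal Python sketch of this fan-out/aggregate approach, using asyncio; the service names and delays are stand-ins for real remote calls and are purely illustrative.

    import asyncio

    async def call_service(name, delay):
        # Stand-in for an independent remote call.
        await asyncio.sleep(delay)
        return f"{name} result"

    async def aggregate():
        # Fan out the independent calls in parallel, then aggregate the results.
        results = await asyncio.gather(
            call_service("pricing", 0.2),
            call_service("inventory", 0.3),
            call_service("reviews", 0.1),
        )
        return dict(zip(["pricing", "inventory", "reviews"], results))

    print(asyncio.run(aggregate()))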
Use optimistic concurrency when possible. Pessimistic concurrency control uses database locks to prevent
conflicts. This can cause poor performance and reduce availability. With optimistic concurrency control, each
transaction modifies a copy or snapshot of the data. When the transaction is committed, the database engine
validates the transaction and rejects any transactions that would affect database consistency.
Azure SQL Database and SQL Server support optimistic concurrency through snapshot isolation. Some Azure
storage services support optimistic concurrency through the use of Etags, including Azure Cosmos DB and Azure
Storage.
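As a sketch of the ETag pattern, the following Python helper reads a value together with its ETag and writes it back only if the ETag still matches. The store interface (read, write, if_match) is hypothetical and not a specific Azure SDK; it only illustrates the conditional-update idea.

    class PreconditionFailed(Exception):
        """Raised by the store when the stored ETag no longer matches the one the caller read."""

    def update_with_etag(store, key, mutate, max_attempts=3):
        """Optimistic concurrency: read a value and its ETag, write back only if unchanged."""
        for _ in range(max_attempts):
            value, etag = store.read(key)  # hypothetical read returning (value, etag)
            try:
                store.write(key, mutate(value), if_match=etag)  # rejected if the ETag changed
                return
            except PreconditionFailed:
                continue  # another writer got there first; re-read and retry
        raise RuntimeError("Too much contention on key: " + key)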
Consider MapReduce or other parallel, distributed algorithms. Depending on the data and type of work to
be performed, you may be able to split the work into independent tasks that can be performed by multiple nodes
working in parallel. See Big compute architecture style.
Use leader election for coordination. In cases where you need to coordinate operations, make sure the
coordinator does not become a single point of failure in the application. Using the Leader Election pattern, one
instance is the leader at any time, and acts as the coordinator. If the leader fails, a new instance is elected to be the
leader.
Design to scale out

Design your application so that it can scale horizontally


A primary advantage of the cloud is elastic scaling — the ability to use as much capacity as you need, scaling out as
load increases, and scaling in when the extra capacity is not needed. Design your application so that it can scale
horizontally, adding or removing new instances as demand requires.

Recommendations
Avoid instance stickiness . Stickiness, or session affinity, is when requests from the same client are always
routed to the same server. Stickiness limits the application's ability to scale out. For example, traffic from a high-
volume user will not be distributed across instances. Causes of stickiness include storing session state in memory,
and using machine-specific keys for encryption. Make sure that any instance can handle any request.
Identify bottlenecks . Scaling out isn't a magic fix for every performance issue. For example, if your backend
database is the bottleneck, it won't help to add more web servers. Identify and resolve the bottlenecks in the
system first, before throwing more instances at the problem. Stateful parts of the system are the most likely cause
of bottlenecks.
Decompose workloads by scalability requirements. Applications often consist of multiple workloads, with
different requirements for scaling. For example, an application might have a public-facing site and a separate
administration site. The public site may experience sudden surges in traffic, while the administration site has a
smaller, more predictable load.
Offload resource-intensive tasks. Tasks that require a lot of CPU or I/O resources should be moved to
background jobs when possible, to minimize the load on the front end that is handling user requests.
Use built-in autoscaling features . Many Azure compute services have built-in support for autoscaling. If the
application has a predictable, regular workload, scale out on a schedule. For example, scale out during business
hours. Otherwise, if the workload is not predictable, use performance metrics such as CPU or request queue length
to trigger autoscaling. For autoscaling best practices, see Autoscaling.
Consider aggressive autoscaling for critical workloads . For critical workloads, you want to keep ahead of
demand. It's better to add new instances quickly under heavy load to handle the additional traffic, and then
gradually scale back.
Design for scale in. Remember that with elastic scale, the application will have periods of scale-in, when
instances get removed. The application must gracefully handle instances being removed. Here are some ways to
handle scale-in (a minimal shutdown-handling sketch in Python follows this list):
Listen for shutdown events (when available) and shut down cleanly.
Clients/consumers of a service should support transient fault handling and retry.
For long-running tasks, consider breaking up the work, using checkpoints or the Pipes and Filters pattern.
Put work items on a queue so that another instance can pick up the work, if an instance is removed in the
middle of processing.
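For example, a worker process can stop taking new work when it receives a termination signal and finish its in-flight items before exiting. The sketch below listens for SIGTERM; the exact shutdown signal and grace period depend on the hosting platform, so treat the details as assumptions.

    import signal
    import time

    shutting_down = False

    def request_shutdown(signum, frame):
        # The platform is removing this instance; stop taking new work.
        global shutting_down
        shutting_down = True

    signal.signal(signal.SIGTERM, request_shutdown)  # SIGTERM handling is platform-dependent

    while not shutting_down:
        # dequeue_and_process_one_item()  # placeholder for the real work loop
        time.sleep(1)

    # Before exiting: flush logs, release any leases, and checkpoint progress.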
Partition around limits

Use partitioning to work around database, network, and compute limits


In the cloud, all services have limits in their ability to scale up. Azure service limits are documented in Azure
subscription and service limits, quotas, and constraints. Limits include number of cores, database size, query
throughput, and network throughput. If your system grows sufficiently large, you may hit one or more of these
limits. Use partitioning to work around these limits.
There are many ways to partition a system, such as:
Partition a database to avoid limits on database size, data I/O, or number of concurrent sessions.
Partition a queue or message bus to avoid limits on the number of requests or the number of concurrent
connections.
Partition an App Service web app to avoid limits on the number of instances per App Service plan.
A database can be partitioned horizontally, vertically, or functionally.
In horizontal partitioning, also called sharding, each partition holds data for a subset of the total data set.
The partitions share the same data schema. For example, customers whose names start with A–M go into
one partition, N–Z into another partition.
In vertical partitioning, each partition holds a subset of the fields for the items in the data store. For
example, put frequently accessed fields in one partition, and less frequently accessed fields in another.
In functional partitioning, data is partitioned according to how it is used by each bounded context in the
system. For example, store invoice data in one partition and product inventory data in another. The schemas
are independent.
For more detailed guidance, see Data partitioning.

Recommendations
Partition different parts of the application. Databases are one obvious candidate for partitioning, but also
consider storage, cache, queues, and compute instances.
Design the partition key to avoid hotspots. If you partition a database, but one shard still gets the majority of
the requests, then you haven't solved your problem. Ideally, load gets distributed evenly across all the partitions.
For example, hash by customer ID and not the first letter of the customer name, because some letters are more
frequent. The same principle applies when partitioning a message queue. Pick a partition key that leads to an even
distribution of messages across the set of queues. For more information, see Sharding.
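A short Python sketch of choosing a partition from a stable hash of the customer ID rather than from its first letter; the partition count is an illustrative assumption.

    import hashlib

    def partition_for(customer_id, partition_count=16):
        """Pick a partition from a stable hash of the customer ID to spread load evenly."""
        digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % partition_count

    print(partition_for("customer-42"))  # the same ID always maps to the same partition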
Partition around Azure subscription and service limits. Individual components and services have limits, but
there are also limits for subscriptions and resource groups. For very large applications, you might need to partition
around those limits.
Partition at different levels. Consider a database server deployed on a VM. The VM has a VHD that is backed by
Azure Storage. The storage account belongs to an Azure subscription. Notice that each step in the hierarchy has
limits. The database server may have a connection pool limit. VMs have CPU and network limits. Storage has IOPS
limits. The subscription has limits on the number of VM cores. Generally, it's easier to partition lower in the
hierarchy. Only large applications should need to partition at the subscription level.
Design for operations

Design an application so that the operations team has the tools they
need
The cloud has dramatically changed the role of the operations team. They are no longer responsible for managing
the hardware and infrastructure that hosts the application. That said, operations is still a critical part of running a
successful cloud application. Some of the important functions of the operations team include:
Deployment
Monitoring
Escalation
Incident response
Security auditing
Robust logging and tracing are particularly important in cloud applications. Involve the operations team in design
and planning, to ensure the application gives them the data and insight they need to be successful.

Recommendations
Make all things observable. Once a solution is deployed and running, logs and traces are your primary insight
into the system. Tracing records a path through the system, and is useful to pinpoint bottlenecks, performance
issues, and failure points. Logging captures individual events such as application state changes, errors, and
exceptions. Log in production, or else you lose insight at the very times when you need it the most.
Instrument for monitoring . Monitoring gives insight into how well (or poorly) an application is performing, in
terms of availability, performance, and system health. For example, monitoring tells you whether you are meeting
your SLA. Monitoring happens during the normal operation of the system. It should be as close to real-time as
possible, so that the operations staff can react to issues quickly. Ideally, monitoring can help avert problems before
they lead to a critical failure. For more information, see Monitoring and diagnostics.
Instrument for root cause analysis . Root cause analysis is the process of finding the underlying cause of
failures. It occurs after a failure has already happened.
Use distributed tracing . Use a distributed tracing system that is designed for concurrency, asynchrony, and cloud
scale. Traces should include a correlation ID that flows across service boundaries. A single operation may involve
calls to multiple application services. If an operation fails, the correlation ID helps to pinpoint the cause of the
failure.
Standardize logs and metrics . The operations team will need to aggregate logs from across the various services
in your solution. If every service uses its own logging format, it becomes difficult or impossible to get useful
information from them. Define a common schema that includes fields such as correlation ID, event name, IP
address of the sender, and so forth. Individual services can derive custom schemas that inherit the base schema,
and contain additional fields.
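The following Python sketch emits one JSON log line per event, following a shared base schema (event name, correlation ID, service name) and adding service-specific fields. The field names are illustrative assumptions, not a prescribed schema.

    import json
    import logging
    import uuid

    logging.basicConfig(level=logging.INFO, format="%(message)s")
    logger = logging.getLogger("orders")

    def log_event(event_name, correlation_id, **fields):
        """Emit one JSON log line that follows a shared base schema across services."""
        record = {"event": event_name, "correlationId": correlation_id, "service": "orders"}
        record.update(fields)  # service-specific fields extend the base schema
        logger.info(json.dumps(record))

    correlation_id = str(uuid.uuid4())  # in practice, propagated from the incoming request
    log_event("OrderReceived", correlation_id, orderId="123", senderIp="10.0.0.5")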
Automate management tasks , including provisioning, deployment, and monitoring. Automating a task makes it
repeatable and less prone to human errors.
Treat configuration as code . Check configuration files into a version control system, so that you can track and
version your changes, and roll back if needed.
Use platform as a service (PaaS) options

When possible, use platform as a service (PaaS) rather than infrastructure as a service (IaaS)
IaaS is like having a box of parts. You can build anything, but you have to assemble it yourself. PaaS options are
easier to configure and administer. You don't need to provision VMs, set up VNets, manage patches and updates,
and all of the other overhead associated with running software on a VM.
For example, suppose your application needs a message queue. You could set up your own messaging service on a
VM, using something like RabbitMQ. But Azure Service Bus already provides reliable messaging as a service, and it's
simpler to set up. Just create a Service Bus namespace (which can be done as part of a deployment script) and then
call Service Bus using the client SDK.
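For example, a minimal sketch using the Python azure-servicebus SDK might look like the following. The connection string and queue name are placeholders that would come from configuration and the deployment script.

    from azure.servicebus import ServiceBusClient, ServiceBusMessage

    connection_string = "<Service Bus connection string>"  # assumption: supplied by configuration
    queue_name = "orders"                                   # assumption: created by the deployment script

    with ServiceBusClient.from_connection_string(connection_string) as client:
        with client.get_queue_sender(queue_name) as sender:
            sender.send_messages(ServiceBusMessage("order #123 created"))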
Of course, your application may have specific requirements that make an IaaS approach more suitable. However,
even if your application is based on IaaS, look for places where it may be natural to incorporate PaaS options.
These include cache, queues, and data storage.

Instead of running... | Consider using...
Active Directory | Azure Active Directory
Elasticsearch | Azure Search
Hadoop | HDInsight
IIS | App Service
MongoDB | Cosmos DB
Redis | Azure Cache for Redis
SQL Server | Azure SQL Database
File share | Azure NetApp Files

Please note that this is not meant to be an exhaustive list, but a subset of equivalent options.
Use the best data store for the job

Pick the storage technology that is the best fit for your data and how it
will be used
Gone are the days when you would just stick all of your data into a big relational SQL database. Relational
databases are very good at what they do — providing ACID guarantees for transactions over relational data. But
they come with some costs:
Queries may require expensive joins.
Data must be normalized and conform to a predefined schema (schema on write).
Lock contention may impact performance.
In any large solution, it's likely that a single data store technology won't fill all your needs. Alternatives to relational
databases include key/value stores, document databases, search engine databases, time series databases, column
family databases, and graph databases. Each has pros and cons, and different types of data fit more naturally into
one or another.
For example, you might store a product catalog in a document database, such as Cosmos DB, which allows for a
flexible schema. In that case, each product description is a self-contained document. For queries over the entire
catalog, you might index the catalog and store the index in Azure Search. Product inventory might go into a SQL
database, because that data requires ACID guarantees.
Remember that data includes more than just the persisted application data. It also includes application logs, events,
messages, and caches.

Recommendations
Don't use a relational database for everything. Consider other data stores when appropriate. See Choose the
right data store.
Embrace polyglot persistence . In any large solution, it's likely that a single data store technology won't fill all
your needs.
Consider the type of data . For example, put transactional data into SQL, put JSON documents into a document
database, put telemetry data into a time series database, put application logs in Elasticsearch, and put blobs in
Azure Blob Storage.
Prefer availability over (strong) consistency . The CAP theorem implies that a distributed system must make
trade-offs between availability and consistency. (Network partitions, the other leg of the CAP theorem, can never
be completely avoided.) Often, you can achieve higher availability by adopting an eventual consistency model.
Consider the skillset of the development team . There are advantages to using polyglot persistence, but it's
possible to go overboard. Adopting a new data storage technology requires a new set of skills. The development
team must understand how to get the most out of the technology. They must understand appropriate usage
patterns, how to optimize queries, tune for performance, and so on. Factor this in when considering storage
technologies.
Use compensating transactions. A side effect of polyglot persistence is that a single transaction might write data
to multiple stores. If something fails, use compensating transactions to undo any steps that already completed.
Look at bounded contexts . Bounded context is a term from domain driven design. A bounded context is an
explicit boundary around a domain model, and defines which parts of the domain the model applies to. Ideally, a
bounded context maps to a subdomain of the business domain. The bounded contexts in your system are a natural
place to consider polyglot persistence. For example, "products" may appear in both the Product Catalog
subdomain and the Product Inventory subdomain, but it's very likely that these two subdomains have different
requirements for storing, updating, and querying products.
Design for evolution

An evolutionary design is key for continuous innovation


All successful applications change over time, whether to fix bugs, add new features, bring in new technologies, or
make existing systems more scalable and resilient. If all the parts of an application are tightly coupled, it becomes
very hard to introduce changes into the system. A change in one part of the application may break another part, or
cause changes to ripple through the entire codebase.
This problem is not limited to monolithic applications. An application can be decomposed into services, but still
exhibit the sort of tight coupling that leaves the system rigid and brittle. But when services are designed to evolve,
teams can innovate and continuously deliver new features.
Microservices are becoming a popular way to achieve an evolutionary design, because they address many of the
considerations listed here.

Recommendations
Enforce high cohesion and loose coupling . A service is cohesive if it provides functionality that logically
belongs together. Services are loosely coupled if you can change one service without changing the other. High
cohesion generally means that changes in one function will require changes in other related functions. If you find
that updating a service requires coordinated updates to other services, it may be a sign that your services are not
cohesive. One of the goals of domain-driven design (DDD) is to identify those boundaries.
Encapsulate domain knowledge . When a client consumes a service, the responsibility for enforcing the
business rules of the domain should not fall on the client. Instead, the service should encapsulate all of the domain
knowledge that falls under its responsibility. Otherwise, every client has to enforce the business rules, and you end
up with domain knowledge spread across different parts of the application.
Use asynchronous messaging . Asynchronous messaging is a way to decouple the message producer from the
consumer. The producer does not depend on the consumer responding to the message or taking any particular
action. With a pub/sub architecture, the producer may not even know who is consuming the message. New
services can easily consume the messages without any modifications to the producer.
Don't build domain knowledge into a gateway . Gateways can be useful in a microservices architecture, for
things like request routing, protocol translation, load balancing, or authentication. However, the gateway should be
restricted to this sort of infrastructure functionality. It should not implement any domain knowledge, to avoid
becoming a heavy dependency.
Expose open interfaces . Avoid creating custom translation layers that sit between services. Instead, a service
should expose an API with a well-defined API contract. The API should be versioned, so that you can evolve the API
while maintaining backward compatibility. That way, you can update a service without coordinating updates to all
of the upstream services that depend on it. Public facing services should expose a RESTful API over HTTP. Backend
services might use an RPC-style messaging protocol for performance reasons.
Design and test against service contracts. When services expose well-defined APIs, you can develop and test
against those APIs. That way, you can develop and test an individual service without spinning up all of its
dependent services. (Of course, you would still perform integration and load testing against the real services.)
Abstract infrastructure away from domain logic . Don't let domain logic get mixed up with infrastructure-
related functionality, such as messaging or persistence. Otherwise, changes in the domain logic will require
updates to the infrastructure layers and vice versa.
Offload cross-cutting concerns to a separate service. For example, if several services need to authenticate
requests, you could move this functionality into its own service. Then you could evolve the authentication service
— for example, by adding a new authentication flow — without touching any of the services that use it.
Deploy services independently. When the DevOps team can deploy a single service independently of other
services in the application, updates can happen more quickly and safely. Bug fixes and new features can be rolled
out at a more regular cadence. Design both the application and the release process to support independent
updates.
Build for the needs of the business

Every design decision must be justified by a business requirement


This design principle may seem obvious, but it's crucial to keep in mind when designing a solution. Do you
anticipate millions of users, or a few thousand? Is a one-hour application outage acceptable? Do you expect large
bursts in traffic or a predictable workload? Ultimately, every design decision must be justified by a business
requirement.

Recommendations
Define business objectives , including the recovery time objective (RTO), recovery point objective (RPO), and
maximum tolerable outage (MTO). These numbers should inform decisions about the architecture. For example, to
achieve a low RTO, you might implement automated failover to a secondary region. But if your solution can
tolerate a higher RTO, that degree of redundancy might be unnecessary.
Document service level agreements (SLA) and service level objectives (SLO), including availability and
performance metrics. You might build a solution that delivers 99.95% availability. Is that enough? The answer is a
business decision.
Model the application around the business domain . Start by analyzing the business requirements. Use these
requirements to model the application. Consider using a domain-driven design (DDD) approach to create domain
models that reflect the business processes and use cases.
Capture both functional and nonfunctional requirements . Functional requirements let you judge whether
the application does the right thing. Nonfunctional requirements let you judge whether the application does those
things well. In particular, make sure that you understand your requirements for scalability, availability, and latency.
These requirements will influence design decisions and choice of technology.
Decompose by workload . The term "workload" in this context means a discrete capability or computing task,
which can be logically separated from other tasks. Different workloads may have different requirements for
availability, scalability, data consistency, and disaster recovery.
Plan for growth . A solution might meet your current needs, in terms of number of users, volume of transactions,
data storage, and so forth. However, a robust application can handle growth without major architectural changes.
See Design to scale out and Partition around limits. Also consider that your business model and business
requirements will likely change over time. If an application's service model and data models are too rigid, it
becomes hard to evolve the application for new use cases and scenarios. See Design for evolution.
Manage costs . In a traditional on-premises application, you pay upfront for hardware as a capital expenditure. In
a cloud application, you pay for the resources that you consume. Make sure that you understand the pricing model
for the services that you consume. The total cost will include network bandwidth usage, storage, IP addresses,
service consumption, and other factors. For more information, see Azure pricing. Also consider your operations
costs. In the cloud, you don't have to manage the hardware or other infrastructure, but you still need to manage
your applications, including DevOps, incident response, disaster recovery, and so forth.
Choose an Azure compute service for your
application

Azure offers a number of ways to host your application code. The term compute refers to the hosting model for
the computing resources that your application runs on. The following flowchart will help you to choose a compute
service for your application.
If your application consists of multiple workloads, evaluate each workload separately. A complete solution may
incorporate two or more compute services.

Choose a candidate service


Use the following flowchart to select a candidate compute service.

Definitions:
"Lift and shift" is a strategy for migrating a workload to the cloud without redesigning the application or
making code changes. Also called rehosting. For more information, see Azure migration center.
Cloud optimized is a strategy for migrating to the cloud by refactoring an application to take advantage of
cloud-native features and capabilities.
The output from this flowchart is a starting point for consideration. Next, perform a more detailed evaluation of
the service to see if it meets your needs.
This article includes several tables which may help you to make these tradeoff decisions. Based on this analysis,
you may find that the initial candidate isn't suitable for your particular application or workload. In that case,
expand your analysis to include other compute services.

Understand the basic features


If you're not familiar with the Azure service selected in the previous step, read the overview documentation to
understand the basics of the service.
App Service. A managed service for hosting web apps, mobile app back ends, RESTful APIs, or automated
business processes.
Azure Kubernetes Service (AKS). A managed Kubernetes service for running containerized applications.
Batch. A managed service for running large-scale parallel and high-performance computing (HPC) applications.
Container Instances. The fastest and simplest way to run a container in Azure, without having to provision any
virtual machines and without having to adopt a higher-level service.
Functions. A managed FaaS service.
Service Fabric. A distributed systems platform that can run in many environments, including Azure or on
premises.
Virtual machines. Deploy and manage VMs inside an Azure virtual network.

Understand the hosting models


Cloud services, including Azure services, generally fall into three categories: IaaS, PaaS, or FaaS. (There is also
SaaS, software-as-a-service, which is out of scope for this article.) It's useful to understand the differences.
Infrastructure-as-a-Service (IaaS) lets you provision individual VMs along with the associated networking and
storage components. Then you deploy whatever software and applications you want onto those VMs. This model
is the closest to a traditional on-premises environment, except that Microsoft manages the infrastructure. You still
manage the individual VMs.
Platform-as-a-Service (PaaS) provides a managed hosting environment, where you can deploy your application
without needing to manage VMs or networking resources. Azure App Service is a PaaS service.
Functions-as-a-Service (FaaS) goes even further in removing the need to worry about the hosting
environment. In a FaaS model, you simply deploy your code and the service automatically runs it. Azure Functions
are a FaaS service.
There is a spectrum from IaaS to pure PaaS. For example, Azure VMs can autoscale by using virtual machine scale
sets. This automatic scaling capability isn't strictly PaaS, but it's the type of management feature found in PaaS
services.
In general, there is a tradeoff between control and ease of management. IaaS gives the most control, flexibility, and
portability, but you have to provision, configure and manage the VMs and network components you create. FaaS
services automatically manage nearly all aspects of running an application. PaaS services fall somewhere in
between.

Criteria | Virtual Machines | App Service | Service Fabric | Azure Functions | Azure Kubernetes Service | Container Instances | Azure Batch
Application composition | Agnostic | Applications, containers | Services, guest executables, containers | Functions | Containers | Containers | Scheduled jobs
Density | Agnostic | Multiple apps per instance via app service plans | Multiple services per VM | Serverless (1) | Multiple containers per node | No dedicated instances | Multiple apps per VM
Minimum number of nodes | 1 (2) | 1 | 5 (3) | Serverless (1) | 3 (3) | No dedicated nodes | 1 (4)
State management | Stateless or stateful | Stateless | Stateless or stateful | Stateless | Stateless or stateful | Stateless | Stateless
Web hosting | Agnostic | Built in | Agnostic | Not applicable | Agnostic | Agnostic | No
Can be deployed to dedicated VNet? | Supported | Supported (5) | Supported | Supported (5) | Supported | Supported | Supported
Hybrid connectivity | Supported | Supported (6) | Supported | Supported (7) | Supported | Not supported | Supported
Notes
1. If using Consumption plan. If using App Service plan, functions run on the VMs allocated for your App Service
plan. See Choose the correct service plan for Azure Functions.
2. Higher SLA with two or more instances.
3. Recommended for production environments.
4. Can scale down to zero after job completes.
5. Requires App Service Environment (ASE).
6. Use Azure App Service Hybrid Connections.
7. Requires App Service plan or Azure Functions Premium plan.

DevOps
Criteria | Virtual Machines | App Service | Service Fabric | Azure Functions | Azure Kubernetes Service | Container Instances | Azure Batch
Local debugging | Agnostic | IIS Express, others (1) | Local node cluster | Visual Studio or Azure Functions CLI | Minikube, others | Local container runtime | Not supported
Programming model | Agnostic | Web and API applications, WebJobs for background tasks | Guest executable, Service model, Actor model, Containers | Functions with triggers | Agnostic | Agnostic | Command line application
Application update | No built-in support | Deployment slots | Rolling upgrade (per service) | Deployment slots | Rolling update | Not applicable | -

Notes
1. Options include IIS Express for ASP.NET or node.js (iisnode); PHP web server; Azure Toolkit for IntelliJ, Azure
Toolkit for Eclipse. App Service also supports remote debugging of deployed web app.
2. See Resource Manager providers, regions, API versions and schemas.

Scalability
Criteria | Virtual Machines | App Service | Service Fabric | Azure Functions | Azure Kubernetes Service | Container Instances | Azure Batch
Autoscaling | Virtual machine scale sets | Built-in service | Virtual machine scale sets | Built-in service | Pod auto-scaling (1), cluster auto-scaling (2) | Not supported | N/A
Load balancer | Azure Load Balancer | Integrated | Azure Load Balancer | Integrated | Azure Load Balancer or Application Gateway | No built-in support | Azure Load Balancer
Scale limit (3) | Platform image: 1000 nodes per scale set; custom image: 600 nodes per scale set | 20 instances, 100 with App Service Environment | 100 nodes per scale set | 200 instances per Function app | 100 nodes per cluster (default limit) | 20 container groups per subscription (default limit) | 20 core limit (default limit)

Notes
1. See Autoscale pods.
2. See Automatically scale a cluster to meet application demands on Azure Kubernetes Service (AKS).
3. See Azure subscription and service limits, quotas, and constraints.

Availability
Criteria | Virtual Machines | App Service | Service Fabric | Azure Functions | Azure Kubernetes Service | Container Instances | Azure Batch
SLA | SLA for Virtual Machines | SLA for App Service | SLA for Service Fabric | SLA for Functions | SLA for AKS | SLA for Container Instances | SLA for Azure Batch
Multi-region failover | Traffic Manager | Traffic Manager | Traffic Manager, Multi-Region Cluster | Azure Front Door | Traffic Manager | Not supported | Not supported

For guided learning on Service Guarantees, review Core Cloud Services - Azure architecture and service
guarantees.

Security
Review and understand the available security controls and visibility for each service
App Service
Azure Kubernetes Service
Batch
Container Instances
Functions
Service Fabric
Virtual machine - Windows
Virtual machine - LINUX

Other criteria
Criteria | Virtual Machines | App Service | Service Fabric | Azure Functions | Azure Kubernetes Service | Container Instances | Azure Batch
SSL | Configured in VM | Supported | Supported | Supported | Ingress controller | Use sidecar container | Supported
Cost | Windows, Linux | App Service pricing | Service Fabric pricing | Azure Functions pricing | AKS pricing | Container Instances pricing | Azure Batch pricing
Suitable architecture styles | N-Tier, Big compute (HPC) | Web-Queue-Worker, N-Tier | Microservices, Event-driven architecture | Microservices, Event-driven architecture | Microservices, Event-driven architecture | Microservices, task automation, batch jobs | Big compute (HPC)

The output from this flowchart is a starting point for consideration. Next, perform a more detailed evaluation of
the service to see if it meets your needs.

Consider limits and cost


Perform a more detailed evaluation looking at the following aspects of the service:
Service limits
Cost
SLA
Regional availability
Compute comparison tables

Next steps
Core Cloud Services - Azure compute options. This Microsoft Learn module explores how compute services
can solve common business needs.
Understand data store models

Modern business systems manage increasingly large volumes of heterogeneous data. This heterogeneity means
that a single data store is usually not the best approach. Instead, it's often better to store different types of data in
different data stores, each focused toward a specific workload or usage pattern. The term polyglot persistence is
used to describe solutions that use a mix of data store technologies. Therefore, it's important to understand the
main storage models and their tradeoffs.
Selecting the right data store for your requirements is a key design decision. There are literally hundreds of
implementations to choose from among SQL and NoSQL databases. Data stores are often categorized by how they
structure data and the types of operations they support. This article describes several of the most common storage
models. Note that a particular data store technology may support multiple storage models. For example, a
relational database management system (RDBMS) may also support key/value or graph storage. In fact, there is a
general trend for so-called multi-model support, where a single database system supports several models. But it's
still useful to understand the different models at a high level.
Not all data stores in a given category provide the same feature-set. Most data stores provide server-side
functionality to query and process data. Sometimes this functionality is built into the data storage engine. In other
cases, the data storage and processing capabilities are separated, and there may be several options for processing
and analysis. Data stores also support different programmatic and management interfaces.
Generally, you should start by considering which storage model is best suited for your requirements. Then consider
a particular data store within that category, based on factors such as feature set, cost, and ease of management.

Relational database management systems


Relational databases organize data as a series of two-dimensional tables with rows and columns. Most vendors
provide a dialect of the Structured Query Language (SQL) for retrieving and managing data. An RDBMS typically
implements a transactionally consistent mechanism that conforms to the ACID (Atomic, Consistent, Isolated,
Durable) model for updating information.
An RDBMS typically supports a schema-on-write model, where the data structure is defined ahead of time, and all
read or write operations must use the schema.
This model is very useful when strong consistency guarantees are important — where all changes are atomic, and
transactions always leave the data in a consistent state. However, an RDBMS generally can't scale out horizontally
without sharding the data in some way. Also, the data in an RDBMS must be normalized, which isn't appropriate for
every data set.
Azure services
Azure SQL Database | (Security Baseline)
Azure Database for MySQL | (Security Baseline)
Azure Database for PostgreSQL | (Security Baseline)
Azure Database for MariaDB | (Security Baseline)
Workload
Records are frequently created and updated.
Multiple operations have to be completed in a single transaction.
Relationships are enforced using database constraints.
Indexes are used to optimize query performance.
Data type
Data is highly normalized.
Database schemas are required and enforced.
Many-to-many relationships between data entities in the database.
Constraints are defined in the schema and imposed on any data in the database.
Data requires high integrity. Indexes and relationships need to be maintained accurately.
Data requires strong consistency. Transactions operate in a way that ensures all data are 100% consistent for all
users and processes.
Size of individual data entries is small to medium-sized.
Examples
Inventory management
Order management
Reporting database
Accounting

Key/value stores
A key/value store associates each data value with a unique key. Most key/value stores only support simple query,
insert, and delete operations. To modify a value (either partially or completely), an application must overwrite the
existing data for the entire value. In most implementations, reading or writing a single value is an atomic operation.
An application can store arbitrary data as a set of values. Any schema information must be provided by the
application. The key/value store simply retrieves or stores the value by key.

Key/value stores are highly optimized for applications performing simple lookups, but are less suitable if you need
to query data across different key/value stores. Key/value stores are also not optimized for querying by value.
A single key/value store can be extremely scalable, as the data store can easily distribute data across multiple nodes
on separate machines.
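As a brief illustration, the following Python sketch stores and reads a session entry using the open-source redis-py client against Azure Cache for Redis. The host name, access key, and expiration time are placeholders supplied by configuration; Azure Cache for Redis accepts TLS connections on port 6380.

    from redis import Redis  # redis-py client

    cache = Redis(
        host="<cache-name>.redis.cache.windows.net",  # assumption: from configuration
        port=6380,
        ssl=True,
        password="<access key>",                      # assumption: from configuration
    )

    # Store a session document under a key and let it expire after 30 minutes.
    cache.set("session:42", '{"userId": "42", "cart": ["sku-1", "sku-2"]}', ex=1800)
    print(cache.get("session:42"))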
Azure services
Azure Cosmos DB Table API, etcd API (preview), and SQL API | (Cosmos DB Security Baseline)
Azure Cache for Redis | (Security Baseline)
Azure Table Storage | (Security Baseline)
Workload
Data is accessed using a single key, like a dictionary.
No joins, lock, or unions are required.
No aggregation mechanisms are used.
Secondary indexes are generally not used.
Data type
Each key is associated with a single value.
There is no schema enforcement.
No relationships between entities.
Examples
Data caching
Session management
User preference and profile management
Product recommendation and ad serving

Document databases
A document database stores a collection of documents, where each document consists of named fields and data.
The data can be simple values or complex elements such as lists and child collections. Documents are retrieved by
unique keys.
Typically, a document contains the data for a single entity, such as a customer or an order. A document may contain
information that would be spread across several relational tables in an RDBMS. Documents don't need to have the
same structure. Applications can store different data in documents as business requirements change.
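As a sketch, the following Python snippet upserts one self-contained product document using the azure-cosmos SDK. The account endpoint, key, database and container names are placeholders, and the container's partition key is assumed to be /id for simplicity.

    from azure.cosmos import CosmosClient

    client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
    container = client.get_database_client("catalog").get_container_client("products")

    # Each product is a self-contained document; different products may carry different optional fields.
    container.upsert_item({
        "id": "sku-123",
        "name": "Trail running shoe",
        "sizes": [8, 9, 10],
        "attributes": {"color": "blue", "waterproof": True},
    })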

Azure service
Azure Cosmos DB SQL API | (Cosmos DB Security Baseline)
Workload
Insert and update operations are common.
No object-relational impedance mismatch. Documents can better match the object structures used in application
code.
Individual documents are retrieved and written as a single block.
Data requires index on multiple fields.
Data type
Data can be managed in de-normalized way.
Size of individual document data is relatively small.
Each document type can use its own schema.
Documents can include optional fields.
Document data is semi-structured, meaning that data types of each field are not strictly defined.
Examples
Product catalog
Content management
Inventory management

Graph databases
A graph database stores two types of information, nodes and edges. Edges specify relationships between nodes.
Nodes and edges can have properties that provide information about that node or edge, similar to columns in a
table. Edges can also have a direction indicating the nature of the relationship.
Graph databases can efficiently perform queries across the network of nodes and edges and analyze the
relationships between entities. The following diagram shows an organization's personnel database structured as a
graph. The entities are employees and departments, and the edges indicate reporting relationships and the
departments in which employees work.

This structure makes it straightforward to perform queries such as "Find all employees who report directly or
indirectly to Sarah" or "Who works in the same department as John?" For large graphs with lots of entities and
relationships, you can perform very complex analyses very quickly. Many graph databases provide a query
language that you can use to traverse a network of relationships efficiently.
Azure services
Azure Cosmos DB Gremlin API | (Security Baseline)
SQL Server | (Security Baseline)
Workload
Complex relationships between data items involving many hops between related data items.
The relationship between data items are dynamic and change over time.
Relationships between objects are first-class citizens, without requiring foreign-keys and joins to traverse.
Data type
Nodes and relationships.
Nodes are similar to table rows or JSON documents.
Relationships are just as important as nodes, and are exposed directly in the query language.
Composite objects, such as a person with multiple phone numbers, tend to be broken into separate, smaller
nodes, combined with traversable relationships
Examples
Organization charts
Social graphs
Fraud detection
Recommendation engines

Data analytics
Data analytics stores provide massively parallel solutions for ingesting, storing, and analyzing data. The data is
distributed across multiple servers to maximize scalability. Large data file formats such as delimited files (CSV),
Parquet, and ORC are widely used in data analytics. Historical data is typically stored in data stores such as blob
storage or Azure Data Lake Storage Gen2, which are then accessed by Azure Synapse, Databricks, or HDInsight as
external tables. A typical scenario that uses data stored as Parquet files for performance is described in the article Use
external tables with Synapse SQL.
Azure services
Azure Synapse Analytics | (Security Baseline)
Azure Data Lake | (Security Baseline)
Azure Data Explorer | (Security Baseline)
Azure Analysis Services
HDInsight | (Security Baseline)
Azure Databricks | (Security Baseline)
Workload
Data analytics
Enterprise BI
Data type
Historical data from multiple sources.
Usually denormalized in a "star" or "snowflake" schema, consisting of fact and dimension tables.
Usually loaded with new data on a scheduled basis.
Dimension tables often include multiple historic versions of an entity, referred to as a slowly changing
dimension.
Examples
Enterprise data warehouse

Column-family databases
A column-family database organizes data into rows and columns. In its simplest form, a column-family database
can appear very similar to a relational database, at least conceptually. The real power of a column-family database
lies in its denormalized approach to structuring sparse data.
You can think of a column-family database as holding tabular data with rows and columns, but the columns are
divided into groups known as column families. Each column family holds a set of columns that are logically related
together and are typically retrieved or manipulated as a unit. Other data that is accessed separately can be stored in
separate column families. Within a column family, new columns can be added dynamically, and rows can be sparse
(that is, a row doesn't need to have a value for every column).
The following diagram shows an example with two column families, Identity and Contact Info . The data for a
single entity has the same row key in each column-family. This structure, where the rows for any given object in a
column family can vary dynamically, is an important benefit of the column-family approach, making this form of
data store highly suited for storing structured, volatile data.

Unlike a key/value store or a document database, most column-family databases store data in key order, rather than
by computing a hash. Many implementations allow you to create indexes over specific columns in a column-family.
Indexes let you retrieve data by column value, rather than by row key.
Read and write operations for a row are usually atomic within a single column-family, although some
implementations provide atomicity across the entire row, spanning multiple column-families.
Azure services
Azure Cosmos DB Cassandra API | (Security Baseline)
HBase in HDInsight | (Security Baseline)
Workload
Most column-family databases perform write operations extremely quickly.
Update and delete operations are rare.
Designed to provide high throughput and low-latency access.
Supports easy query access to a particular set of fields within a much larger record.
Massively scalable.
Data type
Data is stored in tables consisting of a key column and one or more column families.
Specific columns can vary by individual rows.
Individual cells are accessed via get and put commands.
Multiple rows are returned using a scan command.
Examples
Recommendations
Personalization
Sensor data
Telemetry
Messaging
Social media analytics
Web analytics
Activity monitoring
Weather and other time-series data

Search Engine Databases


A search engine database allows applications to search for information held in external data stores. A search engine
database can index massive volumes of data and provide near real-time access to these indexes.
Indexes can be multi-dimensional and may support free-text searches across large volumes of text data. Indexing
can be performed using a pull model, triggered by the search engine database, or using a push model, initiated by
external application code.
Searching can be exact or fuzzy. A fuzzy search finds documents that match a set of terms and calculates how
closely they match. Some search engines also support linguistic analysis that can return matches based on
synonyms, genre expansions (for example, matching dogs to pets ), and stemming (matching words with the
same root).
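As a rough illustration, the following sketch runs a fuzzy, full-text query against an index by using the azure-search-documents Python package. The service name, index name, key, and field names are placeholders, and the full Lucene query syntax (query_type="full") is assumed so that the fuzzy operator (~) is interpreted.

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="products",
    credential=AzureKeyCredential("<query-key>"))

# "catalogue~1" matches terms within an edit distance of 1, such as "catalog".
results = client.search(search_text="catalogue~1", query_type="full", top=10)
for doc in results:
    print(doc["id"], doc["@search.score"])
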
Azure service
Azure Search | (Security Baseline)
Workload
Data indexes from multiple sources and services.
Queries are ad-hoc and can be complex.
Full text search is required.
Ad hoc self-service query is required.
Data type
Semi-structured or unstructured text
Text with reference to structured data
Examples
Product catalogs
Site search
Logging

Time series databases


Time series data is a set of values organized by time. Time series databases typically collect large amounts of data in
real time from a large number of sources. Updates are rare, and deletes are often done as bulk operations. Although
the records written to a time-series database are generally small, there are often a large number of records, and
total data size can grow rapidly.
Azure service
Azure Time Series Insights
Workload
Records are generally appended sequentially in time order.
An overwhelming proportion of operations (95-99%) are writes.
Updates are rare.
Deletes occur in bulk, and are made to contiguous blocks of records.
Data is read sequentially in either ascending or descending time order, often in parallel.
Data type
A timestamp is used as the primary key and sorting mechanism.
Tags may define additional information about the type, origin, and other information about the entry.
Examples
Monitoring and event telemetry.
Sensor or other IoT data.

Object storage
Object storage is optimized for storing and retrieving large binary objects (images, files, video and audio streams,
large application data objects and documents, virtual machine disk images). Large data files, such as delimited
(CSV), parquet, and ORC files, are also commonly stored in this model. Object stores can manage extremely large
amounts of unstructured data.
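For example, a minimal sketch of uploading and downloading a large binary object with the azure-storage-blob Python package might look like the following; the connection string, container, blob, and file names are placeholders.

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="backups", blob="db/2020-11-02.bak")

# Upload a large binary object; the SDK chunks the transfer automatically.
with open("2020-11-02.bak", "rb") as data:
    blob.upload_blob(data, overwrite=True)

# Download the same object and write it to a local file.
with open("restore.bak", "wb") as out:
    out.write(blob.download_blob().readall())
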
Azure service
Azure Blob Storage | (Security Baseline)
Azure Data Lake Storage Gen2 | (Security Baseline)
Workload
Identified by key.
Content is typically an asset such as a delimited file, image, or video file.
Content must be durable and external to any application tier.
Data type
Data size is large.
Value is opaque.
Examples
Images, videos, office documents, PDFs
Static HTML, JSON, CSS
Log and audit files
Database backups

Shared files
Sometimes, using simple flat files can be the most effective means of storing and retrieving information. Using file
shares enables files to be accessed across a network. Given appropriate security and concurrent access control
mechanisms, sharing data in this way can enable distributed services to provide highly scalable data access for
performing basic, low-level operations such as simple read and write requests.
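Besides mounting the share over SMB, application code can also reach the same files through the Azure Files SDKs. The following sketch uses the azure-storage-file-share Python package; the share name and file paths are placeholders.

from azure.storage.fileshare import ShareFileClient

file_client = ShareFileClient.from_connection_string(
    "<storage-connection-string>",
    share_name="legacy-share",
    file_path="reports/summary.csv")

# Upload a local file to the share, then read it back.
with open("summary.csv", "rb") as source:
    file_client.upload_file(source)

data = file_client.download_file().readall()
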
Azure service
Azure Files | (Security Baseline)
Workload
Migration from existing apps that interact with the file system.
Requires SMB interface.
Data type
Files in a hierarchical set of folders.
Accessible with standard I/O libraries.
Examples
Legacy files
Shared content accessible among a number of VMs or app instances
Aided with this understanding of different data storage models, the next step is to evaluate your workload and
application, and decide which data store will meet your specific needs. Use the data storage decision tree to help
with this process.
Select an Azure data store for your application
11/2/2020 • 2 minutes to read

Azure offers a number of managed data storage solutions, each providing different features and capabilities. This
article will help you to choose a data store for your application.
If your application consists of multiple workloads, evaluate each workload separately. A complete solution may
incorporate multiple data stores.
Use the following flowchart to select a candidate data store.
Answer each question in turn; if the answer is no, move on to the next question.
1. Is the data relational? If you require compatibility with a specific open-source engine, choose Azure Database for MySQL, Azure Database for PostgreSQL, or Azure Database for MariaDB; otherwise, choose SQL Database.
2. Is the data semi-structured, using a schema-on-read approach? Choose Cosmos DB with the SQL API, or with the Cassandra or MongoDB API if you need compatibility with those engines.
3. Do you need an SMB interface? Choose Azure Files.
4. Is the data being archived? Choose Blob Storage with the cool or archive access tier.
5. Do you need to index the data for search? Choose Azure Search.
6. Is it time series data? Choose Time Series Insights.
7. Is it large data (TB/PB) for analysis? For a file system store, choose Data Lake Store; for an object store, analyze the data with Azure Synapse, Databricks, or Azure Analysis Services.
8. Is it binary blobs (images, PDF files, and so on)? Choose Blob Storage.
9. Is it graph data? Choose Cosmos DB with the Graph API.
10. Is it transient data? Choose Azure Cache for Redis.
11. Otherwise, choose Cosmos DB with the SQL API.
The output from this flowchart is a starting point for consideration. Next, perform a more detailed evaluation of
the data store to see if it meets your needs. Refer to Criteria for choosing a data store to aid in this evaluation.
Criteria for choosing a data store
11/2/2020 • 3 minutes to read

This article describes the comparison criteria you should use when evaluating a data store. The goal is to help you
determine which data storage types can meet your solution's requirements.

General considerations
Keep the following considerations in mind when making your selection.
Functional requirements
Data format . What type of data are you intending to store? Common types include transactional data,
JSON objects, telemetry, search indexes, or flat files.
Data size . How large are the entities you need to store? Will these entities need to be maintained as a
single document, or can they be split across multiple documents, tables, collections, and so forth?
Scale and structure . What is the overall amount of storage capacity you need? Do you anticipate
partitioning your data?
Data relationships . Will your data need to support one-to-many or many-to-many relationships? Are
relationships themselves an important part of the data? Will you need to join or otherwise combine data
from within the same dataset, or from external datasets?
Consistency model . How important is it for updates made in one node to appear in other nodes, before
further changes can be made? Can you accept eventual consistency? Do you need ACID guarantees for
transactions?
Schema flexibility . What kind of schemas will you apply to your data? Will you use a fixed schema, a
schema-on-write approach, or a schema-on-read approach?
Concurrency . What kind of concurrency mechanism do you want to use when updating and synchronizing
data? Will the application perform many updates that could potentially conflict? If so, you may require
record locking and pessimistic concurrency control. Alternatively, can you support optimistic concurrency
controls? If so, is simple timestamp-based concurrency control enough, or do you need the added
functionality of multi-version concurrency control?
Data movement . Will your solution need to perform ETL tasks to move data to other stores or data
warehouses?
Data lifecycle . Is the data write-once, read-many? Can it be moved into cool or cold storage?
Other supported features. Do you need any other specific features, such as schema validation,
aggregation, indexing, full-text search, MapReduce, or other query capabilities?
Non-functional requirements
Performance and scalability . What are your data performance requirements? Do you have specific
requirements for data ingestion rates and data processing rates? What are the acceptable response times
for querying and aggregation of data once ingested? How large will you need the data store to scale up? Is
your workload more read-heavy or write-heavy?
Reliability . What overall SLA do you need to support? What level of fault-tolerance do you need to provide
for data consumers? What kind of backup and restore capabilities do you need?
Replication . Will your data need to be distributed among multiple replicas or regions? What kind of data
replication capabilities do you require?
Limits . Will the limits of a particular data store support your requirements for scale, number of
connections, and throughput?
Management and cost
Managed service. When possible, use a managed data service, unless you require specific capabilities that
can only be found in an IaaS-hosted data store.
Region availability . For managed services, is the service available in all Azure regions? Does your
solution need to be hosted in certain Azure regions?
Portability. Will your data need to be migrated to on-premises, external datacenters, or other cloud
hosting environments?
Licensing . Do you have a preference of a proprietary versus OSS license type? Are there any other external
restrictions on what type of license you can use?
Overall cost . What is the overall cost of using the service within your solution? How many instances will
need to run, to support your uptime and throughput requirements? Consider operations costs in this
calculation. One reason to prefer managed services is the reduced operational cost.
Cost effectiveness . Can you partition your data, to store it more cost effectively? For example, can you
move large objects out of an expensive relational database into an object store?
Security
Security . What type of encryption do you require? Do you need encryption at rest? What authentication
mechanism do you want to use to connect to your data?
Auditing . What kind of audit log do you need to generate?
Networking requirements . Do you need to restrict or otherwise manage access to your data from other
network resources? Does data need to be accessible only from inside the Azure environment? Does the data
need to be accessible from specific IP addresses or subnets? Does it need to be accessible from applications
or services hosted on-premises or in other external datacenters?
DevOps
Skill set . Are there particular programming languages, operating systems, or other technology that your
team is particularly adept at using? Are there others that would be difficult for your team to work with?
Clients. Is there good client support for your development languages?
Overview of load-balancing options in Azure
11/2/2020 • 4 minutes to read

The term load balancing refers to the distribution of workloads across multiple computing resources. Load
balancing aims to optimize resource use, maximize throughput, minimize response time, and avoid overloading
any single resource. It can also improve availability by sharing a workload across redundant computing resources.

Overview
Azure load balancing services can be categorized along two dimensions: global versus regional, and HTTP(S)
versus non-HTTP(S).
Global versus regional
Global load-balancing services distribute traffic across regional backends, clouds, or hybrid on-premises
services. These services route end-user traffic to the closest available backend. They also react to changes in
service reliability or performance, in order to maximize availability and performance. You can think of them
as systems that load balance between application stamps, endpoints, or scale-units hosted across different
regions/geographies.
Regional load-balancing services distribute traffic within virtual networks across virtual machines (VMs) or
zonal and zone-redundant service endpoints within a region. You can think of them as systems that load
balance between VMs, containers, or clusters within a region in a virtual network.
HTTP(S ) versus non-HTTP(S )
HTTP(S) load-balancing services are Layer 7 load balancers that only accept HTTP(S) traffic. They are
intended for web applications or other HTTP(S) endpoints. They include features such as SSL offload, web
application firewall, path-based load balancing, and session affinity.
Non-HTTP/S load-balancing services can handle non-HTTP(S) traffic and are recommended for non-web
workloads.
The following table summarizes the Azure load balancing services by these categories:

SERVICE               GLOBAL/REGIONAL   RECOMMENDED TRAFFIC

Azure Front Door      Global            HTTP(S)

Traffic Manager       Global            non-HTTP(S)

Application Gateway   Regional          HTTP(S)

Azure Load Balancer   Regional          non-HTTP(S)

Azure load balancing services


Here are the main load-balancing services currently available in Azure:
Front Door is an application delivery network that provides global load balancing and site acceleration service for
web applications. It offers Layer 7 capabilities for your application like SSL offload, path-based routing, fast failover,
caching, etc. to improve performance and high-availability of your applications.
NOTE
At this time, Azure Front Door does not support Web Sockets.

Traffic Manager is a DNS-based traffic load balancer that enables you to distribute traffic optimally to services
across global Azure regions, while providing high availability and responsiveness. Because Traffic Manager is a
DNS-based load-balancing service, it load balances only at the domain level. For that reason, it can't fail over as
quickly as Front Door, because of common challenges around DNS caching and systems not honoring DNS TTLs.
Application Gateway provides application delivery controller (ADC) as a service, offering various Layer 7 load-
balancing capabilities. Use it to optimize web farm productivity by offloading CPU-intensive SSL termination to the
gateway.
Azure Load Balancer is a high-performance, ultra low-latency Layer 4 load-balancing service (inbound and
outbound) for all UDP and TCP protocols. It is built to handle millions of requests per second while ensuring your
solution is highly available. Azure Load Balancer is zone-redundant, ensuring high availability across Availability
Zones.

Decision tree for load balancing in Azure


When selecting the load-balancing options, here are some factors to consider:
Traffic type . Is it a web (HTTP/HTTPS) application? Is it public facing or a private application?
Global versus regional. Do you need to load balance VMs or containers within a virtual network, or load
balance scale unit/deployments across regions, or both?
Availability . What is the service SLA?
Cost . See Azure pricing. In addition to the cost of the service itself, consider the operations cost for managing a
solution built on that service.
Features and limits . What are the overall limitations of each service? See Service limits.
The following flowchart will help you to choose a load-balancing solution for your application. The flowchart
guides you through a set of key decision criteria to reach a recommendation.
Treat this flowchart as a starting point. Every application has unique requirements, so use the
recommendation as a starting point. Then perform a more detailed evaluation.
If your application consists of multiple workloads, evaluate each workload separately. A complete solution may
incorporate two or more load-balancing solutions.
Definitions
Internet facing . Applications that are publicly accessible from the internet. As a best practice, application
owners apply restrictive access policies or protect the application by setting up offerings like web
application firewall and DDoS protection.
Global . End users or clients are located beyond a small geographical area. For example, users across
multiple continents, across countries/regions within a continent, or even across multiple metropolitan areas
within a larger country/region.
PaaS . Platform as a service (PaaS) services provide a managed hosting environment, where you can deploy
your application without needing to manage VMs or networking resources. In this case, PaaS refers to
services that provide integrated load balancing within a region. See Choosing a compute service –
Scalability.
IaaS . Infrastructure as a service (IaaS) is a computing option where you provision the VMs that you need,
along with associated network and storage components. IaaS applications require internal load balancing
within a virtual network, using Azure Load Balancer.
Application-layer processing refers to special routing within a virtual network. For example, path-based
routing within the virtual network across VMs or virtual machine scale sets. For more information, see
When should we deploy an Application Gateway behind Front Door?.
Asynchronous messaging options in Azure
11/2/2020 • 17 minutes to read

This article describes the different types of messages and the entities that participate in a messaging infrastructure.
Based on the requirements of each message type, the article recommends Azure messaging services. The options
include Azure Service Bus, Event Grid, and Event Hubs.
At an architectural level, a message is a datagram created by an entity (producer), to distribute information so that
other entities (consumers) can be aware and act accordingly. The producer and the consumer can communicate
directly or optionally through an intermediary entity (message broker). This article focuses on asynchronous
messaging using a message broker.

Messages can be classified into two main categories. If the producer expects an action from the consumer, that
message is a command. If the message informs the consumer that an action has taken place, then the message is
an event.

Commands
The producer sends a command with the intent that the consumer(s) will perform an operation within the scope of
a business transaction.
A command is a high-value message and must be delivered at least once. If a command is lost, the entire business
transaction might fail. Also, a command shouldn't be processed more than once. Doing so might cause an
erroneous transaction. A customer might get duplicate orders or billed twice.
Commands are often used to manage the workflow of a multistep business transaction. Depending on the
business logic, the producer may expect the consumer to acknowledge the message and report the results of the
operation. Based on that result, the producer may choose an appropriate course of action.

Events
An event is a type of message that a producer raises to announce facts.
The producer (known as the publisher in this context) has no expectations that the events will result in any action.
Interested consumers can subscribe, listen for events, and take actions depending on their consumption scenario.
Events can have multiple subscribers or no subscribers at all. Two different subscribers can react to an event with
different actions and not be aware of one another.
The producer and consumer are loosely coupled and managed independently. The consumer isn't expected to
acknowledge the event back to the producer. A consumer that is no longer interested in the events can
unsubscribe. The consumer is removed from the pipeline without affecting the producer or the overall functionality
of the system.
There are two categories of events:
The producer raises events to announce discrete facts. A common use case is event notification. For
example, Azure Resource Manager raises events when it creates, modifies, or deletes resources. A subscriber
of those events could be a Logic App that sends alert emails.
The producer raises related events in a sequence, or a stream of events, over a period of time. Typically, a
stream is consumed for statistical evaluation. The evaluation can be done within a temporal window or as
events arrive. Telemetry is a common use case, for example, health and load monitoring of a system.
Another case is event streaming from IoT devices.
A common pattern for implementing event messaging is the Publisher-Subscriber pattern.

Role and benefits of a message broker


An intermediate message broker provides the functionality of moving messages from producer to consumer and
can offer additional benefits.
Decoupling
A message broker decouples the producer from the consumer in the logic that generates and uses the messages,
respectively. In a complex workflow, the broker can encourage business operations to be decoupled and help
coordinate the workflow.
For example, a single business transaction requires distinct operations that are performed in a business logic
sequence. The producer issues a command that signals a consumer to start an operation. The consumer
acknowledges the message in a separate queue reserved for lining up responses for the producer. Only after
receiving the response, the producer sends a new message to start the next operation in the sequence. A different
consumer processes that message and sends a completion message to the response queue. By using messaging,
the services coordinate the workflow of the transaction among themselves.
A message broker provides temporal decoupling. The producer and consumer don't have to run concurrently. A
producer can send a message to the message broker regardless of the availability of the consumer. Conversely, the
consumer isn't restricted by the producer's availability.
For example, the user interface of a web app generates messages and uses a queue as the message broker. When
ready, consumers can retrieve messages from the queue and perform the work. Temporal decoupling helps the
user interface to remain responsive. It's not blocked while the messages are handled asynchronously.
Certain operations can take a long time to complete. After issuing a command, the producer shouldn't have to wait until
the consumer completes it. A message broker helps asynchronous processing of messages.
Load balancing
Producers may post a large number of messages that are serviced by many consumers. Use a message broker to
distribute processing across servers and improve throughput. Consumers can run on different servers to spread
the load. Consumers can be added dynamically to scale out the system when needed or removed otherwise.

The Competing Consumers Pattern explains how to process multiple messages concurrently to optimize
throughput, to improve scalability and availability, and to balance the workload.
Load leveling
The volume of messages generated by the producer or a group of producers can be variable. At times there might
be a large volume causing spikes in messages. Instead of adding consumers to handle this work, a message broker
can act as a buffer, and consumers gradually drain messages at their own pace without stressing the system.

The Queue-based Load Leveling Pattern provides more information.


Reliable messaging
A message broker helps ensure that messages aren't lost even if communication fails between the producer and
consumer. The producer can post messages to the message broker and the consumer can retrieve them when
communication is reestablished. The producer isn't blocked unless it loses connectivity with the message broker.
Resilient messaging
A message broker can add resiliency to the consumers in your system. If a consumer fails while processing a
message, another instance of the consumer can process that message. The reprocessing is possible because the
message persists in the broker.

Technology choices for a message broker


Azure provides several message broker services, each with a range of features. Before choosing a service,
determine the intent and requirements of the message.
Azure Service Bus
Azure Service Bus queues are well suited for transferring commands from producers to consumers. Here are some
considerations.
Pull model
A consumer of a Service Bus queue constantly polls Service Bus to check if new messages are available. The client
SDKs and Azure Functions trigger for Service Bus abstract that model. When a new message is available, the
consumer's callback is invoked, and the message is sent to the consumer.
Guaranteed delivery
Service Bus allows a consumer to peek the queue and lock a message from other consumers.
It's the responsibility of the consumer to report the processing status of the message. Only when the consumer
marks the message as consumed, Service Bus removes the message from the queue. If a failure, timeout, or crash
occurs, Service Bus unlocks the message so that other consumers can retrieve it. This way messages aren't lost in
transfer.
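The following sketch shows this pattern with the azure-servicebus Python package: the receiver holds a lock on each message and either completes it (removing it from the queue) or abandons it so another consumer can retry. The connection string, queue name, and processing logic are placeholders.

from azure.servicebus import ServiceBusClient, ServiceBusMessage

conn_str = "<service-bus-connection-string>"
queue = "orders"

def handle(message):
    print("processing", str(message))   # your business logic goes here

with ServiceBusClient.from_connection_string(conn_str) as client:
    # Producer: send a command.
    with client.get_queue_sender(queue_name=queue) as sender:
        sender.send_messages(ServiceBusMessage("process-order:1234"))

    # Consumer: messages are locked (peek-lock) until they are explicitly settled.
    with client.get_queue_receiver(queue_name=queue, max_wait_time=5) as receiver:
        for msg in receiver:
            try:
                handle(msg)
                receiver.complete_message(msg)   # removes the message from the queue
            except Exception:
                receiver.abandon_message(msg)    # unlocks it for another consumer
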
A producer might accidentally send the same message twice. For instance, a producer instance fails after sending a
message. Another producer replaces the original instance and sends the message again. Azure Service Bus queues
provide a built-in de-duping capability that detects and removes duplicate messages. There's still a chance that a
message is delivered twice. For example, if a consumer fails while processing, the message is returned to the queue
and is retrieved by the same or another consumer. The message processing logic in the consumer should be
idempotent so that even if the work is repeated, the state of the system isn't changed. For more information about
idempotency, see Idempotency Patterns on Jonathon Oliver's blog.
Message ordering
If you want consumers to get the messages in the order they are sent, Service Bus queues guarantee first-in-first-
out (FIFO) ordered delivery by using sessions. A session can have one or more messages. The messages are
correlated with the SessionId property. Messages that are part of a session never expire. A session can be locked
to a consumer to prevent its messages from being handled by a different consumer.
For more information, see Message Sessions.
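A rough sketch of session-based FIFO delivery with the azure-servicebus Python package follows; it assumes the queue was created with the "requires session" option, and the queue name and session ID are placeholders.

from azure.servicebus import ServiceBusClient, ServiceBusMessage

conn_str = "<service-bus-connection-string>"
queue = "workflow"   # assumed to be created with sessions enabled

with ServiceBusClient.from_connection_string(conn_str) as client:
    with client.get_queue_sender(queue_name=queue) as sender:
        # Messages that share a session_id are delivered in FIFO order.
        sender.send_messages([
            ServiceBusMessage("step-1", session_id="order-1234"),
            ServiceBusMessage("step-2", session_id="order-1234"),
        ])

    # Locking the session prevents other consumers from handling its messages.
    with client.get_queue_receiver(queue_name=queue, session_id="order-1234") as receiver:
        for msg in receiver.receive_messages(max_wait_time=5):
            print(str(msg))
            receiver.complete_message(msg)
        receiver.session.set_state(b"step-2-complete")   # checkpoint for the session
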
Message persistence
Service bus queues support temporal decoupling. Even when a consumer isn't available or unable to process the
message, it remains in the queue.
Checkpoint long-running transactions
Business transactions can run for a long time. Each operation in the transaction can have multiple messages. Use
checkpointing to coordinate the workflow and provide resiliency in case a transaction fails.
Service Bus queues allow checkpointing through the session state capability. State information is incrementally
recorded in the queue (SetState) for messages that belong to a session. For example, a consumer can track
progress by checking the state (GetState) every now and then. If a consumer fails, another consumer can use state
information to determine the last known checkpoint to resume the session.
Dead-letter queue (DLQ)
A Service Bus queue has a default subqueue, called the dead-letter queue (DLQ) to hold messages that couldn't be
delivered or processed. Service Bus or the message processing logic in the consumer can add messages to the
DLQ. The DLQ keeps the messages until they are retrieved from the queue.
Here are examples when a message can end up being in the DLQ:
A poison message is a message that cannot be handled because it's malformed or contains unexpected
information. In Service Bus queues, you can detect poison messages by setting the MaxDeliveryCount
property of the queue. If the number of times the same message is received exceeds that property value,
Service Bus moves the message to the DLQ.
A message might no longer be relevant if it isn't processed within a period. Service Bus queues allow the
producer to post messages with a time-to-live attribute. If this period expires before the message is
received, the message is placed in the DLQ.
Examine messages in the DLQ to determine the reason for failure.
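As an illustration, the sketch below dead-letters a message from the main queue and then reads it back from the DLQ by using the azure-servicebus Python package; the queue name and the reason text are placeholders.

from azure.servicebus import ServiceBusClient, ServiceBusSubQueue

conn_str = "<service-bus-connection-string>"
queue = "orders"

with ServiceBusClient.from_connection_string(conn_str) as client:
    # Explicitly dead-letter a message that the consumer can't process.
    with client.get_queue_receiver(queue_name=queue, max_wait_time=5) as receiver:
        for msg in receiver.receive_messages(max_wait_time=5):
            receiver.dead_letter_message(
                msg, reason="PoisonMessage",
                error_description="Body failed schema validation")

    # Examine the DLQ to determine the reason for each failure.
    with client.get_queue_receiver(
            queue_name=queue, sub_queue=ServiceBusSubQueue.DEAD_LETTER,
            max_wait_time=5) as dlq_receiver:
        for msg in dlq_receiver.receive_messages(max_wait_time=5):
            print(msg.dead_letter_reason, msg.dead_letter_error_description)
            dlq_receiver.complete_message(msg)
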
Hybrid solution
Service Bus bridges on-premises systems and cloud solutions. On-premises systems are often difficult to reach
because of firewall restrictions. Both the producer and consumer (either can be on-premises or the cloud) can use
the Service Bus queue endpoint as the pickup and drop off location for messages.
Topics and subscriptions
Service Bus supports the Publisher-Subscriber pattern through Service Bus topics and subscriptions.
This feature provides a way for the producer to broadcast messages to multiple consumers. When a topic receives
a message, it's forwarded to all the subscribed consumers. Optionally, a subscription can have filter criteria that
allows the consumer to get a subset of messages. Each consumer retrieves messages from a subscription in a
similar way to a queue.
For more information, see Azure Service Bus topics.
Azure Event Grid
Azure Event Grid is recommended for discrete events. Event Grid follows the Publisher-Subscriber pattern. When
event sources trigger events, they are published to Event Grid topics. Consumers of those events create Event Grid
subscriptions by specifying the event types and the event handler that will process the events. If there are no subscribers,
the events are discarded. Each event can have multiple subscriptions.
Push Model
Event Grid propagates messages to the subscribers in a push model. Suppose you have an event grid subscription
with a webhook. When a new event arrives, Event Grid posts the event to the webhook endpoint.
Integrated with Azure
Choose Event Grid if you want to get notifications about Azure resources. Many Azure services act as event sources
that have built-in Event Grid topics. Event Grid also supports various Azure services that can be configured as
event handlers. It's easy to subscribe to those topics to route events to event handlers of your choice. For example,
you can use Event Grid to invoke an Azure Function when a blob is created or deleted.
Custom topics
Create custom Event Grid topics, if you want to send events from your application or an Azure service that isn't
integrated with Event Grid.
For example, to see the progress of an entire business transaction, you want the participating services to raise
events as they are processing their individual business operations. A web app shows those events. One way is to
create a custom topic and add a subscription with your web app registered through an HTTP WebHook. As
business services send events to the custom topic, Event Grid pushes them to your web app.
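A minimal sketch of publishing to a custom topic with the azure-eventgrid Python package follows; the topic endpoint, access key, event type, and payload are placeholders.

from azure.core.credentials import AzureKeyCredential
from azure.eventgrid import EventGridPublisherClient, EventGridEvent

client = EventGridPublisherClient(
    "<custom-topic-endpoint>",   # the topic's "/api/events" endpoint from the portal
    AzureKeyCredential("<topic-access-key>"))

# Publish a business event; subscribers (such as the web app's webhook) can
# filter on the event type or subject.
client.send(EventGridEvent(
    event_type="Contoso.Orders.OrderShipped",
    subject="orders/1234",
    data={"orderId": 1234, "status": "shipped"},
    data_version="1.0"))
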
Filtered events
You can specify filters in a subscription to instruct Event Grid to route only a subset of events to a specific event
handler. The filters are specified in the subscription schema. Any event sent to the topic with values that match the
filter is automatically forwarded to that subscription.
For example, content in various formats is uploaded to Blob Storage. Each time a file is added, an event is raised
and published to Event Grid. The event subscription might have a filter that only sends events for images so that an
event handler can generate thumbnails.
For more information about filtering, see Filter events for Event Grid.
High throughput
Event Grid can route 10,000,000 events per second per region. The first 100,000 operations per month are free.
For cost considerations, see How much does Event Grid cost?
Resilient delivery
Even though successful delivery for events isn't as crucial as commands, you might still want some guarantee
depending on the type of event. Event Grid offers features that you can enable and customize, such as retry
policies, expiration time, and dead lettering. For more information, see Delivery and retry.
Event Grid's retry process can help resiliency but it's not fail-safe. In the retry process, Event Grid might deliver the
message more than once, skip, or delay some retries if the endpoint is unresponsive for a long time. For more
information, see Retry schedule and duration.
You can persist undelivered events to a blob storage account by enabling dead-lettering. There's a delay in
delivering the message to the blob storage endpoint and if that endpoint is unresponsive, then Event Grid discards
the event. For more information, see Dead letter and retry policies.
Azure Event Hubs
When working with an event stream, Azure Event Hubs is the recommended message broker. Essentially, it's a
large buffer that's capable of receiving large volumes of data with low latency. The received data can be read
quickly through concurrent operations. You can transform the received data by using any real-time analytics
provider. Event Hubs also provides the capability to store events in a storage account.
Fast ingestion
Event Hubs is capable of ingesting millions of events per second. The events are only appended to the stream and
are ordered by time.
Pull model
Like Event Grid, Event Hubs also offers Publisher-Subscriber capabilities. A key difference between Event Grid and
Event Hubs is in the way event data is made available to the subscribers. Event Grid pushes the ingested data to the
subscribers, whereas Event Hubs makes the data available in a pull model. As events are received, Event Hubs
appends them to the stream. A subscriber manages its cursor and can move forward and back in the stream, select
a time offset, and replay a sequence at its pace.
Stream processors are subscribers that pull data from Event Hubs for the purposes of transformation and
statistical analysis. Use Azure Stream Analytics and Apache Spark for complex processing such as aggregation over
time windows or anomaly detection.
If you want to act on each event per partition, you can pull the data by using an event processor host or by using a
built-in connector such as Logic Apps to provide the transformation logic. Another option is to use Azure Functions.
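The following sketch uses the azure-eventhub Python package to publish events with a partition key and to pull them back with a consumer callback. The connection string, event hub name, and payload are placeholders.

from azure.eventhub import EventHubProducerClient, EventHubConsumerClient, EventData

conn_str = "<event-hubs-namespace-connection-string>"
hub = "telemetry"

# Producer: events that share a partition key land in the same partition,
# which preserves their time order within that partition.
producer = EventHubProducerClient.from_connection_string(conn_str, eventhub_name=hub)
with producer:
    batch = producer.create_batch(partition_key="device-42")
    batch.add(EventData('{"temperature": 21.5}'))
    producer.send_batch(batch)

def on_event(partition_context, event):
    print(partition_context.partition_id, event.body_as_str())

# Consumer: pulls from the stream at its own pace; "-1" starts from the
# beginning of the stream. This call blocks until the process is interrupted.
consumer = EventHubConsumerClient.from_connection_string(
    conn_str, consumer_group="$Default", eventhub_name=hub)
with consumer:
    consumer.receive(on_event=on_event, starting_position="-1")
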
Partitioning
A partition is a portion of the event stream. The events are divided by using a partition key. For example, several
IoT devices send device data to an event hub. The partition key is the device identifier. As events are ingested, Event
Hubs moves them to separate partitions. Within each partition, all events are ordered by time.
A consumer is an instance of code that processes the event data. Event Hubs follows a partitioned consumer
pattern. Each consumer only reads a specific partition. Having multiple partitions results in faster processing
because the stream can be read concurrently by multiple consumers.
Instances of the same consumer make up a single consumer group. Multiple consumer groups can read the same
stream with different intentions. Suppose an event stream has data from a temperature sensor. One consumer
group can read the stream to detect anomalies such as a spike in temperature. Another can read the same stream
to calculate a rolling average temperature in a temporal window.
Event Hubs supports the Publisher-Subscriber pattern by allowing multiple consumer groups. Each consumer
group is a subscriber.
For more information about Event Hub partitioning, see Partitions.
Event Hubs Capture
The Capture feature allows you to store the event stream to Azure Blob storage or Data Lake Storage. This way
of storing events is reliable because even if the storage account isn't available, Capture keeps your data for a
period, and then writes to the storage after it's available.

Storage services can also offer additional features for analyzing events. For example, by taking advantage of
the access tiers of a blob storage account, you can store events in a hot tier for data that needs frequent access.
You might use that data for visualization. Alternately, you can store data in the archive tier and retrieve it
occasionally for auditing purposes.

Capture stores all events ingested by Event Hubs and is useful for batch processing. You can generate reports on
the data by using a MapReduce function. Captured data can also serve as the source of truth. If certain facts were
missed while aggregating the data, you can refer to the captured data.
For details about this feature, see Capture events through Azure Event Hubs in Azure Blob Storage or Azure Data
Lake Storage.
Support for Apache Kafka clients
Event Hubs provides an endpoint for Apache Kafka clients. Existing clients can update their configuration to point
to the endpoint and start sending events to Event Hubs. No code changes are required.
For more information, see Event Hubs for Apache Kafka.
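For instance, a configuration-only change for an existing Kafka producer might look like the following sketch, which uses the confluent-kafka Python client; the namespace, topic, and connection string are placeholders.

from confluent_kafka import Producer

# Point the Kafka client at the Event Hubs Kafka endpoint; the producer code
# itself doesn't change.
producer = Producer({
    "bootstrap.servers": "<namespace>.servicebus.windows.net:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "PLAIN",
    "sasl.username": "$ConnectionString",
    "sasl.password": "<event-hubs-namespace-connection-string>",
})

producer.produce("telemetry", key="device-42", value='{"temperature": 21.5}')
producer.flush()
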

Crossover scenarios
In some cases, it's advantageous to combine two messaging services.
Combining services can increase the efficiency of your messaging system. For instance, in your business
transaction, you use Azure Service Bus queues to handle messages. Queues that are mostly idle and receive
messages occasionally are inefficient because the consumer is constantly polling the queue for new messages. You
can set up an Event Grid subscription with an Azure Function as the event handler. Each time the queue receives a
message and there are no consumers listening, Event Grid sends a notification, which invokes the Azure Function
that drains the queue.

For details about connecting Service Bus to Event Grid, see Azure Service Bus to Event Grid integration overview.
The Enterprise integration on Azure using message queues and events reference architecture shows an
implementation of Service Bus to Event Grid integration.
Here's another example. Event Grid receives a set of events in which some events require a workflow while others
are for notification. The message metadata indicates the type of event. One way is to check the metadata by using
the filtering feature in the event subscription. If it requires a workflow, Event Grid sends it to Azure Service Bus
queue. The receivers of that queue can take necessary actions. The notification events are sent to Logic Apps to
send alert emails.

Related patterns
Consider these patterns when implementing asynchronous messaging:
Competing Consumers Pattern. Multiple consumers may need to compete to read messages from a queue. This
pattern explains how to process multiple messages concurrently to optimize throughput, to improve scalability
and availability, and to balance the workload.
Priority Queue Pattern. For cases where the business logic requires that some messages are processed before
others, this pattern describes how messages posted by a producer that have a higher priority can be received
and processed more quickly by a consumer than messages of a lower priority.
Queue-based Load Leveling Pattern. This pattern uses a message broker to act as a buffer between a producer
and a consumer to help to minimize the impact on availability and responsiveness of intermittent heavy loads
for both those entities.
Retry Pattern. A producer or consumer might be unable to connect to a queue, but the reasons for this failure may
be temporary and quickly pass. This pattern describes how to handle this situation to add resiliency to an
application.
Scheduler Agent Supervisor Pattern. Messaging is often used as part of a workflow implementation. This
pattern demonstrates how messaging can coordinate a set of actions across a distributed set of services and
other remote resources, and enable a system to recover and retry actions that fail.
Choreography pattern. This pattern shows how services can use messaging to control the workflow of a
business transaction.
Claim-Check Pattern. This pattern shows how to split a large message into a claim check and a payload.
Kafka on Azure
11/13/2020 • 9 minutes to read

Apache Kafka is an open-source distributed event streaming platform with the capability to publish, subscribe,
store, and process streams of events in a distributed and highly scalable manner. Kafka can be deployed on hardware,
virtual machines, or containers, both on-premises and in the cloud.
You can take advantage of Azure cloud capacity, cost, and flexibility by implementing Kafka on Azure.
This article presents options for implementing Kafka on Azure, evaluates their pros and cons, provides a feature
comparison, and offers a decision guidance matrix to help you select between the options.

Architecture
The following diagram shows what a typical Lambda architecture looks like with different Kafka on Azure options
for the ingestion phase and an exhaustive list of services from the Azure ecosystem supporting them.

Kafka on Azure options


To evaluate the options for Kafka on Azure, place them on a continuum between Infrastructure-as-a-Service (IaaS)
and Platform-as-a-Service (PaaS).
To evaluate the options, use a PaaS-first approach. First, consider whether your design requirements are met using
the PaaS offering, and if you come across a limitation, move on to the next offering in the list.
The following diagram summarizes Kafka on Azure options using the IaaS-PaaS continuum.

The following sections present pros and cons for each Kafka on Azure option.
Confluent Cloud PaaS solution
Confluent provides a fully managed Apache Kafka on Azure. The Confluent Cloud implementation of Kafka
abstracts the user from all the deployment, implementation, and management details while providing a pure
service. For more detailed information, see Supported features for Confluent Cloud.
Here are some of the advantages and limitations of using the Confluent Cloud solution:
Pros
The managed PaaS offering ensures easy usage with support.
Offers a rich feature set through the Confluent ecosystem.
The ability to purchase through the Azure Marketplace instead of entering a billing arrangement with a third
party.
Cons
Available in a limited set of Azure regions.
Missing geo-replication.
Limited network peering options available.
Tied in with Confluent licensing agreements.
Requires Enterprise pricing for an extensive feature set at additional cost.
Kafka version support and compatibility will always lag behind the published version.
Azure Event Hubs with Kafka PaaS solution
Azure Event Hubs provides a fully managed, cloud-native service that doesn't require you to configure servers,
disks, or networks. Azure Event Hubs is compatible with Apache Kafka client applications that use producer and
consumer APIs for Apache Kafka. This means that you can use Azure Event Hubs like Apache Kafka topics and can
send and receive messages by applying minor changes to the client configuration.
Azure Event Hubs includes features such as:
The ability to publish and consume events using HTTPS, AMQP 1.0, or Kafka 1.0+.
The Event Hubs Capture feature that saves streaming data into a Blob storage account or Azure Data Lake
Service account.
Clients using Event Hub SDKs act as intelligent consumer agents to simplify checkpointing, leasing, and
managing readers.
Here are some of the advantages and limitations of using the Azure Event Hubs solution:
Pros
Fully managed cloud-native service with ease of deployment, management, and monitoring.
Opens up the ability to integrate with different Azure services like:
Azure Stream Analytics
Azure Synapse Analytics
Azure Cosmos DB
Azure Data Lake Storage
Azure Machine Learning
Azure Functions
Azure Databricks
Integrated geo-disaster recovery and geo-replication is not available in most other offerings.
Available across Azure regions.
Cons
Limited feature parity with Apache Kafka. Native Kafka features like Transactions, Compression, Log Compaction, Kafka
Streams, and Kafka Connect (currently in Preview) aren't available in production.
Schema Registry, which is one of the most sought after offerings from Confluent, is in Public Preview.
Quotas and limits for Azure Event Hubs are restrictive.
The single tenant Dedicated Tier offering of Event Hubs with more quota and relaxed limits comes at a higher
price.
Apache Kafka on HDInsight managed cluster
Kafka on HDInsight is a managed platform that provides a simplified configuration process that is tested and
supported by Microsoft. HDInsight uses native Kafka APIs, which means that you don't need to change client
application code to use this solution.
Kafka on HDInsight includes features such as:
Azure Managed Disks as the backing store for Kafka, providing up to 16 TB of storage per Kafka broker.
Rebalancing of Kafka partitions and replicas across update domains and fault domains.
You can easily change the number of worker nodes that host the Kafka brokers after cluster creation.
Virtual Machine (VM)-level monitoring with disk and NIC metrics, and JMX metrics from Kafka is possible with
Azure Monitoring.
Extended throughput options at the cost of increased complexity and management.
Here are some of the advantages and limitations of using Apache Kafka on HDInsight:
Pros
Managed cluster offering that is customizable.
No license requirements, unless you use Confluent features.
Complete Kafka feature set and APIs are available.
High availability is ensured through the configuration of fault domains and update domains, with multi-region
support for replication.
Data retention is more configurable than the managed services mentioned above.
Cons
Kafka version support and compatibility always lag behind the published version.
Upgrading a cluster requires workload migration.
Disks can't be added to an existing cluster. To add more disk space, more cluster nodes must be added.
No public endpoint available, which makes virtual network integration mandatory and requires advanced
networking configurations.
Confluent Platform on Azure Marketplace IaaS solution
Confluent Platform is available as a bring-your-own-license offering in Azure Marketplace. It includes all
components from the Apache Kafka Core along with some Apache-licensed open-source additions like client tools,
pre-built connectors, and cluster-side services such as Schema Registry and REST Proxy.
Here are some of the advantages and limitations of using Confluent Platform on Azure Marketplace:
Pros
Automated cluster provisioning, management, and elastic scaling.
Complete support for all Confluent Platform components like Kafka Connect Workers, Kafka Streams Apps,
Schema Registry, REST Proxy, and Control Center.
Cons
The deployment available on Azure Marketplace is designed only for Development and Proof of concept (POC)
environments. Production deployment must be processed via email exchanges with [email protected].
Kafka version support and compatibility always lag behind the published version.
Deployment inside a virtual network requires peering to work with clients, which requires advanced network
configurations.
Bring Your Own License agreement.
Kafka on Azure Kubernetes Service IaaS solution
Azure Kubernetes Service (AKS) reduces the complexity and overhead of managing Kubernetes by offloading much
of that responsibility to Azure. Apache Kafka clusters are deployed on AKS in various configurations using the best
practices of Kubernetes operators like Strimzi or Confluent Kafka Operator.
Here are some of the advantages and limitations of using Kafka on AKS:
Pros
Uses standard Kubernetes deployment mechanisms, which are cloud agnostic.
Provides the complete Kafka feature set and APIs.
Supports additional features.
Cons
High availability and disaster recovery must be planned and configured manually.
Achieving the desired scale and throughput with an appropriate amount of resources requires significant custom
configuration.
Upgrades could be disruptive.
Lack of documentation and support for implementation in production.
Kafka on Azure VMs IaaS solution
The Apache Kafka software can be deployed on a group of Azure VMs to represent a Kafka cluster. The
configuration of the cluster is completely up to the user, but there are some recommended approaches.
Here are some of the advantages and limitations of using Kafka on Azure Virtual Machines:
Pros
Complete autonomy and control to configure and scale the Kafka cluster.
Numerous options to extend the deployment.
Cons
Selecting the right size of VMs is difficult.
You must implement your own scaling logic. User configurations determine efficacy.
Management and monitoring overhead.
Achieving the best performance demands knowledge of the ins and outs of Azure networking and Apache
Kafka.

Considerations
The following section presents the considerations for each of the Kafka on Azure solutions.
Feature comparison
The following table shows a comparison of features for each of the Kafka on Azure solutions this article discusses:

Features | Confluent Cloud | Event Hubs + Kafka head | HDInsight Kafka | Confluent Platform on Marketplace | Kafka on AKS | Kafka on Azure VMs
PaaS offering | Fully managed, from an external vendor | Managed PaaS offering that supports the Kafka protocol | Azure Hortonworks cluster offering | Marketplace offering | Open-source Apache Kafka implementation on your own compute platform | Open-source Apache Kafka implementation on your own compute platform
Model | PaaS (external vendor) | PaaS | Managed IaaS | IaaS | IaaS | IaaS
Third-party licensing | Purchase from Confluent | Not needed | Not mandatory | Bring Your Own License (BYOL) | Not mandatory | Not mandatory
Compute platform | PaaS, not applicable | PaaS, not applicable | Managed disks and VMs | VMs | AKS cluster | VMs
Feature set | See supported features | See supported features | Complete for available versions and extendable | Complete for available versions, plus Schema Registry, Connectors, KSQL | Complete and extendable | Complete and extendable
Data retention | 5 TB | Max 90 days; 10 TB included per capacity unit | Customizable | Customizable | Customizable | Customizable
Schema Registry | Yes | Public Preview | Yes | Yes | Yes | Yes
Kafka Connectors API | Yes | Preview | Yes | Yes | Yes | Yes
Azure Stack options | N/A | Preview | Not available | Yes | Preview | Yes
Virtual network/public deployment | Virtual Private Cloud | Public endpoint always exposed; virtual network availability in Preview | Virtual network deployment | Virtual network deployment recommended | Virtual network deployment | Virtual network deployment available
Configurable message retention beyond 7 days | Yes | Yes | Yes | Yes | Yes | Yes
Idempotency | Yes | Yes | Yes | Yes | Yes | Yes
Transactions | Yes | In preview | Yes | Yes | Yes | Yes
Kafka Streams | Yes | In preview | Yes | Yes | Yes | Yes
Integrated logging and monitoring | Yes | Yes | Can be added | Yes | Can be added | Can be added
Kafka version support | Yes | 1.0+ | 2.1, 1.1, 1.0, 0.10.1, 0.9.0 | 2.2.0, 2.0.1 | Yes | Yes
Java client | Yes | Yes | Yes | Yes | Yes | Yes
.NET/Python/C++ client | Yes | Yes | Yes | Yes | Yes | Yes
Declarative stream processing | Yes | Can be added using another Azure service | Yes | Yes | Yes | Yes
Log compaction | Yes | In preview | Yes | Yes | Yes | Yes
REST Proxy | Yes | Yes | Yes | Yes | Yes | Yes
Geo-replication | Must be configured | Integrated | Integrated | Must be configured | Must be configured | Must be configured
Authentication | Secure Sockets Layer (SSL) or Simple Authentication and Security Layer (SASL) | SSL, SASL Plain | SSL | SSL or SASL | SSL or SASL | SSL or SASL
Authorization | Access Control List (ACL) based | Role-based access control (RBAC) | RBAC in Preview | ACL based | ACL based | ACL based
Encryption | SSL/TLS | SSL; Bring Your Own Key (BYOK) in Preview | SSE/TLS, customer-managed disk encryption | SSL/TLS | SSL/TLS | SSL/TLS
Automatic data balancer | N/A (PaaS) | N/A (PaaS) | Yes | Configure the Self-Balancing feature | User must configure | User must configure
Monitoring metrics | Yes | Yes | Can be added | Yes | Can be added | Can be added
Monitoring tools | Confluent Control Center | Azure Monitoring | Azure Monitoring and open-source tools | Confluent Control Center | Azure Monitoring and open-source tools | Can be added
Managed service | Yes | Yes | Managed clusters | Managed deployment | No (IaaS) | No (IaaS)
Non-disruptive upgrades | Yes | Yes | No | No | No | No
Data replication | Confluent Replicator | MirrorMaker or similar | Managed disk export, or MirrorMaker or similar | Confluent Replicator | MirrorMaker or similar | MirrorMaker or similar
Kafka on Azure decision guidance matrix


The decision guidance matrix rates each option on a three-level scale: not supported natively (or not applicable),
supported with limitations (via extensions or custom code), or full native support. The ratings are conveyed as icons
in the matrix graphic, which compares the options across the following criteria: ease of configuration and
deployment, simplicity of maintenance, Kafka native support, Java/.NET/Python clients, 100% managed solution,
stream processing support, portability to other clouds, failure recovery, low-latency requirements, high-throughput
requirements, ecosystem support, extensibility and flexibility, supportability, full feature parity, geo-replication,
data replication, Kafka version support, integrated logging and monitoring, Confluent Schema Registry,
configurable message retention, and non-disruptive upgrades.
Next steps
Azure Event Grid documentation
What is Apache Kafka in Azure HDInsight
Confluent Cloud
Confluent Platform

Related resources
Partitioning in Event Hubs and Kafka
Web API design
11/2/2020 • 28 minutes to read

Most modern web applications expose APIs that clients can use to interact with the application. A well-designed
web API should aim to support:
Platform independence . Any client should be able to call the API, regardless of how the API is
implemented internally. This requires using standard protocols, and having a mechanism whereby the
client and the web service can agree on the format of the data to exchange.
Ser vice evolution . The web API should be able to evolve and add functionality independently from client
applications. As the API evolves, existing client applications should continue to function without
modification. All functionality should be discoverable so that client applications can fully use it.
This guidance describes issues that you should consider when designing a web API.

Introduction to REST
In 2000, Roy Fielding proposed Representational State Transfer (REST) as an architectural approach to designing
web services. REST is an architectural style for building distributed systems based on hypermedia. REST is
independent of any underlying protocol and is not necessarily tied to HTTP. However, most common REST
implementations use HTTP as the application protocol, and this guide focuses on designing REST APIs for HTTP.
A primary advantage of REST over HTTP is that it uses open standards, and does not bind the implementation of
the API or the client applications to any specific implementation. For example, a REST web service could be written
in ASP.NET, and client applications can use any language or toolset that can generate HTTP requests and parse
HTTP responses.
Here are some of the main design principles of RESTful APIs using HTTP:
REST APIs are designed around resources, which are any kind of object, data, or service that can be
accessed by the client.
A resource has an identifier, which is a URI that uniquely identifies that resource. For example, the URI for a
particular customer order might be:

https://adventure-works.com/orders/1

Clients interact with a service by exchanging representations of resources. Many web APIs use JSON as the
exchange format. For example, a GET request to the URI listed above might return this response body:

{"orderId":1,"orderValue":99.90,"productId":1,"quantity":1}

REST APIs use a uniform interface, which helps to decouple the client and service implementations. For
REST APIs built on HTTP, the uniform interface includes using standard HTTP verbs to perform operations
on resources. The most common operations are GET, POST, PUT, PATCH, and DELETE.
REST APIs use a stateless request model. HTTP requests should be independent and may occur in any order,
so keeping transient state information between requests is not feasible. The only place where information is
stored is in the resources themselves, and each request should be an atomic operation. This constraint
enables web services to be highly scalable, because there is no need to retain any affinity between clients
and specific servers. Any server can handle any request from any client. That said, other factors can limit
scalability. For example, many web services write to a backend data store, which may be hard to scale out.
For more information about strategies to scale out a data store, see Horizontal, vertical, and functional data
partitioning.
REST APIs are driven by hypermedia links that are contained in the representation. For example, the
following shows a JSON representation of an order. It contains links to get or update the customer
associated with the order.

{
  "orderID": 3,
  "productID": 2,
  "quantity": 4,
  "orderValue": 16.60,
  "links": [
    { "rel": "customer", "href": "https://adventure-works.com/customers/3", "action": "GET" },
    { "rel": "customer", "href": "https://adventure-works.com/customers/3", "action": "PUT" }
  ]
}

In 2008, Leonard Richardson proposed the following maturity model for web APIs:
Level 0: Define one URI, and all operations are POST requests to this URI.
Level 1: Create separate URIs for individual resources.
Level 2: Use HTTP methods to define operations on resources.
Level 3: Use hypermedia (HATEOAS, described below).
Level 3 corresponds to a truly RESTful API according to Fielding's definition. In practice, many published web APIs
fall somewhere around level 2.

Organize the API around resources


Focus on the business entities that the web API exposes. For example, in an e-commerce system, the primary
entities might be customers and orders. Creating an order can be achieved by sending an HTTP POST request that
contains the order information. The HTTP response indicates whether the order was placed successfully or not.
When possible, resource URIs should be based on nouns (the resource) and not verbs (the operations on the
resource).

https://adventure-works.com/orders // Good

https://adventure-works.com/create-order // Avoid

A resource doesn't have to be based on a single physical data item. For example, an order resource might be
implemented internally as several tables in a relational database, but presented to the client as a single entity.
Avoid creating APIs that simply mirror the internal structure of a database. The purpose of REST is to model
entities and the operations that an application can perform on those entities. A client should not be exposed to the
internal implementation.
Entities are often grouped together into collections (orders, customers). A collection is a separate resource from
the item within the collection, and should have its own URI. For example, the following URI might represent the
collection of orders:

https://adventure-works.com/orders

Sending an HTTP GET request to the collection URI retrieves a list of items in the collection. Each item in the
collection also has its own unique URI. An HTTP GET request to the item's URI returns the details of that item.
Adopt a consistent naming convention in URIs. In general, it helps to use plural nouns for URIs that reference
collections. It's a good practice to organize URIs for collections and items into a hierarchy. For example,
/customers is the path to the customers collection, and /customers/5 is the path to the customer with ID equal to
5. This approach helps to keep the web API intuitive. Also, many web API frameworks can route requests based on
parameterized URI paths, so you could define a route for the path /customers/{id} .
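For illustration, here is a minimal sketch of such a routing hierarchy using ASP.NET Web API 2 attribute routing. The controller, action names, and in-memory sample data are hypothetical and not part of the guidance above; the point is only that the collection, item, and child-collection URIs map naturally onto parameterized routes.

using System.Web.Http;

[RoutePrefix("customers")]
public class CustomersController : ApiController
{
    // GET /customers - the collection resource.
    [Route("")]
    [HttpGet]
    public IHttpActionResult GetCustomers()
    {
        return Ok(new[] { new { id = 5, name = "Adventure Works Cycles" } });
    }

    // GET /customers/5 - an individual item within the collection.
    [Route("{id:int}")]
    [HttpGet]
    public IHttpActionResult GetCustomer(int id)
    {
        return Ok(new { id, name = "Adventure Works Cycles" });
    }

    // GET /customers/5/orders - a related collection, kept one level deep.
    [Route("{id:int}/orders")]
    [HttpGet]
    public IHttpActionResult GetCustomerOrders(int id)
    {
        return Ok(new[] { new { orderId = 99, customerId = id } });
    }
}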
Also consider the relationships between different types of resources and how you might expose these
associations. For example, the /customers/5/orders might represent all of the orders for customer 5. You could
also go in the other direction, and represent the association from an order back to a customer with a URI such as
/orders/99/customer . However, extending this model too far can become cumbersome to implement. A better
solution is to provide navigable links to associated resources in the body of the HTTP response message. This
mechanism is described in more detail in the section Use HATEOAS to enable navigation to related resources.
In more complex systems, it can be tempting to provide URIs that enable a client to navigate through several
levels of relationships, such as /customers/1/orders/99/products . However, this level of complexity can be difficult
to maintain and is inflexible if the relationships between resources change in the future. Instead, try to keep URIs
relatively simple. Once an application has a reference to a resource, it should be possible to use this reference to
find items related to that resource. The preceding query can be replaced with the URI /customers/1/orders to find
all the orders for customer 1, and then /orders/99/products to find the products in this order.

TIP
Avoid requiring resource URIs more complex than collection/item/collection.

Another factor is that all web requests impose a load on the web server. The more requests, the bigger the load.
Therefore, try to avoid "chatty" web APIs that expose a large number of small resources. Such an API may require
a client application to send multiple requests to find all of the data that it requires. Instead, you might want to
denormalize the data and combine related information into bigger resources that can be retrieved with a single
request. However, you need to balance this approach against the overhead of fetching data that the client doesn't
need. Retrieving large objects can increase the latency of a request and incur additional bandwidth costs. For more
information about these performance antipatterns, see Chatty I/O and Extraneous Fetching.
Avoid introducing dependencies between the web API and the underlying data sources. For example, if your data
is stored in a relational database, the web API doesn't need to expose each table as a collection of resources. In
fact, that's probably a poor design. Instead, think of the web API as an abstraction of the database. If necessary,
introduce a mapping layer between the database and the web API. That way, client applications are isolated from
changes to the underlying database scheme.
Finally, it might not be possible to map every operation implemented by a web API to a specific resource. You can
handle such non-resource scenarios through HTTP requests that invoke a function and return the results as an
HTTP response message. For example, a web API that implements simple calculator operations such as add and
subtract could provide URIs that expose these operations as pseudo resources and use the query string to specify
the parameters required. For example, a GET request to the URI /add?operand1=99&operand2=1 would return a
response message with the body containing the value 100. However, only use these forms of URIs sparingly.

Define operations in terms of HTTP methods


The HTTP protocol defines a number of methods that assign semantic meaning to a request. The common HTTP
methods used by most RESTful web APIs are:
GET retrieves a representation of the resource at the specified URI. The body of the response message contains
the details of the requested resource.
POST creates a new resource at the specified URI. The body of the request message provides the details of the
new resource. Note that POST can also be used to trigger operations that don't actually create resources.
PUT either creates or replaces the resource at the specified URI. The body of the request message specifies the
resource to be created or updated.
PATCH performs a partial update of a resource. The request body specifies the set of changes to apply to the
resource.
DELETE removes the resource at the specified URI.
The effect of a specific request should depend on whether the resource is a collection or an individual item. The
following table summarizes the common conventions adopted by most RESTful implementations using the e-
commerce example. Not all of these requests might be implemented—it depends on the specific scenario.

| Resource | POST | GET | PUT | DELETE |
| --- | --- | --- | --- | --- |
| /customers | Create a new customer | Retrieve all customers | Bulk update of customers | Remove all customers |
| /customers/1 | Error | Retrieve the details for customer 1 | Update the details of customer 1 if it exists | Remove customer 1 |
| /customers/1/orders | Create a new order for customer 1 | Retrieve all orders for customer 1 | Bulk update of orders for customer 1 | Remove all orders for customer 1 |

The differences between POST, PUT, and PATCH can be confusing.


A POST request creates a resource. The server assigns a URI for the new resource, and returns that URI to
the client. In the REST model, you frequently apply POST requests to collections. The new resource is added
to the collection. A POST request can also be used to submit data for processing to an existing resource,
without any new resource being created.
A PUT request creates a resource or updates an existing resource. The client specifies the URI for the
resource. The request body contains a complete representation of the resource. If a resource with this URI
already exists, it is replaced. Otherwise a new resource is created, if the server supports doing so. PUT
requests are most frequently applied to resources that are individual items, such as a specific customer,
rather than collections. A server might support updates but not creation via PUT. Whether to support
creation via PUT depends on whether the client can meaningfully assign a URI to a resource before it exists.
If not, then use POST to create resources and PUT or PATCH to update.
A PATCH request performs a partial update to an existing resource. The client specifies the URI for the
resource. The request body specifies a set of changes to apply to the resource. This can be more efficient
than using PUT, because the client only sends the changes, not the entire representation of the resource.
Technically PATCH can also create a new resource (by specifying a set of updates to a "null" resource), if the
server supports this.
PUT requests must be idempotent. If a client submits the same PUT request multiple times, the results should
always be the same (the same resource will be modified with the same values). POST and PATCH requests are not
guaranteed to be idempotent.
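To make the idempotency contrast concrete, the following hedged sketch (a hypothetical ASP.NET Web API 2 controller with an in-memory store, not part of the original guidance) implements PUT as create-or-replace: repeating the same request always leaves the stored order in the same state, while the status code distinguishes creation (201) from replacement (204).

using System.Collections.Concurrent;
using System.Net;
using System.Web.Http;

public class OrderDto
{
    public int ProductId { get; set; }
    public int Quantity { get; set; }
    public decimal OrderValue { get; set; }
}

public class OrdersController : ApiController
{
    // In-memory stand-in for a real repository.
    private static readonly ConcurrentDictionary<int, OrderDto> store =
        new ConcurrentDictionary<int, OrderDto>();

    // PUT /orders/10 - the client chooses the URI, so PUT can create here.
    [Route("orders/{id:int}")]
    [HttpPut]
    public IHttpActionResult PutOrder(int id, OrderDto order)
    {
        if (order == null)
        {
            return BadRequest("A complete order representation is required.");
        }

        bool existed = store.ContainsKey(id);

        // Replace the entire representation. Repeating the same request
        // produces the same stored state, which is what makes PUT idempotent.
        store[id] = order;

        return existed
            ? (IHttpActionResult)StatusCode(HttpStatusCode.NoContent)   // 204: replaced
            : Created($"/orders/{id}", order);                          // 201: created
    }
}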

Conform to HTTP semantics


This section describes some typical considerations for designing an API that conforms to the HTTP specification.
However, it doesn't cover every possible detail or scenario. When in doubt, consult the HTTP specifications.
Media types
As mentioned earlier, clients and servers exchange representations of resources. For example, in a POST request,
the request body contains a representation of the resource to create. In a GET request, the response body contains
a representation of the fetched resource.
In the HTTP protocol, formats are specified through the use of media types, also called MIME types. For non-binary
data, most web APIs support JSON (media type = application/json) and possibly XML (media type =
application/xml).
The Content-Type header in a request or response specifies the format of the representation. Here is an example of
a POST request that includes JSON data:

POST https://adventure-works.com/orders HTTP/1.1


Content-Type: application/json; charset=utf-8
Content-Length: 57

{"Id":1,"Name":"Gizmo","Category":"Widgets","Price":1.99}

If the server doesn't support the media type, it should return HTTP status code 415 (Unsupported Media Type).
A client request can include an Accept header that contains a list of media types the client will accept from the
server in the response message. For example:

GET https://adventure-works.com/orders/2 HTTP/1.1


Accept: application/json

If the server cannot match any of the media type(s) listed, it should return HTTP status code 406 (Not Acceptable).
GET methods
A successful GET method typically returns HTTP status code 200 (OK). If the resource cannot be found, the method
should return 404 (Not Found).
POST methods
If a POST method creates a new resource, it returns HTTP status code 201 (Created). The URI of the new resource
is included in the Location header of the response. The response body contains a representation of the resource.
If the method does some processing but does not create a new resource, the method can return HTTP status code
200 and include the result of the operation in the response body. Alternatively, if there is no result to return, the
method can return HTTP status code 204 (No Content) with no response body.
If the client puts invalid data into the request, the server should return HTTP status code 400 (Bad Request). The
response body can contain additional information about the error or a link to a URI that provides more details.
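A hedged example of these POST conventions, again assuming ASP.NET Web API 2 and an illustrative in-memory store (names such as GetOrderById are invented for this sketch): the action validates the input, returns 400 for bad data, and returns 201 with a Location header that points at the newly created resource.

using System.Collections.Concurrent;
using System.Threading;
using System.Web.Http;

public class NewOrder
{
    public int ProductId { get; set; }
    public int Quantity { get; set; }
}

public class OrdersController : ApiController
{
    private static readonly ConcurrentDictionary<int, NewOrder> store =
        new ConcurrentDictionary<int, NewOrder>();
    private static int nextId;

    // GET /orders/1 - named route used to build the Location header below.
    [Route("orders/{id:int}", Name = "GetOrderById")]
    [HttpGet]
    public IHttpActionResult GetOrder(int id)
    {
        NewOrder order;
        return store.TryGetValue(id, out order) ? (IHttpActionResult)Ok(order) : NotFound();
    }

    // POST /orders - create a new order in the collection.
    [Route("orders")]
    [HttpPost]
    public IHttpActionResult CreateOrder(NewOrder order)
    {
        if (order == null || order.Quantity <= 0)
        {
            // Invalid client input: 400 (Bad Request) with a short explanation.
            return BadRequest("An order with a positive quantity is required.");
        }

        int id = Interlocked.Increment(ref nextId);
        store[id] = order;

        // 201 (Created): the Location header carries the URI of the new
        // resource and the body contains its representation.
        return CreatedAtRoute("GetOrderById", new { id }, order);
    }
}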
PUT methods
If a PUT method creates a new resource, it returns HTTP status code 201 (Created), as with a POST method. If the
method updates an existing resource, it returns either 200 (OK) or 204 (No Content). In some cases, it might not
be possible to update an existing resource. In that case, consider returning HTTP status code 409 (Conflict).
Consider implementing bulk HTTP PUT operations that can batch updates to multiple resources in a collection. The
PUT request should specify the URI of the collection, and the request body should specify the details of the
resources to be modified. This approach can help to reduce chattiness and improve performance.
PATCH methods
With a PATCH request, the client sends a set of updates to an existing resource, in the form of a patch document.
The server processes the patch document to perform the update. The patch document doesn't describe the whole
resource, only a set of changes to apply. The specification for the PATCH method (RFC 5789) doesn't define a
particular format for patch documents. The format must be inferred from the media type in the request.
JSON is probably the most common data format for web APIs. There are two main JSON-based patch formats,
called JSON patch and JSON merge patch.
JSON merge patch is somewhat simpler. The patch document has the same structure as the original JSON
resource, but includes just the subset of fields that should be changed or added. In addition, a field can be deleted
by specifying null for the field value in the patch document. (That means merge patch is not suitable if the
original resource can have explicit null values.)
For example, suppose the original resource has the following JSON representation:

{
"name":"gizmo",
"category":"widgets",
"color":"blue",
"price":10
}

Here is a possible JSON merge patch for this resource:

{
"price":12,
"color":null,
"size":"small"
}

This tells the server to update price , delete color , and add size , while name and category are not modified.
For the exact details of JSON merge patch, see RFC 7396. The media type for JSON merge patch is
application/merge-patch+json .

Merge patch is not suitable if the original resource can contain explicit null values, due to the special meaning of
null in the patch document. Also, the patch document doesn't specify the order that the server should apply the
updates. That may or may not matter, depending on the data and the domain. JSON patch, defined in RFC 6902, is
more flexible. It specifies the changes as a sequence of operations to apply. Operations include add, remove,
replace, copy, and test (to validate values). The media type for JSON patch is application/json-patch+json .
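For comparison with the merge patch above, here is what an equivalent JSON patch (RFC 6902) document for the same change might look like. The example is illustrative only; each entry is an operation that the server applies in order.

[
  { "op": "replace", "path": "/price", "value": 12 },
  { "op": "remove", "path": "/color" },
  { "op": "add", "path": "/size", "value": "small" }
]

A document like this would be sent with the Content-Type header set to application/json-patch+json.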
Here are some typical error conditions that might be encountered when processing a PATCH request, along with
the appropriate HTTP status code.

| Error condition | HTTP status code |
| --- | --- |
| The patch document format isn't supported. | 415 (Unsupported Media Type) |
| Malformed patch document. | 400 (Bad Request) |
| The patch document is valid, but the changes can't be applied to the resource in its current state. | 409 (Conflict) |

DELETE methods
If the delete operation is successful, the web server should respond with HTTP status code 204, indicating that the
process has been successfully handled, but that the response body contains no further information. If the resource
doesn't exist, the web server can return HTTP 404 (Not Found).
Asynchronous operations
Sometimes a POST, PUT, PATCH, or DELETE operation might require processing that takes a while to complete. If
you wait for completion before sending a response to the client, it may cause unacceptable latency. If so, consider
making the operation asynchronous. Return HTTP status code 202 (Accepted) to indicate that the request was
accepted for processing but has not yet completed.
You should expose an endpoint that returns the status of an asynchronous request, so the client can monitor the
status by polling the status endpoint. Include the URI of the status endpoint in the Location header of the 202
response. For example:

HTTP/1.1 202 Accepted


Location: /api/status/12345

If the client sends a GET request to this endpoint, the response should contain the current status of the request.
Optionally, it could also include an estimated time to completion or a link to cancel the operation.

HTTP/1.1 200 OK
Content-Type: application/json

{
"status":"In progress",
"link": { "rel":"cancel", "method":"delete", "href":"/api/status/12345" }
}

If the asynchronous operation creates a new resource, the status endpoint should return status code 303 (See
Other) after the operation completes. In the 303 response, include a Location header that gives the URI of the new
resource:

HTTP/1.1 303 See Other


Location: /api/orders/12345

For more information, see Asynchronous Request-Reply pattern.
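The following sketch shows one way the 202 response might be produced in ASP.NET Web API 2. The status URI format, the OrderRequest type, and the background queue are hypothetical; the essential points are returning 202 (Accepted) immediately and placing the status endpoint's URI in the Location header.

using System;
using System.Net;
using System.Net.Http;
using System.Web.Http;

public class OrderRequest
{
    public int ProductId { get; set; }
    public int Quantity { get; set; }
}

public class OrdersController : ApiController
{
    // POST /orders - accept the request, queue the work, and respond at once.
    [Route("orders")]
    [HttpPost]
    public IHttpActionResult CreateOrder(OrderRequest order)
    {
        if (order == null)
        {
            return BadRequest("An order is required.");
        }

        // Hand the long-running work to a background processor (not shown)
        // and keep an identifier that the client can use to poll for status.
        string statusId = Guid.NewGuid().ToString("N");
        // backgroundQueue.Enqueue(statusId, order);   // hypothetical queue

        var response = Request.CreateResponse(HttpStatusCode.Accepted);
        // The Location header tells the client where to poll for progress.
        response.Headers.Location = new Uri($"/api/status/{statusId}", UriKind.Relative);
        return ResponseMessage(response);
    }
}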

Filter and paginate data


Exposing a collection of resources through a single URI can lead to applications fetching large amounts of data
when only a subset of the information is required. For example, suppose a client application needs to find all
orders with a cost over a specific value. It might retrieve all orders from the /orders URI and then filter these
orders on the client side. Clearly this process is highly inefficient. It wastes network bandwidth and processing
power on the server hosting the web API.
Instead, the API can allow passing a filter in the query string of the URI, such as /orders?minCost=n. The web API
is then responsible for parsing and handling the minCost parameter in the query string and returning the filtered
results on the server side.
GET requests over collection resources can potentially return a large number of items. You should design a web
API to limit the amount of data returned by any single request. Consider supporting query strings that specify the
maximum number of items to retrieve and a starting offset into the collection. For example:

/orders?limit=25&offset=50

Also consider imposing an upper limit on the number of items returned, to help prevent Denial of Service attacks.
To assist client applications, GET requests that return paginated data should also include some form of metadata
that indicates the total number of resources available in the collection.
You can use a similar strategy to sort data as it is fetched, by providing a sort parameter that takes a field name as
the value, such as /orders?sort=ProductID. However, this approach can have a negative effect on caching, because
query string parameters form part of the resource identifier used by many cache implementations as the key to
cached data.
You can extend this approach to limit the fields returned for each item, if each item contains a large amount of
data. For example, you could use a query string parameter that accepts a comma-delimited list of fields, such as
/orders?fields=ProductID,Quantity.
Give all optional parameters in query strings meaningful defaults. For example, set the limit parameter to 10
and the offset parameter to 0 if you implement pagination, set the sort parameter to the key of the resource if
you implement ordering, and set the fields parameter to all fields in the resource if you support projections.
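A minimal sketch of this pattern, assuming ASP.NET Web API 2 and an in-memory list in place of a real data store (the Order type, parameter names, and metadata shape are illustrative): optional query string parameters get meaningful defaults, the limit is capped to protect the service, filtering happens on the server, and the response carries pagination metadata alongside the page of items.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web.Http;

public class Order
{
    public int Id { get; set; }
    public decimal OrderValue { get; set; }
}

public class OrdersController : ApiController
{
    private const int DefaultLimit = 10;
    private const int MaxLimit = 100;   // upper bound to help prevent abuse

    // In-memory stand-in for the real data source.
    private static readonly List<Order> orders = new List<Order>();

    // GET /orders?limit=25&offset=50&minCost=20
    [Route("orders")]
    [HttpGet]
    public IHttpActionResult GetOrders(int limit = DefaultLimit, int offset = 0, decimal? minCost = null)
    {
        // Clamp client-supplied paging values to sensible ranges.
        limit = Math.Max(1, Math.Min(limit, MaxLimit));
        offset = Math.Max(0, offset);

        IEnumerable<Order> query = orders;
        if (minCost.HasValue)
        {
            // Filtering happens on the server, not in the client.
            query = query.Where(o => o.OrderValue >= minCost.Value);
        }

        List<Order> page = query.OrderBy(o => o.Id)
                                .Skip(offset)
                                .Take(limit)
                                .ToList();

        // Include metadata about the whole (filtered) collection alongside the page.
        return Ok(new { total = query.Count(), limit, offset, items = page });
    }
}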

Support partial responses for large binary resources


A resource may contain large binary fields, such as files or images. To overcome problems caused by unreliable
and intermittent connections and to improve response times, consider enabling such resources to be retrieved in
chunks. To do this, the web API should support the Accept-Ranges header for GET requests for large resources.
This header indicates that the GET operation supports partial requests. The client application can submit GET
requests that return a subset of a resource, specified as a range of bytes.
Also, consider implementing HTTP HEAD requests for these resources. A HEAD request is similar to a GET request,
except that it only returns the HTTP headers that describe the resource, with an empty message body. A client
application can issue a HEAD request to determine whether to fetch a resource by using partial GET requests. For
example:

HEAD https://adventure-works.com/products/10?fields=productImage HTTP/1.1

Here is an example response message:

HTTP/1.1 200 OK

Accept-Ranges: bytes
Content-Type: image/jpeg
Content-Length: 4580

The Content-Length header gives the total size of the resource, and the Accept-Ranges header indicates that the
corresponding GET operation supports partial results. The client application can use this information to retrieve
the image in smaller chunks. The first request fetches the first 2500 bytes by using the Range header:

GET https://adventure-works.com/products/10?fields=productImage HTTP/1.1


Range: bytes=0-2499

The response message indicates that this is a partial response by returning HTTP status code 206. The Content-
Length header specifies the actual number of bytes returned in the message body (not the size of the resource),
and the Content-Range header indicates which part of the resource this is (bytes 0-2499 out of 4580):

HTTP/1.1 206 Partial Content

Accept-Ranges: bytes
Content-Type: image/jpeg
Content-Length: 2500
Content-Range: bytes 0-2499/4580

[...]

A subsequent request from the client application can retrieve the remainder of the resource.
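One possible server-side implementation of partial GET requests in ASP.NET Web API 2 uses the ByteRangeStreamContent helper, as in the sketch below. The controller, route, and image-loading helper are hypothetical; the behavior to note is the Accept-Ranges header on full responses, 206 (Partial Content) for satisfiable ranges, and 416 when the requested range doesn't overlap the resource.

using System.IO;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Web.Http;

public class ProductImagesController : ApiController
{
    // GET /products/10/image - supports partial requests via the Range header.
    [Route("products/{id:int}/image")]
    [HttpGet]
    public IHttpActionResult GetImage(int id)
    {
        Stream imageStream = OpenImageStream(id);   // hypothetical helper

        if (Request.Headers.Range == null)
        {
            // No Range header: return the whole resource with 200 (OK) and
            // advertise that range requests are supported.
            var full = new HttpResponseMessage(HttpStatusCode.OK)
            {
                Content = new StreamContent(imageStream)
            };
            full.Content.Headers.ContentType = new MediaTypeHeaderValue("image/jpeg");
            full.Headers.AcceptRanges.Add("bytes");
            return ResponseMessage(full);
        }

        try
        {
            // ByteRangeStreamContent slices the stream and sets Content-Range,
            // so the response becomes a 206 (Partial Content).
            var partial = new HttpResponseMessage(HttpStatusCode.PartialContent)
            {
                Content = new ByteRangeStreamContent(imageStream, Request.Headers.Range, "image/jpeg")
            };
            partial.Headers.AcceptRanges.Add("bytes");
            return ResponseMessage(partial);
        }
        catch (InvalidByteRangeException)
        {
            imageStream.Dispose();
            // The requested range doesn't overlap the resource.
            return StatusCode(HttpStatusCode.RequestedRangeNotSatisfiable);
        }
    }

    private static Stream OpenImageStream(int id)
    {
        // Placeholder: a real implementation would load the image bytes from storage.
        return new MemoryStream(new byte[4580]);
    }
}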
Use HATEOAS to enable navigation to related resources
One of the primary motivations behind REST is that it should be possible to navigate the entire set of resources
without requiring prior knowledge of the URI scheme. Each HTTP GET request should return the information
necessary to find the resources related directly to the requested object through hyperlinks included in the
response, and it should also be provided with information that describes the operations available on each of these
resources. This principle is known as HATEOAS, or Hypertext as the Engine of Application State. The system is
effectively a finite state machine, and the response to each request contains the information necessary to move
from one state to another; no other information should be necessary.

NOTE
Currently there are no general-purpose standards that define how to model the HATEOAS principle. The examples shown in
this section illustrate one possible, proprietary solution.

For example, to handle the relationship between an order and a customer, the representation of an order could
include links that identify the available operations for the customer of the order. Here is a possible representation:

{
  "orderID":3,
  "productID":2,
  "quantity":4,
  "orderValue":16.60,
  "links":[
    {
      "rel":"customer",
      "href":"https://adventure-works.com/customers/3",
      "action":"GET",
      "types":["text/xml","application/json"]
    },
    {
      "rel":"customer",
      "href":"https://adventure-works.com/customers/3",
      "action":"PUT",
      "types":["application/x-www-form-urlencoded"]
    },
    {
      "rel":"customer",
      "href":"https://adventure-works.com/customers/3",
      "action":"DELETE",
      "types":[]
    },
    {
      "rel":"self",
      "href":"https://adventure-works.com/orders/3",
      "action":"GET",
      "types":["text/xml","application/json"]
    },
    {
      "rel":"self",
      "href":"https://adventure-works.com/orders/3",
      "action":"PUT",
      "types":["application/x-www-form-urlencoded"]
    },
    {
      "rel":"self",
      "href":"https://adventure-works.com/orders/3",
      "action":"DELETE",
      "types":[]
    }
  ]
}
In this example, the links array has a set of links. Each link represents an operation on a related entity. The data
for each link includes the relationship ("customer"), the URI (https://adventure-works.com/customers/3), the HTTP
method, and the supported MIME types. This is all the information that a client application needs to be able to
invoke the operation.
The links array also includes self-referencing information about the resource itself that has been retrieved. These
have the relationship self.
The set of links that are returned may change, depending on the state of the resource. This is what is meant by
hypertext being the "engine of application state."

Versioning a RESTful web API


It is highly unlikely that a web API will remain static. As business requirements change new collections of
resources may be added, the relationships between resources might change, and the structure of the data in
resources might be amended. While updating a web API to handle new or differing requirements is a relatively
straightforward process, you must consider the effects that such changes will have on client applications
consuming the web API. The issue is that although the developer designing and implementing a web API has full
control over that API, the developer does not have the same degree of control over client applications, which may
be built by third-party organizations operating remotely. The primary imperative is to enable existing client
applications to continue functioning unchanged while allowing new client applications to take advantage of new
features and resources.
Versioning enables a web API to indicate the features and resources that it exposes, and a client application can
submit requests that are directed to a specific version of a feature or resource. The following sections describe
several different approaches, each of which has its own benefits and trade-offs.
No versioning
This is the simplest approach, and may be acceptable for some internal APIs. Significant changes could be
represented as new resources or new links. Adding content to existing resources might not present a breaking
change as client applications that are not expecting to see this content will ignore it.
For example, a request to the URI https://adventure-works.com/customers/3 should return the details of a single
customer containing id , name , and address fields expected by the client application:

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{"id":3,"name":"Contoso LLC","address":"1 Microsoft Way Redmond WA 98053"}

NOTE
For simplicity, the example responses shown in this section do not include HATEOAS links.

If the DateCreated field is added to the schema of the customer resource, then the response would look like this:

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{"id":3,"name":"Contoso LLC","dateCreated":"2014-09-04T12:11:38.0376089Z","address":"1 Microsoft Way Redmond


WA 98053"}

Existing client applications might continue functioning correctly if they are capable of ignoring unrecognized
fields, while new client applications can be designed to handle this new field. However, if more radical changes to
the schema of resources occur (such as removing or renaming fields) or the relationships between resources
change then these may constitute breaking changes that prevent existing client applications from functioning
correctly. In these situations, you should consider one of the following approaches.
URI versioning
Each time you modify the web API or change the schema of resources, you add a version number to the URI for
each resource. The previously existing URIs should continue to operate as before, returning resources that
conform to their original schema.
Extending the previous example, if the address field is restructured into subfields containing each constituent part
of the address (such as streetAddress , city , state , and zipCode ), this version of the resource could be exposed
through a URI containing a version number, such as https://adventure-works.com/v2/customers/3:

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{"id":3,"name":"Contoso LLC","dateCreated":"2014-09-04T12:11:38.0376089Z","address":{"streetAddress":"1
Microsoft Way","city":"Redmond","state":"WA","zipCode":98053}}

This versioning mechanism is very simple but depends on the server routing the request to the appropriate
endpoint. However, it can become unwieldy as the web API matures through several iterations and the server has
to support a number of different versions. Also, from a purist's point of view, in all cases the client applications are
fetching the same data (customer 3), so the URI should not really be different depending on the version. This
scheme also complicates implementation of HATEOAS as all links will need to include the version number in their
URIs.
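As a hedged illustration of URI versioning with ASP.NET Web API 2 attribute routing (the controller, DTO names, and sample data below are invented for this sketch), the unversioned and /v1 routes continue to serve the original schema while /v2 serves the restructured one:

using System.Web.Http;

// Version 1 keeps the original flat address field.
public class CustomerV1
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Address { get; set; }
}

// Version 2 restructures the address into subfields.
public class AddressV2
{
    public string StreetAddress { get; set; }
    public string City { get; set; }
    public string State { get; set; }
    public string ZipCode { get; set; }
}

public class CustomerV2
{
    public int Id { get; set; }
    public string Name { get; set; }
    public AddressV2 Address { get; set; }
}

public class CustomersController : ApiController
{
    // GET /customers/3 and /v1/customers/3 return the original schema.
    [Route("customers/{id:int}")]
    [Route("v1/customers/{id:int}")]
    [HttpGet]
    public IHttpActionResult GetCustomerV1(int id)
    {
        return Ok(new CustomerV1 { Id = id, Name = "Contoso LLC", Address = "1 Microsoft Way Redmond WA 98053" });
    }

    // GET /v2/customers/3 returns the restructured schema.
    [Route("v2/customers/{id:int}")]
    [HttpGet]
    public IHttpActionResult GetCustomerV2(int id)
    {
        return Ok(new CustomerV2
        {
            Id = id,
            Name = "Contoso LLC",
            Address = new AddressV2 { StreetAddress = "1 Microsoft Way", City = "Redmond", State = "WA", ZipCode = "98053" }
        });
    }
}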
Query string versioning
Rather than providing multiple URIs, you can specify the version of the resource by using a parameter within the
query string appended to the HTTP request, such as https://adventure-works.com/customers/3?version=2. The
version parameter should default to a meaningful value such as 1 if it is omitted by older client applications.
This approach has the semantic advantage that the same resource is always retrieved from the same URI, but it
depends on the code that handles the request to parse the query string and send back the appropriate HTTP
response. This approach also suffers from the same complications for implementing HATEOAS as the URI
versioning mechanism.

NOTE
Some older web browsers and web proxies will not cache responses for requests that include a query string in the URI. This
can degrade performance for web applications that use a web API and that run from within such a web browser.

Header versioning
Rather than appending the version number as a query string parameter, you could implement a custom header
that indicates the version of the resource. This approach requires that the client application adds the appropriate
header to any requests, although the code handling the client request could use a default value (version 1) if the
version header is omitted. The following examples use a custom header named Custom-Header. The value of this
header indicates the version of the web API.
Version 1:

GET https://adventure-works.com/customers/3 HTTP/1.1


Custom-Header: api-version=1
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{"id":3,"name":"Contoso LLC","address":"1 Microsoft Way Redmond WA 98053"}

Version 2:

GET https://adventure-works.com/customers/3 HTTP/1.1


Custom-Header: api-version=2

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{"id":3,"name":"Contoso LLC","dateCreated":"2014-09-04T12:11:38.0376089Z","address":{"streetAddress":"1
Microsoft Way","city":"Redmond","state":"WA","zipCode":98053}}

As with the previous two approaches, implementing HATEOAS requires including the appropriate custom header
in any links.
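A sketch of how the custom header might be read on the server, assuming ASP.NET Web API 2 (the header name follows the example above; the representation-building helpers are hypothetical). Requests without the header fall back to version 1:

using System.Linq;
using System.Web.Http;

public class CustomersController : ApiController
{
    // GET /customers/3 with an optional "Custom-Header: api-version=N" header.
    [Route("customers/{id:int}")]
    [HttpGet]
    public IHttpActionResult GetCustomer(int id)
    {
        // Default to version 1 when the header is omitted or unparseable.
        int version = 1;
        if (Request.Headers.TryGetValues("Custom-Header", out var values))
        {
            string raw = values.FirstOrDefault() ?? string.Empty;
            if (raw.StartsWith("api-version="))
            {
                int.TryParse(raw.Substring("api-version=".Length), out version);
            }
        }

        return version >= 2
            ? (IHttpActionResult)Ok(BuildV2Representation(id))
            : Ok(BuildV1Representation(id));
    }

    private static object BuildV1Representation(int id) =>
        new { id, name = "Contoso LLC", address = "1 Microsoft Way Redmond WA 98053" };

    private static object BuildV2Representation(int id) =>
        new
        {
            id,
            name = "Contoso LLC",
            address = new { streetAddress = "1 Microsoft Way", city = "Redmond", state = "WA", zipCode = "98053" }
        };
}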
Media type versioning
When a client application sends an HTTP GET request to a web server it should stipulate the format of the content
that it can handle by using an Accept header, as described earlier in this guidance. Frequently the purpose of the
Accept header is to allow the client application to specify whether the body of the response should be XML, JSON,
or some other common format that the client can parse. However, it is possible to define custom media types that
include information enabling the client application to indicate which version of a resource it is expecting. The
following example shows a request that specifies an Accept header with the value application/vnd.adventure-
works.v1+json. The vnd.adventure-works.v1 element indicates to the web server that it should return version 1 of
the resource, while the json element specifies that the format of the response body should be JSON:

GET https://adventure-works.com/customers/3 HTTP/1.1


Accept: application/vnd.adventure-works.v1+json

The code handling the request is responsible for processing the Accept header and honoring it as far as possible
(the client application may specify multiple formats in the Accept header, in which case the web server can choose
the most appropriate format for the response body). The web server confirms the format of the data in the
response body by using the Content-Type header:

HTTP/1.1 200 OK
Content-Type: application/vnd.adventure-works.v1+json; charset=utf-8

{"id":3,"name":"Contoso LLC","address":"1 Microsoft Way Redmond WA 98053"}

If the Accept header does not specify any known media types, the web server could generate an HTTP 406 (Not
Acceptable) response message or return a message with a default media type.
This approach is arguably the purest of the versioning mechanisms and lends itself naturally to HATEOAS, which
can include the MIME type of related data in resource links.
NOTE
When you select a versioning strategy, you should also consider the implications on performance, especially caching on the
web server. The URI versioning and Query String versioning schemes are cache-friendly inasmuch as the same URI/query
string combination refers to the same data each time.
The Header versioning and Media Type versioning mechanisms typically require additional logic to examine the values in the
custom header or the Accept header. In a large-scale environment, many clients using different versions of a web API can
result in a significant amount of duplicated data in a server-side cache. This issue can become acute if a client application
communicates with a web server through a proxy that implements caching, and that only forwards a request to the web
server if it does not currently hold a copy of the requested data in its cache.

Open API Initiative


The Open API Initiative was created by an industry consortium to standardize REST API descriptions across
vendors. As part of this initiative, the Swagger 2.0 specification was renamed the OpenAPI Specification (OAS) and
brought under the Open API Initiative.
You may want to adopt OpenAPI for your web APIs. Some points to consider:
The OpenAPI Specification comes with a set of opinionated guidelines on how a REST API should be
designed. That has advantages for interoperability, but requires more care when designing your API to
conform to the specification.
OpenAPI promotes a contract-first approach, rather than an implementation-first approach. Contract-first
means you design the API contract (the interface) first and then write code that implements the contract.
Tools like Swagger can generate client libraries or documentation from API contracts. For example, see
ASP.NET Web API help pages using Swagger.

More information
Microsoft REST API guidelines. Detailed recommendations for designing public REST APIs.
Web API checklist. A useful list of items to consider when designing and implementing a web API.
Open API Initiative. Documentation and implementation details on Open API.
Web API implementation

A carefully designed RESTful web API defines the resources, relationships, and navigation schemes that are
accessible to client applications. When you implement and deploy a web API, you should consider the physical
requirements of the environment hosting the web API and the way in which the web API is constructed rather than
the logical structure of the data. This guidance focuses on best practices for implementing a web API and
publishing it to make it available to client applications. For detailed information about web API design, see Web API
design.

Processing requests
Consider the following points when you implement the code to handle requests.
GET, PUT, DELETE, HEAD, and PATCH actions should be idempotent
The code that implements these requests should be idempotent: the same request repeated over the
same resource should result in the same state. For example, sending multiple DELETE requests to the same URI
should have the same effect, although the HTTP status code in the response messages may be different. The first
DELETE request might return status code 204 (No Content), while a subsequent DELETE request might return status
code 404 (Not Found).

NOTE
The article Idempotency Patterns on Jonathan Oliver's blog provides an overview of idempotency and how it relates to data
management operations.

POST actions that create new resources should not have unrelated side -effects
If a POST request is intended to create a new resource, the effects of the request should be limited to the new
resource (and possibly any directly related resources if there is some sort of linkage involved). For example, in an e-
commerce system, a POST request that creates a new order for a customer might also amend inventory levels and
generate billing information, but it should not modify information not directly related to the order or have any
other side-effects on the overall state of the system.
Avoid implementing chatty POST, PUT, and DELETE operations
Support POST, PUT and DELETE requests over resource collections. A POST request can contain the details for
multiple new resources and add them all to the same collection, a PUT request can replace the entire set of
resources in a collection, and a DELETE request can remove an entire collection.
The OData support included in ASP.NET Web API 2 provides the ability to batch requests. A client application can
package up several web API requests and send them to the server in a single HTTP request, and receive a single
HTTP response that contains the replies to each request. For more information, see Introducing batch support in Web
API and Web API OData.
Follow the HTTP specification when sending a response
A web API must return messages that contain the correct HTTP status code to enable the client to determine how to
handle the result, the appropriate HTTP headers so that the client understands the nature of the result, and a
suitably formatted body to enable the client to parse the result.
For example, a POST operation should return status code 201 (Created) and the response message should include
the URI of the newly created resource in the Location header of the response message.
Support content negotiation
The body of a response message may contain data in a variety of formats. For example, an HTTP GET request could
return data in JSON, or XML format. When the client submits a request, it can include an Accept header that
specifies the data formats that it can handle. These formats are specified as media types. For example, a client that
issues a GET request that retrieves an image can specify an Accept header that lists the media types that the client
can handle, such as image/jpeg, image/gif, image/png . When the web API returns the result, it should format the
data by using one of these media types and specify the format in the Content-Type header of the response.
If the client does not specify an Accept header, then use a sensible default format for the response body. As an
example, the ASP.NET Web API framework defaults to JSON for text-based data.
Provide links to support HATEOAS -style navigation and discovery of resources
The HATEOAS approach enables a client to navigate and discover resources from an initial starting point. This is
achieved by using links containing URIs; when a client issues an HTTP GET request to obtain a resource, the
response should contain URIs that enable a client application to quickly locate any directly related resources. For
example, in a web API that supports an e-commerce solution, a customer may have placed many orders. When a
client application retrieves the details for a customer, the response should include links that enable the client
application to send HTTP GET requests that can retrieve these orders. Additionally, HATEOAS-style links should
describe the other operations (POST, PUT, DELETE, and so on) that each linked resource supports together with the
corresponding URI to perform each request. This approach is described in more detail in API design.
Currently there are no standards that govern the implementation of HATEOAS, but the following example illustrates
one possible approach. In this example, an HTTP GET request that finds the details for a customer returns a
response that includes HATEOAS links that reference the orders for that customer:

GET https://adventure-works.com/customers/2 HTTP/1.1


Accept: text/json
...

HTTP/1.1 200 OK
...
Content-Type: application/json; charset=utf-8
...
Content-Length: ...
{"CustomerID":2,"CustomerName":"Bert","Links":[
{"rel":"self",
"href":"https://adventure-works.com/customers/2",
"action":"GET",
"types":["text/xml","application/json"]},
{"rel":"self",
"href":"https://adventure-works.com/customers/2",
"action":"PUT",
"types":["application/x-www-form-urlencoded"]},
{"rel":"self",
"href":"https://adventure-works.com/customers/2",
"action":"DELETE",
"types":[]},
{"rel":"orders",
"href":"https://adventure-works.com/customers/2/orders",
"action":"GET",
"types":["text/xml","application/json"]},
{"rel":"orders",
"href":"https://adventure-works.com/customers/2/orders",
"action":"POST",
"types":["application/x-www-form-urlencoded"]}
]}

In this example, the customer data is represented by the Customer class shown in the following code snippet. The
HATEOAS links are held in the Links collection property:

public class Customer


{
public int CustomerID { get; set; }
public string CustomerName { get; set; }
public List<Link> Links { get; set; }
...
}

public class Link


{
public string Rel { get; set; }
public string Href { get; set; }
public string Action { get; set; }
public string [] Types { get; set; }
}

The HTTP GET operation retrieves the customer data from storage and constructs a Customer object, and then
populates the Links collection. The result is formatted as a JSON response message. Each link comprises the
following fields:
The relationship between the object being returned and the object described by the link. In this case self
indicates that the link is a reference back to the object itself (similar to a this pointer in many object-oriented
languages), and orders is the name of a collection containing the related order information.
The hyperlink ( Href ) for the object being described by the link in the form of a URI.
The type of HTTP request ( Action ) that can be sent to this URI.
The format of any data ( Types ) that should be provided in the HTTP request or that can be returned in the
response, depending on the type of the request.
The HATEOAS links shown in the example HTTP response indicate that a client application can perform the
following operations:
An HTTP GET request to the URI https://adventure-works.com/customers/2 to fetch the details of the customer
(again). The data can be returned as XML or JSON.
An HTTP PUT request to the URI https://adventure-works.com/customers/2 to modify the details of the customer.
The new data must be provided in the request message in x-www-form-urlencoded format.
An HTTP DELETE request to the URI https://adventure-works.com/customers/2 to delete the customer. The
request does not expect any additional information or return data in the response message body.
An HTTP GET request to the URI https://adventure-works.com/customers/2/orders to find all the orders for the
customer. The data can be returned as XML or JSON.
An HTTP POST request to the URI https://adventure-works.com/customers/2/orders to create a new order for this
customer. The data must be provided in the request message in x-www-form-urlencoded format.

Handling exceptions
Consider the following points if an operation throws an uncaught exception.
Capture exceptions and return a meaningful response to clients
The code that implements an HTTP operation should provide comprehensive exception handling rather than letting
uncaught exceptions propagate to the framework. If an exception makes it impossible to complete the operation
successfully, the exception can be passed back in the response message, but it should include a meaningful
description of the error that caused the exception. The exception should also include the appropriate HTTP status
code rather than simply returning status code 500 for every situation. For example, if a user request causes a
database update that violates a constraint (such as attempting to delete a customer that has outstanding orders),
you should return status code 409 (Conflict) and a message body indicating the reason for the conflict. If some
other condition renders the request unachievable, you can return status code 400 (Bad Request). You can find a full
list of HTTP status codes on the Status code definitions page on the W3C website.
The code example traps different conditions and returns an appropriate response.

[HttpDelete]
[Route("customers/{id:int}")]
public IHttpActionResult DeleteCustomer(int id)
{
try
{
// Find the customer to be deleted in the repository
var customerToDelete = repository.GetCustomer(id);

// If there is no such customer, return an error response


// with status code 404 (Not Found)
if (customerToDelete == null)
{
return NotFound();
}

// Remove the customer from the repository


// The DeleteCustomer method returns true if the customer
// was successfully deleted
if (repository.DeleteCustomer(id))
{
// Return a response message with status code 204 (No Content)
// To indicate that the operation was successful
return StatusCode(HttpStatusCode.NoContent);
}
else
{
// Otherwise return a 400 (Bad Request) error response
return BadRequest(Strings.CustomerNotDeleted);
}
}
catch
{
// If an uncaught exception occurs, return an error response
// with status code 500 (Internal Server Error)
return InternalServerError();
}
}

TIP
Do not include information that could be useful to an attacker attempting to penetrate your API.

Many web servers trap error conditions themselves before they reach the web API. For example, if you configure
authentication for a web site and the user fails to provide the correct authentication information, the web server
should respond with status code 401 (Unauthorized). Once a client has been authenticated, your code can perform
its own checks to verify that the client should be able to access the requested resource. If this authorization fails, you
should return status code 403 (Forbidden).
Handle exceptions consistently and log information about errors
To handle exceptions in a consistent manner, consider implementing a global error handling strategy across the
entire web API. You should also incorporate error logging which captures the full details of each exception; this
error log can contain detailed information as long as it is not made accessible over the web to clients.
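One way to implement such a strategy in ASP.NET Web API 2 is a globally registered exception filter, sketched below. The filter class and the logging target are illustrative; the point is that full details go to a server-side log while clients receive only a generic 500 response.

using System.Net;
using System.Net.Http;
using System.Web.Http;
using System.Web.Http.Filters;

// A global exception filter: logs full details server-side and returns a
// generic 500 response so internals are never leaked to clients.
public class GlobalExceptionLoggerFilter : ExceptionFilterAttribute
{
    public override void OnException(HttpActionExecutedContext context)
    {
        // Log the complete exception (stack trace, inner exceptions) to a store
        // that is not reachable from the web. Trace is a stand-in for a real logger.
        System.Diagnostics.Trace.TraceError(context.Exception.ToString());

        // Clients only see a generic message and the correct status code.
        context.Response = context.Request.CreateResponse(
            HttpStatusCode.InternalServerError,
            new { message = "An unexpected error occurred." });
    }
}

// Registration, typically in WebApiConfig.Register:
// config.Filters.Add(new GlobalExceptionLoggerFilter());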
Distinguish between client-side errors and server-side errors
The HTTP protocol distinguishes between errors that occur due to the client application (the HTTP 4xx status codes),
and errors that are caused by a mishap on the server (the HTTP 5xx status codes). Make sure that you respect this
convention in any error response messages.

Optimizing client-side data access


In a distributed environment such as that involving a web server and client applications, one of the primary sources
of concern is the network. This can act as a considerable bottleneck, especially if a client application is frequently
sending requests or receiving data. Therefore you should aim to minimize the amount of traffic that flows across
the network. Consider the following points when you implement the code to retrieve and maintain data:
Support client-side caching
The HTTP 1.1 protocol supports caching in clients and intermediate servers through which a request is routed by
the use of the Cache-Control header. When a client application sends an HTTP GET request to the web API, the
response can include a Cache-Control header that indicates whether the data in the body of the response can be
safely cached by the client or an intermediate server through which the request has been routed, and for how long
before it should expire and be considered out-of-date. The following example shows an HTTP GET request and the
corresponding response that includes a Cache-Control header:

GET https://adventure-works.com/orders/2 HTTP/1.1

HTTP/1.1 200 OK
...
Cache-Control: max-age=600, private
Content-Type: text/json; charset=utf-8
Content-Length: ...
{"orderID":2,"productID":4,"quantity":2,"orderValue":10.00}

In this example, the Cache-Control header specifies that the data returned should be expired after 600 seconds, and
is only suitable for a single client and must not be stored in a shared cache used by other clients (it is private). The
Cache-Control header could specify public rather than private in which case the data can be stored in a shared
cache, or it could specify no-store in which case the data must not be cached by the client. The following code
example shows how to construct a Cache-Control header in a response message:
public class OrdersController : ApiController
{
...
[Route("api/orders/{id:int:min(0)}")]
[HttpGet]
public IHttpActionResult FindOrderByID(int id)
{
// Find the matching order
Order order = ...;
...
// Create a Cache-Control header for the response
var cacheControlHeader = new CacheControlHeaderValue();
cacheControlHeader.Private = true;
cacheControlHeader.MaxAge = new TimeSpan(0, 10, 0);
...

// Return a response message containing the order and the cache control header
OkResultWithCaching<Order> response = new OkResultWithCaching<Order>(order, this)
{
CacheControlHeader = cacheControlHeader
};
return response;
}
...
}

This code uses a custom IHttpActionResult class named OkResultWithCaching . This class enables the controller to
set the cache header contents:

public class OkResultWithCaching<T> : OkNegotiatedContentResult<T>


{
public OkResultWithCaching(T content, ApiController controller)
: base(content, controller) { }

public OkResultWithCaching(T content, IContentNegotiator contentNegotiator, HttpRequestMessage request,


IEnumerable<MediaTypeFormatter> formatters)
: base(content, contentNegotiator, request, formatters) { }

public CacheControlHeaderValue CacheControlHeader { get; set; }


public EntityTagHeaderValue ETag { get; set; }

public override async Task<HttpResponseMessage> ExecuteAsync(CancellationToken cancellationToken)


{
HttpResponseMessage response;
try
{
response = await base.ExecuteAsync(cancellationToken);
response.Headers.CacheControl = this.CacheControlHeader;
response.Headers.ETag = ETag;
}
catch (OperationCanceledException)
{
response = new HttpResponseMessage(HttpStatusCode.Conflict) { ReasonPhrase = "Operation was cancelled" };
}
return response;
}
}
NOTE
The HTTP protocol also defines the no-cache directive for the Cache-Control header. Rather confusingly, this directive does
not mean "do not cache" but rather "revalidate the cached information with the server before returning it"; the data can still
be cached, but it is checked each time it is used to ensure that it is still current.

Cache management is the responsibility of the client application or intermediate server, but if properly
implemented it can save bandwidth and improve performance by removing the need to fetch data that has already
been recently retrieved.
The max-age value in the Cache-Control header is only a guide and not a guarantee that the corresponding data
won't change during the specified time. The web API should set the max-age to a suitable value depending on the
expected volatility of the data. When this period expires, the client should discard the object from the cache.

NOTE
Most modern web browsers support client-side caching by adding the appropriate cache-control headers to requests and
examining the headers of the results, as described. However, some older browsers will not cache the values returned from a
URL that includes a query string. This is not usually an issue for custom client applications which implement their own cache
management strategy based on the protocol discussed here.
Some older proxies exhibit the same behavior and might not cache requests based on URLs with query strings. This could be
an issue for custom client applications that connect to a web server through such a proxy.

Provide ETags to optimize query processing


When a client application retrieves an object, the response message can also include an ETag (Entity Tag). An ETag is
an opaque string that indicates the version of a resource; each time a resource changes the ETag is also modified.
This ETag should be cached as part of the data by the client application. The following code example shows how to
add an ETag as part of the response to an HTTP GET request. This code uses the GetHashCode method of an object
to generate a numeric value that identifies the object (you can override this method if necessary and generate your
own hash using an algorithm such as MD5):

public class OrdersController : ApiController


{
...
public IHttpActionResult FindOrderByID(int id)
{
// Find the matching order
Order order = ...;
...

var hashedOrder = order.GetHashCode();


string hashedOrderEtag = $"\"{hashedOrder}\"";
var eTag = new EntityTagHeaderValue(hashedOrderEtag);

// Return a response message containing the order and the cache control header
OkResultWithCaching<Order> response = new OkResultWithCaching<Order>(order, this)
{
...,
ETag = eTag
};
return response;
}
...
}

The response message posted by the web API looks like this:
HTTP/1.1 200 OK
...
Cache-Control: max-age=600, private
Content-Type: text/json; charset=utf-8
ETag: "2147483648"
Content-Length: ...
{"orderID":2,"productID":4,"quantity":2,"orderValue":10.00}

TIP
For security reasons, do not allow sensitive data or data returned over an authenticated (HTTPS) connection to be cached.

A client application can issue a subsequent GET request to retrieve the same resource at any time, and if the
resource has changed (it has a different ETag) the cached version should be discarded and the new version added
to the cache. If a resource is large and requires a significant amount of bandwidth to transmit back to the client,
repeated requests to fetch the same data can become inefficient. To combat this, the HTTP protocol defines the
following process for optimizing GET requests that you should support in a web API:
The client constructs a GET request containing the ETag for the currently cached version of the resource
referenced in an If-None-Match HTTP header:

GET https://adventure-works.com/orders/2 HTTP/1.1


If-None-Match: "2147483648"

The GET operation in the web API obtains the current ETag for the requested data (order 2 in the above
example), and compares it to the value in the If-None-Match header.
If the current ETag for the requested data matches the ETag provided by the request, the resource has not
changed and the web API should return an HTTP response with an empty message body and a status code
of 304 (Not Modified).
If the current ETag for the requested data does not match the ETag provided by the request, then the data has
changed and the web API should return an HTTP response with the new data in the message body and a
status code of 200 (OK).
If the requested data no longer exists then the web API should return an HTTP response with the status code
of 404 (Not Found).
The client uses the status code to maintain the cache. If the data has not changed (status code 304) then the
object can remain cached and the client application should continue to use this version of the object. If the
data has changed (status code 200) then the cached object should be discarded and the new one inserted. If
the data is no longer available (status code 404) then the object should be removed from the cache.

NOTE
If the response header contains the Cache-Control header no-store then the object should always be removed from the
cache regardless of the HTTP status code.

The code below shows the FindOrderByID method extended to support the If-None-Match header. Notice that if the
If-None-Match header is omitted, the specified order is always retrieved:
public class OrdersController : ApiController
{
    [Route("api/orders/{id:int:min(0)}")]
    [HttpGet]
    public IHttpActionResult FindOrderByID(int id)
    {
        try
        {
            // Find the matching order
            Order order = ...;

            // If there is no such order then return NotFound
            if (order == null)
            {
                return NotFound();
            }

            // Generate the ETag for the order
            var hashedOrder = order.GetHashCode();
            string hashedOrderEtag = $"\"{hashedOrder}\"";

            // Create the Cache-Control and ETag headers for the response
            IHttpActionResult response;
            var cacheControlHeader = new CacheControlHeaderValue();
            cacheControlHeader.Public = true;
            cacheControlHeader.MaxAge = new TimeSpan(0, 10, 0);
            var eTag = new EntityTagHeaderValue(hashedOrderEtag);

            // Retrieve the If-None-Match header from the request (if it exists)
            var nonMatchEtags = Request.Headers.IfNoneMatch;

            // If there is an ETag in the If-None-Match header and
            // this ETag matches that of the order just retrieved,
            // then create a Not Modified response message
            if (nonMatchEtags.Count > 0 &&
                String.CompareOrdinal(nonMatchEtags.First().Tag, hashedOrderEtag) == 0)
            {
                response = new EmptyResultWithCaching()
                {
                    StatusCode = HttpStatusCode.NotModified,
                    CacheControlHeader = cacheControlHeader,
                    ETag = eTag
                };
            }
            // Otherwise create a response message that contains the order details
            else
            {
                response = new OkResultWithCaching<Order>(order, this)
                {
                    CacheControlHeader = cacheControlHeader,
                    ETag = eTag
                };
            }

            return response;
        }
        catch
        {
            return InternalServerError();
        }
    }
    ...
}

This example incorporates an additional custom IHttpActionResult class named EmptyResultWithCaching . This class
simply acts as a wrapper around an HttpResponseMessage object that does not contain a response body:
public class EmptyResultWithCaching : IHttpActionResult
{
    public CacheControlHeaderValue CacheControlHeader { get; set; }
    public EntityTagHeaderValue ETag { get; set; }
    public HttpStatusCode StatusCode { get; set; }
    public Uri Location { get; set; }

    public async Task<HttpResponseMessage> ExecuteAsync(CancellationToken cancellationToken)
    {
        HttpResponseMessage response = new HttpResponseMessage(StatusCode);
        response.Headers.CacheControl = this.CacheControlHeader;
        response.Headers.ETag = this.ETag;
        response.Headers.Location = this.Location;
        return response;
    }
}

TIP
In this example, the ETag for the data is generated by hashing the data retrieved from the underlying data source. If the ETag
can be computed in some other way, then the process can be optimized further and the data only needs to be fetched from
the data source if it has changed. This approach is especially useful if the data is large or accessing the data source can result
in significant latency (for example, if the data source is a remote database).
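
The client side of this conditional GET exchange can be implemented with HttpClient. The following sketch is
illustrative only and is not part of the order service shown above: the in-memory dictionary and the CachedResource
class are hypothetical stand-ins for whatever caching mechanism the client application actually uses.

using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public class CachedResource
{
    public string ETag { get; set; }
    public string Body { get; set; }
}

public static class ConditionalGetClient
{
    private static readonly HttpClient client = new HttpClient();
    private static readonly Dictionary<string, CachedResource> cache =
        new Dictionary<string, CachedResource>();

    public static async Task<string> GetResourceAsync(string uri)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, uri);

        // If a copy is already cached, send its ETag in an If-None-Match header.
        if (cache.TryGetValue(uri, out CachedResource cached))
        {
            request.Headers.IfNoneMatch.Add(new EntityTagHeaderValue(cached.ETag));
        }

        HttpResponseMessage response = await client.SendAsync(request);

        if (response.StatusCode == HttpStatusCode.NotModified)
        {
            // 304: the cached copy is still current, so keep using it.
            return cached?.Body;
        }

        if (response.StatusCode == HttpStatusCode.NotFound)
        {
            // 404: the resource no longer exists, so evict it from the cache.
            cache.Remove(uri);
            return null;
        }

        response.EnsureSuccessStatusCode();

        // 200: cache the new representation together with its ETag.
        // (A production client would also honor Cache-Control directives such as no-store.)
        string body = await response.Content.ReadAsStringAsync();
        if (response.Headers.ETag != null)
        {
            cache[uri] = new CachedResource { ETag = response.Headers.ETag.Tag, Body = body };
        }

        return body;
    }
}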

Use ETags to Support Optimistic Concurrency


To enable updates over previously cached data, the HTTP protocol supports an optimistic concurrency strategy. If,
after fetching and caching a resource, the client application subsequently sends a PUT or DELETE request to change
or remove the resource, it should include an If-Match header that references the ETag. The web API can then use this
information to determine whether the resource has already been changed by another user since it was retrieved
and send an appropriate response back to the client application as follows:
The client constructs a PUT request containing the new details for the resource and the ETag for the currently
cached version of the resource referenced in an If-Match HTTP header. The following example shows a PUT
request that updates an order:

PUT https://adventure-works.com/orders/1 HTTP/1.1
If-Match: "2282343857"
Content-Type: application/x-www-form-urlencoded
Content-Length: ...

productID=3&quantity=5&orderValue=250

The PUT operation in the web API obtains the current ETag for the requested data (order 1 in the above
example), and compares it to the value in the If-Match header.
If the current ETag for the requested data matches the ETag provided by the request, the resource has not
changed and the web API should perform the update, returning a message with HTTP status code 204 (No
Content) if it is successful. The response can include Cache-Control and ETag headers for the updated
version of the resource. The response should always include the Location header that references the URI of
the newly updated resource.
If the current ETag for the requested data does not match the ETag provided by the request, then the data has
been changed by another user since it was fetched and the web API should return an HTTP response with an
empty message body and a status code of 412 (Precondition Failed).
If the resource to be updated no longer exists then the web API should return an HTTP response with the
status code of 404 (Not Found).
The client uses the status code and response headers to maintain the cache. If the data has been updated
(status code 204) then the object can remain cached (as long as the Cache-Control header does not specify
no-store) but the ETag should be updated. If the data was changed by another user (status code
412) or not found (status code 404) then the cached object should be discarded.
The next code example shows an implementation of the PUT operation for the Orders controller:

public class OrdersController : ApiController
{
    [HttpPut]
    [Route("api/orders/{id:int}")]
    public IHttpActionResult UpdateExistingOrder(int id, DTOOrder order)
    {
        try
        {
            var baseUri = Constants.GetUriFromConfig();
            var orderToUpdate = this.ordersRepository.GetOrder(id);
            if (orderToUpdate == null)
            {
                return NotFound();
            }

            var hashedOrder = orderToUpdate.GetHashCode();
            string hashedOrderEtag = $"\"{hashedOrder}\"";

            // Retrieve the If-Match header from the request (if it exists)
            var matchEtags = Request.Headers.IfMatch;

            // If there is an ETag in the If-Match header and
            // this ETag matches that of the order just retrieved,
            // or if there is no ETag, then update the Order
            if (((matchEtags.Count > 0 &&
                String.CompareOrdinal(matchEtags.First().Tag, hashedOrderEtag) == 0)) ||
                matchEtags.Count == 0)
            {
                // Modify the order
                orderToUpdate.OrderValue = order.OrderValue;
                orderToUpdate.ProductID = order.ProductID;
                orderToUpdate.Quantity = order.Quantity;

                // Save the order back to the data store
                // ...

                // Create the No Content response with Cache-Control, ETag, and Location headers
                var cacheControlHeader = new CacheControlHeaderValue();
                cacheControlHeader.Private = true;
                cacheControlHeader.MaxAge = new TimeSpan(0, 10, 0);

                hashedOrder = order.GetHashCode();
                hashedOrderEtag = $"\"{hashedOrder}\"";
                var eTag = new EntityTagHeaderValue(hashedOrderEtag);

                var location = new Uri($"{baseUri}/{Constants.ORDERS}/{id}");

                var response = new EmptyResultWithCaching()
                {
                    StatusCode = HttpStatusCode.NoContent,
                    CacheControlHeader = cacheControlHeader,
                    ETag = eTag,
                    Location = location
                };

                return response;
            }

            // Otherwise return a Precondition Failed response
            return StatusCode(HttpStatusCode.PreconditionFailed);
        }
        catch
        {
            return InternalServerError();
        }
    }
    ...
}

TIP
Use of the If-Match header is entirely optional, and if it is omitted the web API will always attempt to update the specified
order, possibly blindly overwriting an update made by another user. To avoid problems due to lost updates, always provide an
If-Match header.
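
On the client side, sending the If-Match header with HttpClient might look like the following sketch.
TryUpdateOrderAsync is a hypothetical helper and is not part of the controller code above; the caller is assumed to
have kept the ETag that was returned when the order was originally fetched.

using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

public static class ConditionalUpdateClient
{
    private static readonly HttpClient client = new HttpClient();

    // Attempts to update the order at the given URI, but only if the server-side
    // version still matches the ETag captured when the order was last fetched.
    public static async Task<bool> TryUpdateOrderAsync(string uri, string cachedETag, string formData)
    {
        var request = new HttpRequestMessage(HttpMethod.Put, uri)
        {
            Content = new StringContent(formData, Encoding.UTF8, "application/x-www-form-urlencoded")
        };
        request.Headers.IfMatch.Add(new EntityTagHeaderValue(cachedETag));

        HttpResponseMessage response = await client.SendAsync(request);

        switch (response.StatusCode)
        {
            case HttpStatusCode.NoContent:
                // 204: the update succeeded. Refresh the cached copy and its ETag
                // from the response headers before returning.
                return true;
            case HttpStatusCode.PreconditionFailed:
                // 412: another user changed the order first. Discard the cached copy,
                // re-fetch the order, and let the user decide how to merge the changes.
                return false;
            case HttpStatusCode.NotFound:
                // 404: the order no longer exists, so remove it from the cache.
                return false;
            default:
                response.EnsureSuccessStatusCode();
                return true;
        }
    }
}

For example, the PUT request shown earlier corresponds to a call such as
TryUpdateOrderAsync("https://adventure-works.com/orders/1", "\"2282343857\"", "productID=3&quantity=5&orderValue=250").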

Handling large requests and responses


There may be occasions when a client application needs to issue requests that send or receive data that may be
several megabytes (or bigger) in size. Waiting while this amount of data is transmitted could cause the client
application to become unresponsive. Consider the following points when you need to handle requests that include
significant amounts of data:
Optimize requests and responses that involve large objects
Some resources may be large objects or include large fields, such as graphics images or other types of binary data.
A web API should support streaming to enable optimized uploading and downloading of these resources.
The HTTP protocol provides the chunked transfer encoding mechanism to stream large data objects back to a client.
When the client sends an HTTP GET request for a large object, the web API can send the reply back in piecemeal
chunks over an HTTP connection. The length of the data in the reply may not be known initially (it might be
generated), so the server hosting the web API should send a response message with each chunk that specifies the
Transfer-Encoding: Chunked header rather than a Content-Length header. The client application can receive each
chunk in turn to build up the complete response. The data transfer completes when the server sends back a final
chunk with zero size.
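
In ASP.NET Web API, one way to produce such a chunked response is PushStreamContent, which writes to the
response stream as data becomes available and does not set a Content-Length header. The following sketch is a
hedged example: ImagesController and GetImageStreamAsync are hypothetical and simply stand in for whatever
large resource the web API serves.

using System;
using System.IO;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;
using System.Web.Http;

public class ImagesController : ApiController
{
    [Route("api/images/{id:int}")]
    [HttpGet]
    public HttpResponseMessage GetImage(int id)
    {
        var response = new HttpResponseMessage(HttpStatusCode.OK);

        // PushStreamContent invokes this delegate when the client starts reading the
        // response. Because no Content-Length header is set, the reply is sent using
        // chunked transfer encoding.
        Func<Stream, HttpContent, TransportContext, Task> writer = async (outputStream, httpContent, context) =>
        {
            using (outputStream) // closing the stream signals the final, zero-length chunk
            using (Stream source = await GetImageStreamAsync(id))
            {
                await source.CopyToAsync(outputStream);
            }
        };

        response.Content = new PushStreamContent(writer, new MediaTypeHeaderValue("image/jpeg"));
        return response;
    }

    private Task<Stream> GetImageStreamAsync(int id)
    {
        // Placeholder: open the image from blob storage, a database, or the file system.
        return Task.FromResult<Stream>(File.OpenRead($"image-{id}.jpg"));
    }
}
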
A single request could conceivably result in a massive object that consumes considerable resources. If during the
streaming process the web API determines that the amount of data in a request has exceeded some acceptable
bounds, it can abort the operation and return a response message with status code 413 (Request Entity Too Large).
You can minimize the size of large objects transmitted over the network by using HTTP compression. This approach
helps to reduce the amount of network traffic and the associated network latency, but at the cost of requiring
additional processing at the client and the server hosting the web API. For example, a client application that expects
to receive compressed data can include an Accept-Encoding: gzip request header (other data compression
algorithms can also be specified). If the server supports compression it should respond with the content held in
gzip format in the message body and the Content-Encoding: gzip response header.
You can combine encoded compression with streaming; compress the data first before streaming it, and specify the
gzip content encoding and chunked transfer encoding in the message headers. Also note that some web servers
(such as Internet Information Server) can be configured to automatically compress HTTP responses regardless of
whether the web API compresses the data or not.
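
On the client side, a hedged sketch of requesting and transparently decompressing gzip content with
HttpClientHandler might look like this. Setting AutomaticDecompression causes the handler to add the
corresponding Accept-Encoding values and to decompress the response body before it is handed to the caller.

using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class CompressedClient
{
    public static async Task<string> GetWithCompressionAsync(string uri)
    {
        var handler = new HttpClientHandler
        {
            // Sends Accept-Encoding: gzip, deflate and transparently decompresses
            // any Content-Encoding: gzip (or deflate) response body.
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };

        using (var client = new HttpClient(handler))
        {
            return await client.GetStringAsync(uri);
        }
    }
}
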
Implement partial responses for clients that do not support asynchronous operations
As an alternative to asynchronous streaming, a client application can explicitly request data for large objects in
chunks, known as partial responses. The client application sends an HTTP HEAD request to obtain information
about the object. If the web API supports partial responses it should respond to the HEAD request with a response
message that contains an Accept-Ranges header and a Content-Length header that indicates the total size of the
object, but the body of the message should be empty. The client application can use this information to construct a
series of GET requests that specify a range of bytes to receive. The web API should return a response message with
HTTP status 206 (Partial Content), a Content-Length header that specifies the actual amount of data included in the
body of the response message, and a Content-Range header that indicates which part (such as bytes 4000 to 8000)
of the object this data represents.
HTTP HEAD requests and partial responses are described in more detail in API design.
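
A hedged client-side sketch of this HEAD-then-range sequence, using HttpClient, is shown below. GetChunkAsync is
a hypothetical helper, and the byte range mirrors the 4000 to 8000 example above.

using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class PartialResponseClient
{
    private static readonly HttpClient client = new HttpClient();

    public static async Task<byte[]> GetChunkAsync(string uri, long from, long to)
    {
        // HEAD request: discover whether the resource supports ranges and how big it is.
        // A server that supports partial responses returns an Accept-Ranges header and
        // a Content-Length header, with an empty body.
        var headRequest = new HttpRequestMessage(HttpMethod.Head, uri);
        HttpResponseMessage headResponse = await client.SendAsync(headRequest);
        long? totalSize = headResponse.Content.Headers.ContentLength;
        if (totalSize.HasValue && to >= totalSize.Value)
        {
            to = totalSize.Value - 1; // clamp the range to the size reported by the server
        }

        // GET request for a specific byte range. A server that supports partial responses
        // replies with 206 (Partial Content) and a Content-Range header.
        var rangeRequest = new HttpRequestMessage(HttpMethod.Get, uri);
        rangeRequest.Headers.Range = new RangeHeaderValue(from, to);

        HttpResponseMessage rangeResponse = await client.SendAsync(rangeRequest);
        rangeResponse.EnsureSuccessStatusCode(); // 206 is treated as success

        return await rangeResponse.Content.ReadAsByteArrayAsync();
    }
}

For example, GetChunkAsync(uri, 4000, 8000) retrieves bytes 4000 to 8000 of the object.
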
Avoid sending unnecessary 100-Continue status messages in client applications
A client application that is about to send a large amount of data to a server may determine first whether the server
is actually willing to accept the request. Prior to sending the data, the client application can submit an HTTP request
with an Expect: 100-Continue header, a Content-Length header that indicates the size of the data, but an empty
message body. If the server is willing to handle the request, it should respond with a message that specifies the
HTTP status 100 (Continue). The client application can then proceed and send the complete request including the
data in the message body.
If you are hosting a service by using IIS, the HTTP.sys driver automatically detects and handles Expect: 100-Continue
headers before passing requests to your web application. This means that you are unlikely to see these headers in
your application code, and you can assume that IIS has already filtered any messages that it deems to be unfit or
too large.
If you are building client applications by using the .NET Framework, then all POST and PUT messages will first send
messages with Expect: 100-Continue headers by default. As with the server-side, the process is handled
transparently by the .NET Framework. However, this process results in each POST and PUT request causing two
round-trips to the server, even for small requests. If your application is not sending requests with large amounts of
data, you can disable this feature by using the ServicePointManager class to create ServicePoint objects in the
client application. A ServicePoint object handles the connections that the client makes to a server based on the
scheme and host fragments of URIs that identify resources on the server. You can then set the Expect100Continue
property of the ServicePoint object to false. All subsequent POST and PUT requests made by the client through a
URI that matches the scheme and host fragments of the ServicePoint object will be sent without Expect: 100-
Continue headers. The following code shows how to configure a ServicePoint object that applies to all requests
sent to URIs with a scheme of https and a host of www.contoso.com.

Uri uri = new Uri("https://www.contoso.com/");
ServicePoint sp = ServicePointManager.FindServicePoint(uri);
sp.Expect100Continue = false;

You can also set the static Expect100Continue property of the ServicePointManager class to specify the default value
of this property for all subsequently created ServicePoint objects.
Support pagination for requests that may return large numbers of objects
If a collection contains a large number of resources, issuing a GET request to the corresponding URI could result in
significant processing on the server hosting the web API affecting performance, and generate a significant amount
of network traffic resulting in increased latency.
To handle these cases, the web API should support query strings that enable the client application to refine requests
or fetch data in more manageable, discrete blocks (or pages). The code below shows the GetAllOrders method in
the Orders controller. This method retrieves the details of orders. If this method was unconstrained, it could
conceivably return a large amount of data. The limit and offset parameters are intended to reduce the volume
of data to a smaller subset, in this case only the first 10 orders by default:
public class OrdersController : ApiController
{
    ...
    [Route("api/orders")]
    [HttpGet]
    public IEnumerable<Order> GetAllOrders(int limit=10, int offset=0)
    {
        // Find the number of orders specified by the limit parameter
        // starting with the order specified by the offset parameter
        var orders = ...
        return orders;
    }
    ...
}

A client application can issue a request to retrieve 30 orders starting at offset 50 by using the URI
https://www.adventure-works.com/api/orders?limit=30&offset=50.
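
A client that needs every order can walk the collection one page at a time. The following sketch is illustrative: the
Order DTO mirrors the JSON shape shown earlier in this article, and Json.NET (the default Web API serializer) is
assumed to be available for deserialization.

using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json;

public class Order
{
    public int OrderID { get; set; }
    public int ProductID { get; set; }
    public int Quantity { get; set; }
    public decimal OrderValue { get; set; }
}

public static class PagedOrderClient
{
    private static readonly HttpClient client = new HttpClient();

    // Fetches every order by requesting the collection one page at a time.
    public static async Task<List<Order>> GetAllOrdersAsync(string baseUri, int pageSize = 30)
    {
        var allOrders = new List<Order>();
        int offset = 0;

        while (true)
        {
            string uri = $"{baseUri}/api/orders?limit={pageSize}&offset={offset}";
            string json = await client.GetStringAsync(uri);
            var page = JsonConvert.DeserializeObject<List<Order>>(json);

            allOrders.AddRange(page);

            // A short page means the final block has been reached.
            if (page.Count < pageSize)
            {
                break;
            }

            offset += pageSize;
        }

        return allOrders;
    }
}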

TIP
Avoid enabling client applications to specify query strings that result in a URI that is more than 2000 characters long. Many
web clients and servers cannot handle URIs that are this long.

Maintaining responsiveness, scalability, and availability


The same web API might be used by many client applications running anywhere in the world. It is important to
ensure that the web API is implemented to maintain responsiveness under a heavy load, to be scalable to support a
highly varying workload, and to guarantee availability for clients that perform business-critical operations.
Consider the following points when determining how to meet these requirements:
Provide asynchronous support for long-running requests
A request that might take a long time to process should be performed without blocking the client that submitted
the request. The web API can perform some initial checking to validate the request, initiate a separate task to
perform the work, and then return a response message with HTTP code 202 (Accepted). The task could run
asynchronously as part of the web API processing, or it could be offloaded to a background task.
The web API should also provide a mechanism to return the results of the processing to the client application. You
can achieve this by providing a polling mechanism for client applications to periodically query whether the
processing has finished and obtain the result, or enabling the web API to send a notification when the operation
has completed.
You can implement a simple polling mechanism by providing a polling URI that acts as a virtual resource using the
following approach (a minimal sketch in code follows the numbered list):
1. The client application sends the initial request to the web API.
2. The web API stores information about the request in a table held in table storage or Microsoft Azure Cache, and
generates a unique key for this entry, possibly in the form of a GUID.
3. The web API initiates the processing as a separate task. The web API records the state of the task in the table as
Running.
4. The web API returns a response message with HTTP status code 202 (Accepted), and the GUID of the table entry
in the body of the message.
5. When the task has completed, the web API stores the results in the table, and sets the state of the task to
Complete. Note that if the task fails, the web API could also store information about the failure and set the status
to Failed.
6. While the task is running, the client can continue performing its own processing. It can periodically send a
request to the URI /polling/{guid} where {guid} is the GUID returned in the 202 response message by the web
API.
7. The web API at the /polling/{guid} URI queries the state of the corresponding task in the table and returns a
response message with HTTP status code 200 (OK) containing this state (Running, Complete, or Failed). If the
task has completed or failed, the response message can also include the results of the processing or any
information available about the reason for the failure.
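
The following code is a minimal sketch of that approach. ProcessingController, JobStatus, and the in-memory job
dictionary are all hypothetical; a real implementation would persist the job state in table storage or a cache, as
described in step 2, so that any instance of the web API can answer the polling request, and would offload the work
to a background job rather than relying on Task.Run inside the web process.

using System;
using System.Collections.Concurrent;
using System.Net;
using System.Threading.Tasks;
using System.Web.Http;

public class JobStatus
{
    public string State { get; set; }   // Running, Complete, or Failed
    public string Result { get; set; }
}

public class ProcessingController : ApiController
{
    // Hypothetical in-memory job table. Use table storage or a distributed cache in
    // practice so that the state survives restarts and is visible to all instances.
    private static readonly ConcurrentDictionary<Guid, JobStatus> jobs =
        new ConcurrentDictionary<Guid, JobStatus>();

    [Route("api/processing")]
    [HttpPost]
    public IHttpActionResult StartLongRunningWork()
    {
        Guid jobId = Guid.NewGuid();
        jobs[jobId] = new JobStatus { State = "Running" };

        // Start the work as a separate task and return immediately.
        Task.Run(async () =>
        {
            try
            {
                await Task.Delay(TimeSpan.FromSeconds(30)); // placeholder for the real work
                jobs[jobId] = new JobStatus { State = "Complete", Result = "done" };
            }
            catch (Exception ex)
            {
                jobs[jobId] = new JobStatus { State = "Failed", Result = ex.Message };
            }
        });

        // 202 (Accepted) with the polling URI for this job.
        return Content(HttpStatusCode.Accepted, new { jobId, pollingUri = $"/polling/{jobId}" });
    }

    [Route("polling/{jobId:guid}")]
    [HttpGet]
    public IHttpActionResult GetStatus(Guid jobId)
    {
        if (!jobs.TryGetValue(jobId, out JobStatus status))
        {
            return NotFound();
        }

        return Ok(status);
    }
}
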
Options for implementing notifications include:
Using a notification hub to push asynchronous responses to client applications. For more information, see Send
notifications to specific users by using Azure Notification Hubs.
Using the Comet model to retain a persistent network connection between the client and the server hosting the
web API, and using this connection to push messages from the server back to the client. The MSDN magazine
article Building a Simple Comet Application in the Microsoft .NET Framework describes an example solution.
Using SignalR to push data in real time from the web server to the client over a persistent network connection.
SignalR is available for ASP.NET web applications as a NuGet package. You can find more information on the
ASP.NET SignalR website.
Ensure that each request is stateless
Each request should be considered atomic. There should be no dependencies between one request made by a client
application and any subsequent requests submitted by the same client. This approach assists in scalability;
instances of the web service can be deployed on a number of servers. Client requests can be directed at any of
these instances and the results should always be the same. It also improves availability for a similar reason; if a web
server fails, requests can be routed to another instance (by using Azure Traffic Manager) while the server is
restarted with no ill effects on client applications.
Track clients and implement throttling to reduce the chances of DOS attacks
If a specific client makes a large number of requests within a given period of time it might monopolize the service
and affect the performance of other clients. To mitigate this issue, a web API can monitor calls from client
applications either by tracking the IP address of all incoming requests or by logging each authenticated access. You
can use this information to limit resource access. If a client exceeds a defined limit, the web API can return a
response message with status 503 (Service Unavailable) and include a Retry-After header that specifies when the
client can send the next request without it being declined. This strategy can help to reduce the chances of a Denial
Of Service (DOS) attack from a set of clients stalling the system.
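
As a hedged illustration, a simple per-client limit can be applied in ASP.NET Web API with a message handler such
as the one below. The fixed one-minute window, the in-memory counters, and the X-Client-Id header are all
simplifying assumptions; production systems typically identify clients from authentication data or the client IP
address, share counters across instances, and often delegate throttling to a gateway such as Azure API Management.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading;
using System.Threading.Tasks;

public class SimpleThrottlingHandler : DelegatingHandler
{
    private const int MaxRequestsPerMinute = 60;
    private static readonly ConcurrentDictionary<string, int> counters =
        new ConcurrentDictionary<string, int>();
    private static DateTime windowStart = DateTime.UtcNow;

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Reset the counters at the start of each one-minute window.
        if (DateTime.UtcNow - windowStart > TimeSpan.FromMinutes(1))
        {
            counters.Clear();
            windowStart = DateTime.UtcNow;
        }

        // Identify the caller. A hypothetical X-Client-Id header is used here;
        // an authenticated client identity is a better key in practice.
        string clientKey = request.Headers.TryGetValues("X-Client-Id", out IEnumerable<string> values)
            ? values.First()
            : "anonymous";

        int count = counters.AddOrUpdate(clientKey, 1, (key, current) => current + 1);

        if (count > MaxRequestsPerMinute)
        {
            // Too many requests in this window: return 503 with a Retry-After header.
            var response = new HttpResponseMessage(HttpStatusCode.ServiceUnavailable)
            {
                RequestMessage = request
            };
            response.Headers.RetryAfter = new RetryConditionHeaderValue(TimeSpan.FromSeconds(30));
            return response;
        }

        return await base.SendAsync(request, cancellationToken);
    }
}

The handler would be registered at startup, for example with
config.MessageHandlers.Add(new SimpleThrottlingHandler()).
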
Manage persistent HTTP connections carefully
The HTTP protocol supports persistent HTTP connections where they are available. The HTTP 1.0 specification
added the Connection:Keep-Alive header that enables a client application to indicate to the server that it can use the
same connection to send subsequent requests rather than opening new ones. The connection closes automatically
if the client does not reuse the connection within a period defined by the host. This behavior is the default in HTTP
1.1 as used by Azure services, so there is no need to include Keep-Alive headers in messages.
Keeping a connection open can help to improve responsiveness by reducing latency and network congestion, but it
can be detrimental to scalability by keeping unnecessary connections open for longer than required, limiting the
ability of other concurrent clients to connect. It can also affect battery life if the client application is running on a
mobile device; if the application only makes occasional requests to the server, maintaining an open connection can
cause the battery to drain more quickly. To ensure that a connection is not made persistent with HTTP 1.1, the client
can include a Connection:Close header with messages to override the default behavior. Similarly, if a server is
handling a very large number of clients it can include a Connection:Close header in response messages which
should close the connection and save server resources.
NOTE
Persistent HTTP connections are a purely optional feature to reduce the network overhead associated with repeatedly
establishing a communications channel. Neither the web API nor the client application should depend on a persistent HTTP
connection being available. Do not use persistent HTTP connections to implement Comet-style notification systems; instead
you should use sockets (or web sockets if available) at the TCP layer. Finally, note Keep-Alive headers are of limited use if a
client application communicates with a server via a proxy; only the connection with the client and the proxy will be persistent.

Publishing and managing a web API


To make a web API available for client applications, the web API must be deployed to a host environment. This
environment is typically a web server, although it may be some other type of host process. You should consider the
following points when publishing a web API:
All requests must be authenticated and authorized, and the appropriate level of access control must be enforced.
A commercial web API might be subject to various quality guarantees concerning response times. It is important
to ensure that host environment is scalable if the load can vary significantly over time.
It may be necessary to meter requests for monetization purposes.
It might be necessary to regulate the flow of traffic to the web API, and implement throttling for specific clients
that have exhausted their quotas.
Regulatory requirements might mandate logging and auditing of all requests and responses.
To ensure availability, it may be necessary to monitor the health of the server hosting the web API and restart it
if necessary.
It is useful to be able to decouple these issues from the technical issues concerning the implementation of the web
API. For this reason, consider creating a façade, running as a separate process and that routes requests to the web
API. The façade can provide the management operations and forward validated requests to the web API. Using a
façade can also bring many functional advantages, including:
Acting as an integration point for multiple web APIs.
Transforming messages and translating communications protocols for clients built by using varying
technologies.
Caching requests and responses to reduce load on the server hosting the web API.

Testing a web API


A web API should be tested as thoroughly as any other piece of software. You should consider creating unit tests to
validate the functionality.
The nature of a web API brings its own additional requirements to verify that it operates correctly. You should pay
particular attention to the following aspects:
Test all routes to verify that they invoke the correct operations. Be especially aware of HTTP status code 405
(Method Not Allowed) being returned unexpectedly as this can indicate a mismatch between a route and the
HTTP methods (GET, POST, PUT, DELETE) that can be dispatched to that route.
Send HTTP requests to routes that do not support them, such as submitting a POST request to a specific
resource (POST requests should only be sent to resource collections). In these cases, the only valid response
should be status code 405 (Method Not Allowed).
Verify that all routes are protected properly and are subject to the appropriate authentication and
authorization checks.
NOTE
Some aspects of security such as user authentication are most likely to be the responsibility of the host environment
rather than the web API, but it is still necessary to include security tests as part of the deployment process.

Test the exception handling performed by each operation and verify that an appropriate and meaningful
HTTP response is passed back to the client application.
Verify that request and response messages are well-formed. For example, if an HTTP POST request contains
the data for a new resource in x-www-form-urlencoded format, confirm that the corresponding operation
correctly parses the data, creates the resources, and returns a response containing the details of the new
resource, including the correct Location header.
Verify all links and URIs in response messages. For example, an HTTP POST message should return the URI
of the newly created resource. All HATEOAS links should be valid.
Ensure that each operation returns the correct status codes for different combinations of input. For example:
If a query is successful, it should return status code 200 (OK)
If a resource is not found, the operation should return HTTP status code 404 (Not Found).
If the client sends a request that successfully deletes a resource, the status code should be 204 (No
Content).
If the client sends a request that creates a new resource, the status code should be 201 (Created).
Watch out for unexpected response status codes in the 5xx range. These messages are usually reported by the host
server to indicate that it was unable to fulfill a valid request.
Test the different request header combinations that a client application can specify and ensure that the web
API returns the expected information in response messages.
Test query strings. If an operation can take optional parameters (such as pagination requests), test the
different combinations and order of parameters.
Verify that asynchronous operations complete successfully. If the web API supports streaming for requests
that return large binary objects (such as video or audio), ensure that client requests are not blocked while
the data is streamed. If the web API implements polling for long-running data modification operations, verify
that the operations report their status correctly as they proceed.
You should also create and run performance tests to check that the web API operates satisfactorily under duress.
You can build a web performance and load test project by using Visual Studio Ultimate. For more information, see
Run performance tests on an application before a release.
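
As a hedged sketch, unit tests for the status-code behavior of the FindOrderByID operation shown earlier might look
like the following. xUnit is assumed as the test framework, the tests assume that order 2 exists and order 999 does
not in whatever data source the controller uses, and the Request and Configuration properties are set so that the
controller can build response messages outside a host.

using System.Net.Http;
using System.Web.Http;
using System.Web.Http.Results;
using Xunit;

public class OrdersControllerTests
{
    private static OrdersController CreateController()
    {
        // The controller needs a request and a configuration before it can build responses.
        return new OrdersController
        {
            Request = new HttpRequestMessage(),
            Configuration = new HttpConfiguration()
        };
    }

    [Fact]
    public void FindOrderByID_returns_NotFound_for_a_missing_order()
    {
        var controller = CreateController();

        IHttpActionResult result = controller.FindOrderByID(999);

        Assert.IsType<NotFoundResult>(result);
    }

    [Fact]
    public void FindOrderByID_returns_a_cacheable_result_for_an_existing_order()
    {
        var controller = CreateController();

        IHttpActionResult result = controller.FindOrderByID(2);

        // The happy path should not be NotFound; a fuller test would execute the
        // result and inspect the ETag and Cache-Control headers of the response.
        Assert.IsNotType<NotFoundResult>(result);
    }
}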

Using Azure API Management


On Azure, consider using Azure API Management to publish and manage a web API. Using this facility, you can
generate a service that acts as a façade for one or more web APIs. The service is itself a scalable web service that
you can create and configure by using the Azure portal. You can use this service to publish and manage a web API
as follows:
1. Deploy the web API to a website, Azure cloud service, or Azure virtual machine.
2. Connect the API management service to the web API. Requests sent to the URL of the management API are
mapped to URIs in the web API. The same API management service can route requests to more than one
web API. This enables you to aggregate multiple web APIs into a single management service. Similarly, the
same web API can be referenced from more than one API management service if you need to restrict or
partition the functionality available to different applications.
NOTE
The URIs in HATEOAS links generated as part of the response for HTTP GET requests should reference the URL of the
API management service and not the web server hosting the web API.

3. For each web API, specify the HTTP operations that the web API exposes together with any optional
parameters that an operation can take as input. You can also configure whether the API management service
should cache the response received from the web API to optimize repeated requests for the same data.
Record the details of the HTTP responses that each operation can generate. This information is used to
generate documentation for developers, so it is important that it is accurate and complete.
You can either define operations manually using the wizards provided by the Azure portal, or you can
import them from a file containing the definitions in WADL or Swagger format.
4. Configure the security settings for communications between the API management service and the web
server hosting the web API. The API management service currently supports Basic authentication and
mutual authentication using certificates, and OAuth 2.0 user authorization.
5. Create a product. A product is the unit of publication; you add the web APIs that you previously connected to
the management service to the product. When the product is published, the web APIs become available to
developers.

NOTE
Prior to publishing a product, you can also define user-groups that can access the product and add users to these
groups. This gives you control over the developers and applications that can use the web API. If a web API is subject
to approval, prior to being able to access it a developer must send a request to the product administrator. The
administrator can grant or deny access to the developer. Existing developers can also be blocked if circumstances
change.

6. Configure policies for each web API. Policies govern aspects such as whether cross-domain calls should be
allowed, how to authenticate clients, whether to convert between XML and JSON data formats transparently,
whether to restrict calls from a given IP range, usage quotas, and whether to limit the call rate. Policies can
be applied globally across the entire product, for a single web API in a product, or for individual operations
in a web API.
For more information, see the API Management documentation.

TIP
Azure provides the Azure Traffic Manager which enables you to implement failover and load-balancing, and reduce latency
across multiple instances of a web site hosted in different geographic locations. You can use Azure Traffic Manager in
conjunction with the API Management Service; the API Management Service can route requests to instances of a web site
through Azure Traffic Manager. For more information, see Traffic Manager routing methods.
In this structure, if you are using custom DNS names for your web sites, you should configure the appropriate CNAME record
for each web site to point to the DNS name of the Azure Traffic Manager web site.

Supporting client-side developers


Developers constructing client applications typically require information on how to access the web API, and
documentation concerning the parameters, data types, return types, and return codes that describe the different
requests and responses between the web service and the client application.
Document the REST operations for a web API
The Azure API Management Service includes a developer portal that describes the REST operations exposed by a
web API. When a product has been published it appears on this portal. Developers can use this portal to sign up for
access; the administrator can then approve or deny the request. If the developer is approved, they are assigned a
subscription key that is used to authenticate calls from the client applications that they develop. This key must be
provided with each web API call otherwise it will be rejected.
This portal also provides:
Documentation for the product, listing the operations that it exposes, the parameters required, and the different
responses that can be returned. Note that this information is generated from the details provided in step 3 in
the list in the Publishing a web API by using the Microsoft Azure API Management Service section.
Code snippets that show how to invoke operations from several languages, including JavaScript, C#, Java, Ruby,
Python, and PHP.
A developers' console that enables a developer to send an HTTP request to test each operation in the product
and view the results.
A page where the developer can report any issues or problems found.
The Azure portal enables you to customize the developer portal to change the styling and layout to match the
branding of your organization.
Implement a client SDK
Building a client application that invokes REST requests to access a web API requires writing a significant amount of
code to construct each request and format it appropriately, send the request to the server hosting the web service,
and parse the response to work out whether the request succeeded or failed and extract any data returned. To
insulate the client application from these concerns, you can provide an SDK that wraps the REST interface and
abstracts these low-level details inside a more functional set of methods. A client application uses these methods,
which transparently convert calls into REST requests and then convert the responses back into method return
values. This is a common technique that is implemented by many services, including the Azure SDK.
Creating a client-side SDK is a considerable undertaking as it has to be implemented consistently and tested
carefully. However, much of this process can be made mechanical, and many vendors supply tools that can
automate many of these tasks.
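
As a hedged illustration of what such an SDK might look like, the class below wraps the orders API in a single
method. AdventureWorksClient is hypothetical; it reuses the Order DTO from the paging sketch earlier, and the
Ocp-Apim-Subscription-Key header is the subscription key header expected by Azure API Management.

using System;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;
using Newtonsoft.Json;

// A minimal SDK that hides the REST details of the orders web API from the caller.
public class AdventureWorksClient : IDisposable
{
    private readonly HttpClient client;

    public AdventureWorksClient(Uri baseAddress, string subscriptionKey)
    {
        client = new HttpClient { BaseAddress = baseAddress };
        client.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));

        // API Management expects the subscription key on every call.
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);
    }

    // GET api/orders/{id}. Returns null if the order does not exist.
    public async Task<Order> GetOrderAsync(int id)
    {
        HttpResponseMessage response = await client.GetAsync($"api/orders/{id}");
        if (response.StatusCode == HttpStatusCode.NotFound)
        {
            return null;
        }

        response.EnsureSuccessStatusCode();

        string json = await response.Content.ReadAsStringAsync();
        return JsonConvert.DeserializeObject<Order>(json);
    }

    public void Dispose() => client.Dispose();
}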

Monitoring a web API


Depending on how you have published and deployed your web API you can monitor the web API directly, or you
can gather usage and health information by analyzing the traffic that passes through the API Management service.
Monitoring a web API directly
If you have implemented your web API by using the ASP.NET Web API template (either as a Web API project or as a
Web role in an Azure cloud service) and Visual Studio 2013, you can gather availability, performance, and usage
data by using ASP.NET Application Insights. Application Insights is a package that transparently tracks and records
information about requests and responses when the web API is deployed to the cloud; once the package is installed
and configured, you don't need to amend any code in your web API to use it. When you deploy the web API to an
Azure web site, all traffic is examined and the following statistics are gathered:
Server response time.
Number of server requests and the details of each request.
The top slowest requests in terms of average response time.
The details of any failed requests.
The number of sessions initiated by different browsers and user agents.
The most frequently viewed pages (primarily useful for web applications rather than web APIs).
The different user roles accessing the web API.
You can view this data in real time in the Azure portal. You can also create web tests that monitor the health of the
web API. A web test sends a periodic request to a specified URI in the web API and captures the response. You can
specify the definition of a successful response (such as HTTP status code 200), and if the request does not return
this response you can arrange for an alert to be sent to an administrator. If necessary, the administrator can restart
the server hosting the web API if it has failed.
For more information, see Application Insights - Get started with ASP.NET.
Monitoring a web API through the API Management Service
If you have published your web API by using the API Management service, the API Management page on the Azure
portal contains a dashboard that enables you to view the overall performance of the service. The Analytics page
enables you to drill down into the details of how the product is being used. This page contains the following tabs:
Usage. This tab provides information about the number of API calls made and the bandwidth used to handle
these calls over time. You can filter usage details by product, API, and operation.
Health. This tab enables you to view the outcome of API requests (the HTTP status codes returned), the
effectiveness of the caching policy, the API response time, and the service response time. Again, you can filter
health data by product, API, and operation.
Activity. This tab provides a text summary of the numbers of successful calls, failed calls, blocked calls, average
response time, and response times for each product, web API, and operation. This page also lists the number of
calls made by each developer.
At a glance. This tab displays a summary of the performance data, including the developers responsible for
making the most API calls, and the products, web APIs, and operations that received these calls.
You can use this information to determine whether a particular web API or operation is causing a bottleneck, and if
necessary scale the host environment and add more servers. You can also ascertain whether one or more
applications are using a disproportionate volume of resources and apply the appropriate policies to set quotas and
limit call rates.

NOTE
You can change the details for a published product, and the changes are applied immediately. For example, you can add or
remove an operation from a web API without requiring that you republish the product that contains the web API.

More information
ASP.NET Web API OData contains examples and further information on implementing an OData web API by
using ASP.NET.
Introducing batch support in Web API and Web API OData describes how to implement batch operations in a
web API by using OData.
Idempotency patterns on Jonathan Oliver's blog provides an overview of idempotency and how it relates to
data management operations.
Status code definitions on the W3C website contains a full list of HTTP status codes and their descriptions.
Run background tasks with WebJobs provides information and examples on using WebJobs to perform
background operations.
Azure Notification Hubs notify users shows how to use an Azure Notification Hub to push asynchronous
responses to client applications.
API Management describes how to publish a product that provides controlled and secure access to a web API.
Azure API Management REST API reference describes how to use the API Management REST API to build custom
management applications.
Traffic Manager routing methods summarizes how Azure Traffic Manager can be used to load-balance requests
across multiple instances of a website hosting a web API.
Application Insights - Get started with ASP.NET provides detailed information on installing and configuring
Application Insights in an ASP.NET Web API project.
Autoscaling
11/2/2020 • 15 minutes to read

Autoscaling is the process of dynamically allocating resources to match performance requirements. As the
volume of work grows, an application may need additional resources to maintain the desired performance levels
and satisfy service-level agreements (SLAs). As demand slackens and the additional resources are no longer
needed, they can be de-allocated to minimize costs.
Autoscaling takes advantage of the elasticity of cloud-hosted environments while easing management overhead.
It reduces the need for an operator to continually monitor the performance of a system and make decisions
about adding or removing resources.
There are two main ways that an application can scale:
Vertical scaling, also called scaling up and down, means changing the capacity of a resource. For
example, you could move an application to a larger VM size. Vertical scaling often requires making the
system temporarily unavailable while it is being redeployed. Therefore, it's less common to automate
vertical scaling.
Horizontal scaling, also called scaling out and in, means adding or removing instances of a resource.
The application continues running without interruption as new resources are provisioned. When the
provisioning process is complete, the solution is deployed on these additional resources. If demand drops,
the additional resources can be shut down cleanly and deallocated.
Many cloud-based systems, including Microsoft Azure, support automatic horizontal scaling. The rest of this
article focuses on horizontal scaling.

NOTE
Autoscaling mostly applies to compute resources. While it's possible to horizontally scale a database or message queue,
this usually involves data partitioning, which is generally not automated.

Overview
An autoscaling strategy typically involves the following pieces:
Instrumentation and monitoring systems at the application, service, and infrastructure levels. These systems
capture key metrics, such as response times, queue lengths, CPU utilization, and memory usage.
Decision-making logic that evaluates these metrics against predefined thresholds or schedules, and decides
whether to scale.
Components that scale the system.
Testing, monitoring, and tuning of the autoscaling strategy to ensure that it functions as expected.
Azure provides built-in autoscaling mechanisms that address common scenarios. If a particular service or
technology does not have built-in autoscaling functionality, or if you have specific autoscaling requirements
beyond its capabilities, you might consider a custom implementation. A custom implementation would collect
operational and system metrics, analyze the metrics, and then scale resources accordingly.

Configure autoscaling for an Azure solution


Azure provides built-in autoscaling for most compute options.
Azure Virtual Machines autoscale via virtual machine scale sets, which manage a set of Azure virtual
machines as a group. See How to use automatic scaling and virtual machine scale sets.
Service Fabric also supports autoscaling through virtual machine scale sets. Every node type in a
Service Fabric cluster is set up as a separate virtual machine scale set. That way, each node type can be
scaled in or out independently. See Scale a Service Fabric cluster in or out using autoscale rules.
Azure App Service has built-in autoscaling. Autoscale settings apply to all of the apps within an App
Service. See Scale instance count manually or automatically.
Azure Cloud Services has built-in autoscaling at the role level. See How to configure auto scaling for a
Cloud Service in the portal.
These compute options all use Azure Monitor autoscale to provide a common set of autoscaling functionality.
Azure Functions differs from the previous compute options, because you don't need to configure any
autoscale rules. Instead, Azure Functions automatically allocates compute power when your code is running,
scaling out as necessary to handle load. For more information, see Choose the correct hosting plan for Azure
Functions.
Finally, a custom autoscaling solution can sometimes be useful. For example, you could use Azure diagnostics
and application-based metrics, along with custom code to monitor and export the application metrics. Then you
could define custom rules based on these metrics, and use Resource Manager REST APIs to trigger autoscaling.
However, a custom solution is not simple to implement, and should be considered only if none of the previous
approaches can fulfill your requirements.
Use the built-in autoscaling features of the platform, if they meet your requirements. If not, carefully consider
whether you really need more complex scaling features. Examples of additional requirements may include more
granularity of control, different ways to detect trigger events for scaling, scaling across subscriptions, and scaling
other types of resources.

Use Azure Monitor autoscale


Azure Monitor autoscale provides a common set of autoscaling functionality for virtual machine scale sets, Azure
App Service, and Azure Cloud Services. Scaling can be performed on a schedule, or based on a runtime metric,
such as CPU or memory usage.
Examples:
Scale out to 10 instances on weekdays, and scale in to 4 instances on Saturday and Sunday.
Scale out by one instance if average CPU usage is above 70%, and scale in by one instance if CPU usage falls
below 50%.
Scale out by one instance if the number of messages in a queue exceeds a certain threshold.
Scale up the resource when load increases to ensure availability. Similarly, at times of low usage, scale down, so
you can optimize cost. Always use a scale-out and scale-in rule combination. Otherwise, the autoscaling takes
place only in one direction until it reaches the threshold (maximum or minimum instance counts) set in the
profile.
Select a default instance count that's safe for your workload. It's scaled based on that value if maximum or
minimum instance counts are not set.
For a list of built-in metrics, see Azure Monitor autoscaling common metrics. You can also implement custom
metrics by using Application Insights.
You can configure autoscaling by using PowerShell, the Azure CLI, an Azure Resource Manager template, or the
Azure portal. For more detailed control, use the Azure Resource Manager REST API. The Azure Monitoring
Service Management Library and the Microsoft Insights Library (in preview) are SDKs that you can use to collect
metrics from different resources and perform autoscaling by making use of the REST APIs. For resources where
Azure Resource Manager support isn't available, or if you are using Azure Cloud Services, the Service
Management REST API can be used for autoscaling. In all other cases, use Azure Resource Manager.
Consider the following points when using Azure autoscale:
Consider whether you can predict the load on the application accurately enough to use scheduled
autoscaling, adding and removing instances to meet anticipated peaks in demand. If this isn't possible, use
reactive autoscaling based on runtime metrics, in order to handle unpredictable changes in demand.
Typically, you can combine these approaches. For example, create a strategy that adds resources based on
a schedule of the times when you know the application is busiest. This helps to ensure that capacity is
available when required, without any delay from starting new instances. For each scheduled rule, define
metrics that allow reactive autoscaling during that period to ensure that the application can handle
sustained but unpredictable peaks in demand.
It's often difficult to understand the relationship between metrics and capacity requirements, especially
when an application is initially deployed. Provision a little extra capacity at the beginning, and then
monitor and tune the autoscaling rules to bring the capacity closer to the actual load.
Configure the autoscaling rules, and then monitor the performance of your application over time. Use the
results of this monitoring to adjust the way in which the system scales if necessary. However, keep in mind
that autoscaling is not an instantaneous process. It takes time to react to a metric such as average CPU
utilization exceeding (or falling below) a specified threshold.
Autoscaling rules that use a detection mechanism based on a measured trigger attribute (such as CPU
usage or queue length) use an aggregated value over time, rather than instantaneous values, to trigger an
autoscaling action. By default, the aggregate is an average of the values. This prevents the system from
reacting too quickly, or causing rapid oscillation. It also allows time for new instances that are
automatically started to settle into running mode, preventing additional autoscaling actions from
occurring while the new instances are starting up. For Azure Cloud Services and Azure Virtual Machines,
the default period for the aggregation is 45 minutes, so it can take up to this period of time for the metric
to trigger autoscaling in response to spikes in demand. You can change the aggregation period by using
the SDK, but periods of less than 25 minutes may cause unpredictable results. For Web Apps, the
averaging period is much shorter, allowing new instances to be available in about five minutes after a
change to the average trigger measure.
Avoid flapping where scale-in and scale-out actions continually go back and forth. Suppose there are two
instances, and upper limit is 80% CPU, lower limit is 60%. When the load is at 85%, another instance is
added. After some time, the load decreases to 60%. Before scaling in, the autoscale service calculates the
distribution of total load (of three instances) when an instance is removed, taking it to 90%. This means it
would have to scale out again immediately. So, it skips scaling-in and you might never see the expected
scaling results.
The flapping situation can be controlled by choosing an adequate margin between the scale-out and
scale-in thresholds.
Manual scaling is reset by the maximum and minimum instance counts used for autoscaling. If you
manually update the instance count to a value above the maximum or below the minimum, the autoscale
engine automatically scales back to the minimum (if below) or the maximum (if above). For example, you
set the range between 3 and 6. If you have one running instance, the autoscale engine scales to three
instances on its next run. Likewise, if you manually set the scale to eight instances, autoscale will scale it
back to six instances on its next run. Manual scaling is temporary unless you reset the autoscale rules as
well.
The autoscale engine processes only one profile at a time. If a condition is not met, then it checks for the
next profile. Keep key metrics out of the default profile because that profile is checked last. Within a
profile, you can have multiple rules. On scale-out, autoscale runs if any rule is met. On scale-in, autoscale
requires all rules to be met.
For details about how Azure Monitor scales, see Best practices for Autoscale.
If you configure autoscaling using the SDK rather than the portal, you can specify a more detailed
schedule during which the rules are active. You can also create your own metrics and use them with or
without any of the existing ones in your autoscaling rules. For example, you may wish to use alternative
counters, such as the number of requests per second or the average memory availability, or use custom
counters to measure specific business processes.
When autoscaling Service Fabric, the node types in your cluster are made of virtual machine scale sets at
the back end, so you need to set up autoscale rules for each node type. Take into account the number of
nodes that you must have before you set up autoscaling. The minimum number of nodes that you must
have for the primary node type is driven by the reliability level you have chosen. For more information,
see scale a Service Fabric cluster in or out using autoscale rules.
You can use the portal to link resources such as SQL Database instances and queues to a Cloud Service
instance. This allows you to more easily access the separate manual and automatic scaling configuration
options for each of the linked resources. For more information, see How to: Link a resource to a cloud
service.
When you configure multiple policies and rules, they could conflict with each other. Autoscale uses the
following conflict resolution rules to ensure that there is always a sufficient number of instances running:
Scale-out operations always take precedence over scale-in operations.
When scale-out operations conflict, the rule that initiates the largest increase in the number of
instances takes precedence.
When scale-in operations conflict, the rule that initiates the smallest decrease in the number of
instances takes precedence.
In an App Service Environment, any worker pool or front-end metrics can be used to define autoscale
rules. For more information, see Autoscaling and App Service Environment.

Application design considerations


Autoscaling isn't an instant solution. Simply adding resources to a system or running more instances of a
process doesn't guarantee that the performance of the system will improve. Consider the following points when
designing an autoscaling strategy:
The system must be designed to be horizontally scalable. Avoid making assumptions about instance
affinity; do not design solutions that require that the code is always running in a specific instance of a
process. When scaling a cloud service or web site horizontally, don't assume that a series of requests from
the same source will always be routed to the same instance. For the same reason, design services to be
stateless to avoid requiring a series of requests from an application to always be routed to the same
instance of a service. When designing a service that reads messages from a queue and processes them,
don't make any assumptions about which instance of the service handles a specific message. Autoscaling
could start additional instances of a service as the queue length grows. The Competing Consumers
pattern describes how to handle this scenario.
If the solution implements a long-running task, design this task to support both scaling out and scaling in.
Without due care, such a task could prevent an instance of a process from being shut down cleanly when
the system scales in, or it could lose data if the process is forcibly terminated. Ideally, refactor a long-
running task and break up the processing that it performs into smaller, discrete chunks. The Pipes and
Filters pattern provides an example of how you can achieve this.
Alternatively, you can implement a checkpoint mechanism that records state information about the task at
regular intervals, and save this state in durable storage that can be accessed by any instance of the
process running the task. In this way, if the process is shut down, the work that it was performing can be
resumed from the last checkpoint by using another instance.
When background tasks run on separate compute instances, such as in worker roles of a cloud-services–
hosted application, you may need to scale different parts of the application using different scaling policies.
For example, you may need to deploy additional user interface (UI) compute instances without increasing
the number of background compute instances, or the opposite of this. If you offer different levels of
service (such as basic and premium service packages), you may need to scale out the compute resources
for premium service packages more aggressively than those for basic service packages in order to meet
SLAs.
Consider using the length of the queue over which UI and background compute instances communicate
as a criterion for your autoscaling strategy. This is the best indicator of an imbalance or difference
between the current load and the processing capacity of the background task.
If you base your autoscaling strategy on counters that measure business processes, such as the number of
orders placed per hour or the average execution time of a complex transaction, ensure that you fully
understand the relationship between the results from these types of counters and the actual compute
capacity requirements. It may be necessary to scale more than one component or compute unit in
response to changes in business process counters.
To prevent a system from attempting to scale out excessively, and to avoid the costs associated with
running many thousands of instances, consider limiting the maximum number of instances that can be
automatically added. Most autoscaling mechanisms allow you to specify the minimum and maximum
number of instances for a rule. In addition, consider gracefully degrading the functionality that the system
provides if the maximum number of instances have been deployed, and the system is still overloaded.
Keep in mind that autoscaling might not be the most appropriate mechanism to handle a sudden burst in
workload. It takes time to provision and start new instances of a service or add resources to a system, and
the peak demand may have passed by the time these additional resources have been made available. In
this scenario, it may be better to throttle the service. For more information, see the Throttling pattern.
Conversely, if you do need the capacity to process all requests when the volume fluctuates rapidly, and
cost isn't a major contributing factor, consider using an aggressive autoscaling strategy that starts
additional instances more quickly. You can also use a scheduled policy that starts a sufficient number of
instances to meet the maximum load before that load is expected.
The autoscaling mechanism should monitor the autoscaling process, and log the details of each
autoscaling event (what triggered it, what resources were added or removed, and when). If you create a
custom autoscaling mechanism, ensure that it incorporates this capability. Analyze the information to help
measure the effectiveness of the autoscaling strategy, and tune it if necessary. You can tune both in the
short term, as the usage patterns become more obvious, and over the long term, as the business expands
or the requirements of the application evolve. If an application reaches the upper limit defined for
autoscaling, the mechanism might also alert an operator who could manually start additional resources if
necessary. Note that under these circumstances the operator may also be responsible for manually
removing these resources after the workload eases.
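If you build a custom autoscaling mechanism, a structured log entry per scaling event makes this analysis straightforward. The following sketch is illustrative only; the field names are not a prescribed schema.

import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("autoscaler")

def log_scale_event(trigger, resource, previous_count, new_count):
    # Emit one structured record per autoscaling event so it can be queried later.
    logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "trigger": trigger,                 # what triggered the event (metric, schedule, operator)
        "resource": resource,               # which resource was scaled
        "previous_instances": previous_count,
        "new_instances": new_count,
    }))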

Related patterns and guidance


The following patterns and guidance may also be relevant to your scenario when implementing autoscaling:
Throttling pattern. This pattern describes how an application can continue to function and meet SLAs
when an increase in demand places an extreme load on resources. Throttling can be used with autoscaling
to prevent a system from being overwhelmed while the system scales out.
Competing Consumers pattern. This pattern describes how to implement a pool of service instances that
can handle messages from any application instance. Autoscaling can be used to start and stop service
instances to match the anticipated workload. This approach enables a system to process multiple
messages concurrently to optimize throughput, improve scalability and availability, and balance the
workload.
Monitoring and diagnostics. Instrumentation and telemetry are vital for gathering the information that
can drive the autoscaling process.
Background jobs

Many types of applications require background tasks that run independently of the user interface (UI). Examples
include batch jobs, intensive processing tasks, and long-running processes such as workflows. Background jobs
can be executed without requiring user interaction--the application can start the job and then continue to process
interactive requests from users. This can help to minimize the load on the application UI, which can improve
availability and reduce interactive response times.
For example, if an application is required to generate thumbnails of images that are uploaded by users, it can do
this as a background job and save the thumbnail to storage when it is complete--without the user needing to wait
for the process to be completed. In the same way, a user placing an order can initiate a background workflow that
processes the order, while the UI allows the user to continue browsing the web app. When the background job is
complete, it can update the stored orders data and send an email to the user that confirms the order.
When you consider whether to implement a task as a background job, the main criterion is whether the task can run
without user interaction and without the UI needing to wait for the job to be completed. Tasks that require the user
or the UI to wait while they are completed might not be appropriate as background jobs.

Types of background jobs


Background jobs typically include one or more of the following types of jobs:
CPU-intensive jobs, such as mathematical calculations or structural model analysis.
I/O-intensive jobs, such as executing a series of storage transactions or indexing files.
Batch jobs, such as nightly data updates or scheduled processing.
Long-running workflows, such as order fulfillment, or provisioning services and systems.
Sensitive-data processing where the task is handed off to a more secure location for processing. For example,
you might not want to process sensitive data within a web app. Instead, you might use a pattern such as the
Gatekeeper pattern to transfer the data to an isolated background process that has access to protected storage.

Triggers
Background jobs can be initiated in several different ways. They fall into one of the following categories:
Event-driven triggers. The task is started in response to an event, typically an action taken by a user or a step
in a workflow.
Schedule-driven triggers. The task is invoked on a schedule based on a timer. This might be a recurring
schedule or a one-off invocation that is specified for a later time.
Event-driven triggers
Event-driven invocation uses a trigger to start the background task. Examples of using event-driven triggers
include:
The UI or another job places a message in a queue. The message contains data about an action that has taken
place, such as the user placing an order. The background task listens on this queue and detects the arrival of a
new message. It reads the message and uses the data in it as the input to the background job.
The UI or another job saves or updates a value in storage. The background task monitors the storage and
detects changes. It reads the data and uses it as the input to the background job.
The UI or another job makes a request to an endpoint, such as an HTTPS URI, or an API that is exposed as a web
service. It passes the data that is required to complete the background task as part of the request. The endpoint
or web service invokes the background task, which uses the data as its input.
Typical examples of tasks that are suited to event-driven invocation include image processing, workflows, sending
information to remote services, sending email messages, and provisioning new users in multitenant applications.
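As a concrete illustration of the first example above (a message placed in a queue), the following hedged Python sketch polls an Azure storage queue and processes each message. The azure-storage-queue package, the queue name, the message shape, and the process_order function are assumptions.

import json
import os
import time
from azure.storage.queue import QueueClient

def process_order(order):
    pass                                     # placeholder for the real background work

def run_worker(queue_name="orders"):
    queue = QueueClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"], queue_name)
    while True:
        for message in queue.receive_messages(visibility_timeout=300):
            order = json.loads(message.content)   # data supplied by the UI or another job
            process_order(order)
            queue.delete_message(message)         # delete only after successful processing
        time.sleep(5)                              # back off briefly when the queue is empty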
Schedule-driven triggers
Schedule-driven invocation uses a timer to start the background task. Examples of using schedule-driven triggers
include:
A timer that is running locally within the application or as part of the application's operating system invokes a
background task on a regular basis.
A timer that is running in a different application, such as Azure Logic Apps, sends a request to an API or web
service on a regular basis. The API or web service invokes the background task.
A separate process or application starts a timer that causes the background task to be invoked once after a
specified time delay, or at a specific time.
Typical examples of tasks that are suited to schedule-driven invocation include batch-processing routines (such as
updating related-products lists for users based on their recent behavior), routine data processing tasks (such as
updating indexes or generating accumulated results), data analysis for daily reports, data retention cleanup, and
data consistency checks.
If you use a schedule-driven task that must run as a single instance, be aware of the following:
If the compute instance that is running the scheduler (such as a virtual machine using Windows scheduled
tasks) is scaled, you will have multiple instances of the scheduler running. These could start multiple instances
of the task.
If tasks run for longer than the period between scheduler events, the scheduler may start another instance of
the task while the previous one is still running.
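One common way to guard against both problems is a distributed lock, for example a lease on an Azure Storage blob, so that only one scheduler instance runs the task at a time. The sketch below is a hedged illustration; it assumes the azure-storage-blob package, an existing locks container, an existing zero-byte blob named nightly-job.lock, and a task that finishes within the 60-second lease.

import os
from azure.core.exceptions import HttpResponseError
from azure.storage.blob import BlobClient

def try_run_singleton(task):
    blob = BlobClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"],
        container_name="locks", blob_name="nightly-job.lock")
    try:
        lease = blob.acquire_lease(lease_duration=60)   # fails if another instance holds the lease
    except HttpResponseError:
        return False                                     # another scheduler instance is running the task
    try:
        task()                                           # run the scheduled work exactly once
        return True
    finally:
        lease.release()

For tasks that can outlive the lease, the lease would need to be renewed periodically while the task runs.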

Returning results
Background jobs execute asynchronously in a separate process, or even in a separate location, from the UI or the
process that invoked the background task. Ideally, background tasks are "fire and forget" operations, and their
execution progress has no impact on the UI or the calling process. This means that the calling process does not
wait for completion of the tasks. Therefore, it cannot automatically detect when the task ends.
If you require a background task to communicate with the calling task to indicate progress or completion, you
must implement a mechanism for this. Some examples are:
Write a status indicator value to storage that is accessible to the UI or caller task, which can monitor or check
this value when required. Other data that the background task must return to the caller can be placed into the
same storage.
Establish a reply queue that the UI or caller listens on. The background task can send messages to the queue
that indicate status and completion. Data that the background task must return to the caller can be placed into
the messages. If you are using Azure Service Bus, you can use the ReplyTo and CorrelationId properties to
implement this capability.
Expose an API or endpoint from the background task that the UI or caller can access to obtain status
information. Data that the background task must return to the caller can be included in the response.
Have the background task call back to the UI or caller through an API to indicate status at predefined points or
on completion. This might be through events raised locally or through a publish-and-subscribe mechanism.
Data that the background task must return to the caller can be included in the request or event payload.
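The first option (a status indicator in shared storage) can be as simple as writing a small status document that the UI polls, as in this hedged Python sketch. The container name, blob name, and status fields are illustrative assumptions.

import json
import os
from datetime import datetime, timezone
from azure.storage.blob import BlobClient

def report_status(job_id, state, result=None):
    blob = BlobClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"],
        container_name="job-status", blob_name=f"{job_id}.json")
    blob.upload_blob(json.dumps({
        "state": state,                                   # for example "Running", "Completed", "Failed"
        "result": result,                                  # any data the caller needs back
        "updated": datetime.now(timezone.utc).isoformat(),
    }), overwrite=True)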

Hosting environment
You can host background tasks by using a range of different Azure platform services:
Azure Web Apps and WebJobs. You can use WebJobs to execute custom jobs based on a range of different
types of scripts or executable programs within the context of a web app.
Azure Virtual Machines. If you have a Windows service or want to use the Windows Task Scheduler, it is
common to host your background tasks within a dedicated virtual machine.
Azure Batch. Batch is a platform service that schedules compute-intensive work to run on a managed
collection of virtual machines. It can automatically scale compute resources.
Azure Kubernetes Service (AKS). Azure Kubernetes Service provides a managed hosting environment for
Kubernetes on Azure.
The following sections describe each of these options in more detail, and include considerations to help you
choose the appropriate option.
Azure Web Apps and WebJobs
You can use Azure WebJobs to execute custom jobs as background tasks within an Azure Web App. WebJobs run
within the context of your web app as a continuous process. WebJobs also run in response to a trigger event from
Azure Logic Apps or external factors, such as changes to storage blobs and message queues. Jobs can be started
and stopped on demand, and shut down gracefully. If a continuously running WebJob fails, it is automatically
restarted. Retry and error actions are configurable.
When you configure a WebJob:
If you want the job to respond to an event-driven trigger, you should configure it as Run continuously. The
script or program is stored in the folder named site/wwwroot/app_data/jobs/continuous.
If you want the job to respond to a schedule-driven trigger, you should configure it as Run on a schedule. The
script or program is stored in the folder named site/wwwroot/app_data/jobs/triggered.
If you choose the Run on demand option when you configure a job, it will execute the same code as the Run
on a schedule option when you start it.
Azure WebJobs run within the sandbox of the web app. This means that they can access environment variables
and share information, such as connection strings, with the web app. The job has access to the unique identifier of
the machine that is running the job. The connection string named AzureWebJobsStorage provides access to
Azure storage queues, blobs, and tables for application data, and access to Service Bus for messaging and
communication. The connection string named AzureWebJobsDashboard provides access to the job action log
files.
Azure WebJobs have the following characteristics:
Security: WebJobs are protected by the deployment credentials of the web app.
Supported file types: You can define WebJobs by using command scripts (.cmd), batch files (.bat), PowerShell
scripts (.ps1), bash shell scripts (.sh), PHP scripts (.php), Python scripts (.py), JavaScript code (.js), and executable
programs (.exe, .jar, and more).
Deployment: You can deploy scripts and executables by using the Azure portal, by using Visual Studio, by
using the Azure WebJobs SDK, or by copying them directly to the following locations:
For triggered execution: site/wwwroot/app_data/jobs/triggered/{job name}
For continuous execution: site/wwwroot/app_data/jobs/continuous/{job name}
Logging: Console.Out is treated (marked) as INFO. Console.Error is treated as ERROR. You can access
monitoring and diagnostics information by using the Azure portal. You can download log files directly from the
site. They are saved in the following locations:
For triggered execution: Vfs/data/jobs/triggered/jobName
For continuous execution: Vfs/data/jobs/continuous/jobName
Configuration: You can configure WebJobs by using the portal, the REST API, and PowerShell. You can use a
configuration file named settings.job in the same root directory as the job script to provide configuration
information for a job. For example:
{ "stopping_wait_time": 60 }
{ "is_singleton": true }
Considerations
By default, WebJobs scale with the web app. However, you can configure jobs to run on a single instance by
setting the is_singleton configuration property to true. Single-instance WebJobs are useful for tasks that you
do not want to scale or run as multiple simultaneous instances, such as reindexing, data analysis, and similar
tasks.
To minimize the impact of jobs on the performance of the web app, consider creating an empty Azure Web App
instance in a new App Service plan to host long-running or resource-intensive WebJobs.
Azure Virtual Machines
Background tasks might be implemented in a way that prevents them from being deployed to Azure Web Apps, or
these options might not be convenient. Typical examples are Windows services, and third-party utilities and
executable programs. Another example might be programs written for an execution environment that is different
than that hosting the application. For example, it might be a Unix or Linux program that you want to execute from
a Windows or .NET application. You can choose from a range of operating systems for an Azure virtual machine,
and run your service or executable on that virtual machine.
To help you choose when to use Virtual Machines, see Azure App Services, Cloud Services and Virtual Machines
comparison. For information about the options for Virtual Machines, see Sizes for Windows virtual machines in
Azure. For more information about the operating systems and prebuilt images that are available for Virtual
Machines, see Azure Virtual Machines Marketplace.
To initiate the background task in a separate virtual machine, you have a range of options:
You can execute the task on demand directly from your application by sending a request to an endpoint that the
task exposes. This passes in any data that the task requires. This endpoint invokes the task.
You can configure the task to run on a schedule by using a scheduler or timer that is available in your chosen
operating system. For example, on Windows you can use Windows Task Scheduler to execute scripts and tasks.
Or, if you have SQL Server installed on the virtual machine, you can use the SQL Server Agent to execute
scripts and tasks.
You can use Azure Logic Apps to initiate the task by adding a message to a queue that the task listens on, or by
sending a request to an API that the task exposes.
See the earlier section Triggers for more information about how you can initiate background tasks.
Considerations
Consider the following points when you are deciding whether to deploy background tasks in an Azure virtual
machine:
Hosting background tasks in a separate Azure virtual machine provides flexibility and allows precise control
over initiation, execution, scheduling, and resource allocation. However, it will increase runtime cost if a virtual
machine must be deployed just to run background tasks.
There is no facility to monitor the tasks in the Azure portal and no automated restart capability for failed tasks--
although you can monitor the basic status of the virtual machine and manage it by using the Azure Resource
Manager Cmdlets. However, there are no facilities to control processes and threads in compute nodes. Typically,
using a virtual machine will require additional effort to implement a mechanism that collects data from
instrumentation in the task, and from the operating system in the virtual machine. One solution that might be
appropriate is to use the System Center Management Pack for Azure.
You might consider creating monitoring probes that are exposed through HTTP endpoints. The code for these
probes could perform health checks, collect operational information and statistics--or collate error information
and return it to a management application. For more information, see the Health Endpoint Monitoring pattern.
For more information, see:
Virtual Machines
Azure Virtual Machines FAQ
Azure Batch
Consider Azure Batch if you need to run large, parallel high-performance computing (HPC) workloads across tens,
hundreds, or thousands of VMs.
The Batch service provisions the VMs, assigns tasks to the VMs, runs the tasks, and monitors the progress. Batch
can automatically scale out the VMs in response to the workload. Batch also provides job scheduling. Azure Batch
supports both Linux and Windows VMs.
Considerations
Batch works well with intrinsically parallel workloads. It can also perform parallel calculations with a reduce step at
the end, or run Message Passing Interface (MPI) applications for parallel tasks that require message passing
between nodes.
An Azure Batch job runs on a pool of nodes (VMs). One approach is to allocate a pool only when needed and then
delete it after the job completes. This maximizes utilization, because nodes are not idle, but the job must wait for
nodes to be allocated. Alternatively, you can create a pool ahead of time. That approach minimizes the time that it
takes for a job to start, but can result in having nodes that sit idle. For more information, see Pool and compute
node lifetime.
For more information, see:
What is Azure Batch?
Develop large-scale parallel compute solutions with Batch
Batch and HPC solutions for large-scale computing workloads
Azure Kubernetes Service
Azure Kubernetes Service (AKS) manages your hosted Kubernetes environment, which makes it easy to deploy
and manage containerized applications.
Containers can be useful for running background jobs. Some of the benefits include:
Containers support high-density hosting. You can isolate a background task in a container, while placing
multiple containers in each VM.
The container orchestrator handles internal load balancing, configuring the internal network, and other
configuration tasks.
Containers can be started and stopped as needed.
Azure Container Registry allows you to register your containers inside Azure boundaries. This comes with
security, privacy, and proximity benefits.
Considerations
Requires an understanding of how to use a container orchestrator. Depending on the skill set of your DevOps
team, this may or may not be an issue.
For more information, see:
Overview of containers in Azure
Introduction to private Docker container registries

Partitioning
If you decide to include background tasks within an existing compute instance, you must consider how this will
affect the quality attributes of the compute instance and the background task itself. These factors will help you to
decide whether to colocate the tasks with the existing compute instance or separate them out into a separate
compute instance:
Availability: Background tasks might not need to have the same level of availability as other parts of the
application, in particular the UI and other parts that are directly involved in user interaction. Background
tasks might be more tolerant of latency, retried connection failures, and other factors that affect availability
because the operations can be queued. However, there must be sufficient capacity to prevent the backup of
requests that could block queues and affect the application as a whole.
Scalability: Background tasks are likely to have a different scalability requirement than the UI and the
interactive parts of the application. Scaling the UI might be necessary to meet peaks in demand, while
outstanding background tasks might be completed during less busy times by fewer compute instances.
Resiliency: Failure of a compute instance that just hosts background tasks might not fatally affect the
application as a whole if the requests for these tasks can be queued or postponed until the task is available
again. If the compute instance and/or tasks can be restarted within an appropriate interval, users of the
application might not be affected.
Security: Background tasks might have different security requirements or restrictions than the UI or other
parts of the application. By using a separate compute instance, you can specify a different security
environment for the tasks. You can also use patterns such as Gatekeeper to isolate the background compute
instances from the UI in order to maximize security and separation.
Performance: You can choose the type of compute instance for background tasks to specifically match the
performance requirements of the tasks. This might mean using a less expensive compute option if the tasks
do not require the same processing capabilities as the UI, or a larger instance if they require additional
capacity and resources.
Manageability: Background tasks might have a different development and deployment rhythm from the
main application code or the UI. Deploying them to a separate compute instance can simplify updates and
versioning.
Cost: Adding compute instances to execute background tasks increases hosting costs. You should carefully
consider the trade-off between additional capacity and these extra costs.
For more information, see the Leader Election pattern and the Competing Consumers pattern.

Conflicts
If you have multiple instances of a background job, it is possible that they will compete for access to resources and
services, such as databases and storage. This concurrent access can result in resource contention, which might
cause conflicts in availability of the services and in the integrity of data in storage. You can resolve resource
contention by using a pessimistic locking approach. This prevents competing instances of a task from concurrently
accessing a service or corrupting data.
Another approach to resolve conflicts is to define background tasks as a singleton, so that there is only ever one
instance running. However, this eliminates the reliability and performance benefits that a multiple-instance
configuration can provide. This is especially true if the UI can supply sufficient work to keep more than one
background task busy.
It is vital to ensure that the background task can automatically restart and that it has sufficient capacity to cope
with peaks in demand. You can achieve this by allocating a compute instance with sufficient resources, by
implementing a queueing mechanism that can store requests for later execution when demand decreases, or by
using a combination of these techniques.
Coordination
The background tasks might be complex and might require multiple individual tasks to execute to produce a result
or to fulfill all the requirements. It is common in these scenarios to divide the task into smaller discrete steps or
subtasks that can be executed by multiple consumers. Multistep jobs can be more efficient and more flexible
because individual steps might be reusable in multiple jobs. It is also easy to add, remove, or modify the order of
the steps.
Coordinating multiple tasks and steps can be challenging, but there are three common patterns that you can use
to guide your implementation of a solution:
Decomposing a task into multiple reusable steps. An application might be required to perform a
variety of tasks of varying complexity on the information that it processes. A straightforward but inflexible
approach to implementing this application might be to perform this processing as a monolithic module.
However, this approach is likely to reduce the opportunities for refactoring the code, optimizing it, or
reusing it if parts of the same processing are required elsewhere within the application. For more
information, see the Pipes and Filters pattern.
Managing execution of the steps for a task. An application might perform tasks that comprise a
number of steps (some of which might invoke remote services or access remote resources). The individual
steps might be independent of each other, but they are orchestrated by the application logic that
implements the task. For more information, see Scheduler Agent Supervisor pattern.
Managing recovery for task steps that fail. An application might need to undo the work that is
performed by a series of steps (which together define an eventually consistent operation) if one or more of
the steps fail. For more information, see the Compensating Transaction pattern.

Resiliency considerations
Background tasks must be resilient in order to provide reliable services to the application. When you are planning
and designing background tasks, consider the following points:
Background tasks must be able to gracefully handle restarts without corrupting data or introducing
inconsistency into the application. For long-running or multistep tasks, consider using checkpointing by
saving the state of jobs in persistent storage, or as messages in a queue if this is appropriate. For example,
you can persist state information in a message in a queue and incrementally update this state information
with the task progress so that the task can be processed from the last known good checkpoint--instead of
restarting from the beginning. When using Azure Service Bus queues, you can use message sessions to
enable the same scenario. Sessions allow you to save and retrieve the application processing state by using
the SetState and GetState methods. For more information about designing reliable multistep processes and
workflows, see the Scheduler Agent Supervisor pattern.
When you use queues to communicate with background tasks, the queues can act as a buffer to store
requests that are sent to the tasks while the application is under higher than usual load. This allows the
tasks to catch up with the UI during less busy periods. It also means that restarts will not block the UI. For
more information, see the Queue-Based Load Leveling pattern. If some tasks are more important than
others, consider implementing the Priority Queue pattern to ensure that these tasks run before less
important ones.
Background tasks that are initiated by messages or process messages must be designed to handle
inconsistencies, such as messages arriving out of order, messages that repeatedly cause an error (often
referred to as poison messages), and messages that are delivered more than once. Consider the following:
Messages that must be processed in a specific order, such as those that change data based on the
existing data value (for example, adding a value to an existing value), might not arrive in the original
order in which they were sent. Alternatively, they might be handled by different instances of a
background task in a different order due to varying loads on each instance. Messages that must be
processed in a specific order should include a sequence number, key, or some other indicator that
background tasks can use to ensure that they are processed in the correct order. If you are using
Azure Service Bus, you can use message sessions to guarantee the order of delivery. However, it is
usually more efficient, where possible, to design the process so that the message order is not
important.
Typically, a background task will peek at messages in the queue, which temporarily hides them from
other message consumers. Then it deletes the messages after they have been successfully processed.
If a background task fails when processing a message, that message will reappear on the queue after
the peek time-out expires. It will be processed by another instance of the task or during the next
processing cycle of this instance. If the message consistently causes an error in the consumer, it will
block the task, the queue, and eventually the application itself when the queue becomes full.
Therefore, it is vital to detect and remove poison messages from the queue. If you are using Azure
Service Bus, messages that cause an error can be moved automatically or manually to an associated
dead letter queue.
Queues are guaranteed at least once delivery mechanisms, but they might deliver the same message
more than once. In addition, if a background task fails after processing a message but before deleting
it from the queue, the message will become available for processing again. Background tasks should
be idempotent, which means that processing the same message more than once does not cause an
error or inconsistency in the application's data. Some operations are naturally idempotent, such as
setting a stored value to a specific new value. However, operations such as adding a value to an
existing stored value without checking that the stored value is still the same as when the message
was originally sent will cause inconsistencies. Azure Service Bus queues can be configured to
automatically remove duplicated messages.
Some messaging systems, such as Azure storage queues and Azure Service Bus queues, support a
de-queue count property that indicates the number of times a message has been read from the
queue. This can be useful in handling repeated and poison messages. For more information, see
Asynchronous Messaging Primer and Idempotency Patterns.
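The following hedged Python sketch ties these points together for an Azure storage queue: it inspects the dequeue count, moves suspected poison messages to a separate queue, and deletes messages only after successful, idempotent processing. The queue names, the threshold of five attempts, and the handle function are assumptions.

import os
from azure.storage.queue import QueueClient

MAX_DEQUEUE_COUNT = 5

def handle(payload):
    pass                                           # placeholder for the real, idempotent handler

def drain_queue(source_name="work", poison_name="work-poison"):
    conn = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    source = QueueClient.from_connection_string(conn, source_name)
    poison = QueueClient.from_connection_string(conn, poison_name)
    for message in source.receive_messages():
        if message.dequeue_count > MAX_DEQUEUE_COUNT:
            poison.send_message(message.content)   # park the poison message for later inspection
            source.delete_message(message)
            continue
        try:
            handle(message.content)
            source.delete_message(message)          # delete only after successful processing
        except Exception:
            pass      # leave the message hidden; it reappears when the visibility timeout expires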

Scaling and performance considerations


Background tasks must offer sufficient performance to ensure they do not block the application, or cause
inconsistencies due to delayed operation when the system is under load. Typically, performance is improved by
scaling the compute instances that host the background tasks. When you are planning and designing background
tasks, consider the following points around scalability and performance:
Azure supports autoscaling (both scaling out and scaling back in) based on current demand and load or on
a predefined schedule, for Web Apps and Virtual Machines hosted deployments. Use this feature to ensure
that the application as a whole has sufficient performance capabilities while minimizing runtime costs.
Where background tasks have a different performance capability from the other parts of an application (for
example, the UI or components such as the data access layer), hosting the background tasks together in a
separate compute service allows the UI and background tasks to scale independently to manage the load. If
multiple background tasks have significantly different performance capabilities from each other, consider
dividing them and scaling each type independently. However, note that this might increase runtime costs.
Simply scaling the compute resources might not be sufficient to prevent loss of performance under load.
You might also need to scale storage queues and other resources to prevent a single point of the overall
processing chain from becoming a bottleneck. Also, consider other limitations, such as the maximum
throughput of storage and other services that the application and the background tasks rely on.
Background tasks must be designed for scaling. For example, they must be able to dynamically detect the
number of storage queues in use in order to listen on or send messages to the appropriate queue.
By default, WebJobs scale with their associated Azure Web Apps instance. However, if you want a WebJob to
run as only a single instance, you can create a Settings.job file that contains the JSON data {
"is_singleton": true } . This forces Azure to only run one instance of the WebJob, even if there are multiple
instances of the associated web app. This can be a useful technique for scheduled jobs that must run as only
a single instance.

Related patterns
Compute Partitioning Guidance
Caching

Caching is a common technique that aims to improve the performance and scalability of a system. It does this by
temporarily copying frequently accessed data to fast storage that's located close to the application. If this fast data
storage is located closer to the application than the original source, then caching can significantly improve
response times for client applications by serving data more quickly.
Caching is most effective when a client instance repeatedly reads the same data, especially if all the following
conditions apply to the original data store:
It remains relatively static.
It's slow compared to the speed of the cache.
It's subject to a high level of contention.
It's far away, and network latency can cause access to be slow.

Caching in distributed applications


Distributed applications typically implement either or both of the following strategies when caching data:
Using a private cache, where data is held locally on the computer that's running an instance of an application
or service.
Using a shared cache, serving as a common source that can be accessed by multiple processes and machines.
In both cases, caching can be performed client-side and server-side. Client-side caching is done by the process
that provides the user interface for a system, such as a web browser or desktop application. Server-side caching is
done by the process that provides the business services that are running remotely.
Private caching
The most basic type of cache is an in-memory store. It's held in the address space of a single process and
accessed directly by the code that runs in that process. This type of cache is quick to access. It can also provide an
effective means for storing modest amounts of static data, since the size of a cache is typically constrained by the
amount of memory available on the machine hosting the process.
If you need to cache more information than is physically possible in memory, you can write cached data to the
local file system. This will be slower to access than data held in memory, but should still be faster and more
reliable than retrieving data across a network.
If you have multiple instances of an application that uses this model running concurrently, each application
instance has its own independent cache holding its own copy of the data.
Think of a cache as a snapshot of the original data at some point in the past. If this data is not static, it is likely that
different application instances hold different versions of the data in their caches. Therefore, the same query
performed by these instances can return different results, as shown in Figure 1.
Figure 1: Using an in-memory cache in different instances of an application.
Shared caching
Using a shared cache can help alleviate concerns that data might differ in each cache, which can occur with in-
memory caching. Shared caching ensures that different application instances see the same view of cached data. It
does this by locating the cache in a separate location, typically hosted as part of a separate service, as shown in
Figure 2.

Figure 2: Using a shared cache.


An important benefit of the shared caching approach is the scalability it provides. Many shared cache services are
implemented by using a cluster of servers and use software to distribute the data across the cluster transparently.
An application instance simply sends a request to the cache service. The underlying infrastructure determines the
location of the cached data in the cluster. You can easily scale the cache by adding more servers.
There are two main disadvantages of the shared caching approach:
The cache is slower to access because it is no longer held locally to each application instance.
The requirement to implement a separate cache service might add complexity to the solution.

Considerations for using caching


The following sections describe in more detail the considerations for designing and using a cache.
Decide when to cache data
Caching can dramatically improve performance, scalability, and availability. The more data that you have and the
larger the number of users that need to access this data, the greater the benefits of caching become. That's
because caching reduces the latency and contention that's associated with handling large volumes of concurrent
requests in the original data store.
For example, a database might support a limited number of concurrent connections. Retrieving data from a
shared cache, however, rather than the underlying database, makes it possible for a client application to access
this data even if the number of available connections is currently exhausted. Additionally, if the database becomes
unavailable, client applications might be able to continue by using the data that's held in the cache.
Consider caching data that is read frequently but modified infrequently (for example, data that has a higher
proportion of read operations than write operations). However, we don't recommend that you use the cache as
the authoritative store of critical information. Instead, ensure that all changes that your application cannot afford
to lose are always saved to a persistent data store. This means that if the cache is unavailable, your application can
still continue to operate by using the data store, and you won't lose important information.
Determine how to cache data effectively
The key to using a cache effectively lies in determining the most appropriate data to cache, and caching it at the
appropriate time. The data can be added to the cache on demand the first time it is retrieved by an application.
This means that the application needs to fetch the data only once from the data store, and that subsequent access
can be satisfied by using the cache.
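This on-demand approach is the essence of the Cache-Aside pattern. The hedged Python sketch below shows the shape of such a lookup with a redis-py client; the key format, the one-hour expiration, and the data-store helper are assumptions, and the cache client itself could be created as shown later in the section on Azure Cache for Redis.

import json

def get_product(cache, product_id, load_from_store):
    key = f"product:{product_id}"                     # illustrative key format
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                     # cache hit
    product = load_from_store(product_id)             # cache miss: read from the original data store
    cache.set(key, json.dumps(product), ex=3600)      # add to the cache with a one-hour expiration
    return product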
Alternatively, a cache can be partially or fully populated with data in advance, typically when the application starts
(an approach known as seeding). However, it might not be advisable to implement seeding for a large cache
because this approach can impose a sudden, high load on the original data store when the application starts
running.
Often an analysis of usage patterns can help you decide whether to fully or partially prepopulate a cache, and to
choose the data to cache. For example, it can be useful to seed the cache with the static user profile data for
customers who use the application regularly (perhaps every day), but not for customers who use the application
only once a week.
Caching typically works well with data that is immutable or that changes infrequently. Examples include reference
information such as product and pricing information in an e-commerce application, or shared static resources
that are costly to construct. Some or all of this data can be loaded into the cache at application startup to
minimize demand on resources and to improve performance. It might also be appropriate to have a background
process that periodically updates reference data in the cache to ensure it is up-to-date, or that refreshes the cache
when reference data changes.
Caching is less useful for dynamic data, although there are some exceptions to this consideration (see the section
Cache highly dynamic data later in this article for more information). When the original data changes regularly,
either the cached information becomes stale very quickly or the overhead of synchronizing the cache with the
original data store reduces the effectiveness of caching.
Note that a cache does not have to include the complete data for an entity. For example, if a data item represents
a multivalued object such as a bank customer with a name, address, and account balance, some of these elements
might remain static (such as the name and address), while others (such as the account balance) might be more
dynamic. In these situations, it can be useful to cache the static portions of the data and retrieve (or calculate) only
the remaining information when it is required.
We recommend that you carry out performance testing and usage analysis to determine whether prepopulating
or on-demand loading of the cache, or a combination of both, is appropriate. The decision should be based on the
volatility and usage pattern of the data. Cache utilization and performance analysis are particularly important in
applications that encounter heavy loads and must be highly scalable. For example, in highly scalable scenarios it
might make sense to seed the cache to reduce the load on the data store at peak times.
Caching can also be used to avoid repeating computations while the application is running. If an operation
transforms data or performs a complicated calculation, it can save the results of the operation in the cache. If the
same calculation is required afterward, the application can simply retrieve the results from the cache.
An application can modify data that's held in a cache. However, we recommend thinking of the cache as a
transient data store that could disappear at any time. Do not store valuable data in the cache only; make sure that
you maintain the information in the original data store as well. This means that if the cache becomes unavailable,
you minimize the chance of losing data.
Cache highly dynamic data
When you store rapidly changing information in a persistent data store, it can impose an overhead on the system.
For example, consider a device that continually reports status or some other measurement. If an application
chooses not to cache this data on the basis that the cached information will nearly always be outdated, then the
same consideration could be true when storing and retrieving this information from the data store. In the time it
takes to save and fetch this data, it might have changed.
In a situation such as this, consider the benefits of storing the dynamic information directly in the cache instead of
in the persistent data store. If the data is noncritical and does not require auditing, then it doesn't matter if the
occasional change is lost.
Manage data expiration in a cache
In most cases, data that's held in a cache is a copy of data that's held in the original data store. The data in the
original data store might change after it was cached, causing the cached data to become stale. Many caching
systems enable you to configure the cache to expire data and reduce the period for which data may be out of
date.
When cached data expires, it's removed from the cache, and the application must retrieve the data from the
original data store (it can put the newly fetched information back into cache). You can set a default expiration
policy when you configure the cache. In many cache services, you can also stipulate the expiration period for
individual objects when you store them programmatically in the cache. Some caches enable you to specify the
expiration period as an absolute value, or as a sliding value that causes the item to be removed from the cache if
it is not accessed within the specified time. This setting overrides any cache-wide expiration policy, but only for
the specified objects.

NOTE
Consider the expiration period for the cache and the objects that it contains carefully. If you make it too short, objects will
expire too quickly and you will reduce the benefits of using the cache. If you make the period too long, you risk the data
becoming stale.
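Where a cache service exposes only absolute expiration, a sliding expiration can be approximated on the client by renewing the time-to-live on every successful read. This is a hedged sketch using a redis-py client; the 20-minute window is an illustrative value.

SLIDING_TTL_SECONDS = 1200        # 20 minutes, chosen only for illustration

def get_with_sliding_expiration(cache, key):
    value = cache.get(key)
    if value is not None:
        cache.expire(key, SLIDING_TTL_SECONDS)   # push the expiration out again on each access
    return value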

It's also possible that the cache might fill up if data is allowed to remain resident for a long time. In this case, any
requests to add new items to the cache might cause some items to be forcibly removed in a process known as
eviction. Cache services typically evict data on a least-recently-used (LRU) basis, but you can usually override this
policy and prevent items from being evicted. However, if you adopt this approach, you risk exceeding the memory
that's available in the cache. An application that attempts to add an item to the cache will fail with an exception.
Some caching implementations might provide additional eviction policies. There are several types of eviction
policies. These include:
A most-recently-used policy (in the expectation that the data will not be required again).
A first-in-first-out policy (oldest data is evicted first).
An explicit removal policy based on a triggered event (such as the data being modified).
Invalidate data in a client-side cache
Data that's held in a client-side cache is generally considered to be outside the auspices of the service that
provides the data to the client. A service cannot directly force a client to add or remove information from a client-
side cache.
This means that it's possible for a client that uses a poorly configured cache to continue using outdated
information. For example, if the expiration policies of the cache aren't properly implemented, a client might use
outdated information that's cached locally when the information in the original data source has changed.
If you are building a web application that serves data over an HTTP connection, you can implicitly force a web
client (such as a browser or web proxy) to fetch the most recent information. You can do this if a resource is
updated by a change in the URI of that resource. Web clients typically use the URI of a resource as the key in the
client-side cache, so if the URI changes, the web client ignores any previously cached versions of a resource and
fetches the new version instead.

Managing concurrency in a cache


Caches are often designed to be shared by multiple instances of an application. Each application instance can read
and modify data in the cache. Consequently, the same concurrency issues that arise with any shared data store
also apply to a cache. In a situation where an application needs to modify data that's held in the cache, you might
need to ensure that updates made by one instance of the application do not overwrite the changes made by
another instance.
Depending on the nature of the data and the likelihood of collisions, you can adopt one of two approaches to
concurrency:
Optimistic. Immediately prior to updating the data, the application checks to see whether the data in the
cache has changed since it was retrieved. If the data is still the same, the change can be made. Otherwise, the
application has to decide whether to update it. (The business logic that drives this decision will be application-
specific.) This approach is suitable for situations where updates are infrequent, or where collisions are unlikely
to occur.
Pessimistic. When it retrieves the data, the application locks it in the cache to prevent another instance from
changing it. This process ensures that collisions cannot occur, but it can also block other instances that need
to process the same data. Pessimistic concurrency can affect the scalability of a solution and is recommended
only for short-lived operations. This approach might be appropriate for situations where collisions are more
likely, especially if an application updates multiple items in the cache and must ensure that these changes are
applied consistently.
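With Redis, the optimistic approach is typically implemented with WATCH, MULTI, and EXEC. The following hedged redis-py sketch retries a read-modify-write when another instance changes the watched key; the retry count and the caller-supplied apply_change function are assumptions.

import json
import redis

def update_optimistically(cache, key, apply_change, retries=3):
    for _ in range(retries):
        with cache.pipeline() as pipe:
            try:
                pipe.watch(key)                       # abort the update if another writer touches this key
                current = json.loads(pipe.get(key) or "{}")
                updated = apply_change(current)       # business logic supplied by the caller
                pipe.multi()
                pipe.set(key, json.dumps(updated))
                pipe.execute()                        # raises WatchError if a collision occurred
                return True
            except redis.WatchError:
                continue                              # another instance changed the data; retry
    return False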
Implement high availability and scalability, and improve performance
Avoid using a cache as the primary repository of data; this is the role of the original data store from which the
cache is populated. The original data store is responsible for ensuring the persistence of the data.
Be careful not to introduce critical dependencies on the availability of a shared cache service into your solutions.
An application should be able to continue functioning if the service that provides the shared cache is unavailable.
The application should not become unresponsive or fail while waiting for the cache service to resume.
Therefore, the application must be prepared to detect the availability of the cache service and fall back to the
original data store if the cache is inaccessible. The Circuit-Breaker pattern is useful for handling this scenario. The
service that provides the cache can be recovered, and once it becomes available, the cache can be repopulated as
data is read from the original data store, following a strategy such as the Cache-aside pattern.
However, system scalability may be affected if the application falls back to the original data store when the cache
is temporarily unavailable. While the data store is being recovered, the original data store could be swamped with
requests for data, resulting in timeouts and failed connections.
Consider implementing a local, private cache in each instance of an application, together with the shared cache
that all application instances access. When the application retrieves an item, it can check first in its local cache,
then in the shared cache, and finally in the original data store. The local cache can be populated using the data in
either the shared cache, or in the database if the shared cache is unavailable.
This approach requires careful configuration to prevent the local cache from becoming too stale with respect to
the shared cache. However, the local cache acts as a buffer if the shared cache is unreachable. Figure 3 shows this
structure.

Figure 3: Using a local private cache with a shared cache.
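A hedged sketch of that local-then-shared-then-store lookup is shown below. The in-process dictionary stands in for a real local cache (which would also need expiration and size limits); the shared cache client and the data-store loader are assumptions.

local_cache = {}                                 # stand-in for a real local, private cache

def get_item(shared_cache, key, load_from_store):
    if key in local_cache:
        return local_cache[key]                  # 1. check the local, in-process cache
    value = shared_cache.get(key)                # 2. check the shared cache
    if value is None:
        value = load_from_store(key)             # 3. fall back to the original data store
        shared_cache.set(key, value)
    local_cache[key] = value
    return value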


To support large caches that hold relatively long-lived data, some cache services provide a high-availability option
that implements automatic failover if the cache becomes unavailable. This approach typically involves replicating
the cached data that's stored on a primary cache server to a secondary cache server, and switching to the
secondary server if the primary server fails or connectivity is lost.
To reduce the latency that's associated with writing to multiple destinations, the replication to the secondary
server might occur asynchronously when data is written to the cache on the primary server. This approach leads
to the possibility that some cached information might be lost in the event of a failure, but the proportion of this
data should be small compared to the overall size of the cache.
If a shared cache is large, it might be beneficial to partition the cached data across nodes to reduce the chances of
contention and improve scalability. Many shared caches support the ability to dynamically add (and remove)
nodes and rebalance the data across partitions. This approach might involve clustering, in which the collection of
nodes is presented to client applications as a seamless, single cache. Internally, however, the data is dispersed
between nodes following a predefined distribution strategy that balances the load evenly. For more information
about possible partitioning strategies, see Data partitioning guidance.
Clustering can also increase the availability of the cache. If a node fails, the remainder of the cache is still
accessible. Clustering is frequently used in conjunction with replication and failover. Each node can be replicated,
and the replica can be quickly brought online if the node fails.
Many read and write operations are likely to involve single data values or objects. However, at times it might be
necessary to store or retrieve large volumes of data quickly. For example, seeding a cache could involve writing
hundreds or thousands of items to the cache. An application might also need to retrieve a large number of related
items from the cache as part of the same request.
Many large-scale caches provide batch operations for these purposes. This enables a client application to package
up a large volume of items into a single request and reduces the overhead that's associated with performing a
large number of small requests.

Caching and eventual consistency


For the cache-aside pattern to work, the instance of the application that populates the cache must have access to
the most recent and consistent version of the data. In a system that implements eventual consistency (such as a
replicated data store) this might not be the case.
One instance of an application could modify a data item and invalidate the cached version of that item. Another
instance of the application might attempt to read this item from a cache, which causes a cache-miss, so it reads
the data from the data store and adds it to the cache. However, if the data store has not been fully synchronized
with the other replicas, the application instance could read and populate the cache with the old value.
For more information about handling data consistency, see the Data consistency primer.
Protect cached data
Irrespective of the cache service you use, consider how to protect the data that's held in the cache from
unauthorized access. There are two main concerns:
The privacy of the data in the cache.
The privacy of data as it flows between the cache and the application that's using the cache.
To protect data in the cache, the cache service might implement an authentication mechanism that requires that
applications specify the following:
Which identities can access data in the cache.
Which operations (read and write) that these identities are allowed to perform.
To reduce overhead that's associated with reading and writing data, after an identity has been granted write
and/or read access to the cache, that identity can use any data in the cache.
If you need to restrict access to subsets of the cached data, you can do one of the following:
Split the cache into partitions (by using different cache servers) and only grant access to identities for the
partitions that they should be allowed to use.
Encrypt the data in each subset by using different keys, and provide the encryption keys only to identities that
should have access to each subset. A client application might still be able to retrieve all of the data in the cache,
but it will only be able to decrypt the data for which it has the keys.
You must also protect the data as it flows in and out of the cache. To do this, you depend on the security features
provided by the network infrastructure that client applications use to connect to the cache. If the cache is
implemented using an on-site server within the same organization that hosts the client applications, then the
isolation of the network itself might not require you to take additional steps. If the cache is located remotely and
requires a TCP or HTTP connection over a public network (such as the Internet), consider implementing SSL.
Considerations for implementing caching in Azure
Azure Cache for Redis is an implementation of the open source Redis cache that runs as a service in an Azure
datacenter. It provides a caching service that can be accessed from any Azure application, whether the application
is implemented as a cloud service, a website, or inside an Azure virtual machine. Caches can be shared by client
applications that have the appropriate access key.
Azure Cache for Redis is a high-performance caching solution that provides availability, scalability and security. It
typically runs as a service spread across one or more dedicated machines. It attempts to store as much
information as it can in memory to ensure fast access. This architecture is intended to provide low latency and
high throughput by reducing the need to perform slow I/O operations.
Azure Cache for Redis is compatible with many of the various APIs that are used by client applications. If you have
existing applications that already use Redis running on-premises, Azure Cache for Redis
provides a quick migration path to caching in the cloud.
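For example, a Python client might connect with redis-py over the SSL port by using the cache's host name and access key, as in this hedged snippet (the cache name and the environment variable are assumptions).

import os
import redis

cache = redis.Redis(
    host="contoso.redis.cache.windows.net",      # hypothetical cache host name
    port=6380,                                    # SSL port for Azure Cache for Redis
    password=os.environ["REDIS_ACCESS_KEY"],      # the cache access key
    ssl=True)

cache.set("greeting", "hello", ex=60)             # write a value with a 60-second expiration
print(cache.get("greeting"))                      # b'hello'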
Features of Redis
Redis is more than a simple cache server. It provides a distributed in-memory database with an extensive
command set that supports many common scenarios. These are described later in this document, in the section
Using Redis caching. This section summarizes some of the key features that Redis provides.
Redis as an in-memory database
Redis supports both read and write operations. In Redis, writes can be protected from system failure either by
being stored periodically in a local snapshot file or in an append-only log file. This is not the case in many caches
(which should be considered transitory data stores).
All writes are asynchronous and do not block clients from reading and writing data. When Redis starts running, it
reads the data from the snapshot or log file and uses it to construct the in-memory cache. For more information,
see Redis persistence on the Redis website.

NOTE
Redis does not guarantee that all writes will be saved in the event of a catastrophic failure, but at worst you might lose only
a few seconds worth of data. Remember that a cache is not intended to act as an authoritative data source, and it is the
responsibility of the applications using the cache to ensure that critical data is saved successfully to an appropriate data
store. For more information, see the Cache-aside pattern.

Redis data types


Redis is a key-value store, where values can contain simple types or complex data structures such as hashes, lists,
and sets. It supports a set of atomic operations on these data types. Keys can be permanent or tagged with a
limited time-to-live, at which point the key and its corresponding value are automatically removed from the
cache. For more information about Redis keys and values, visit the page An introduction to Redis data types and
abstractions on the Redis website.
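As a brief, hedged illustration with redis-py (reusing a client named cache such as the one shown earlier), the snippet below stores a hash, a list, and a set, and attaches a time-to-live to one of the keys. The key names and values are purely illustrative.

cache.hset("customer:42", mapping={"name": "Ava", "city": "Seattle"})   # hash of fields
cache.lpush("recent:42", "order-1001", "order-1002")                    # list of recent orders
cache.sadd("tags:42", "premium", "newsletter")                          # set of tags
cache.expire("recent:42", 3600)                                         # key expires after one hour
print(cache.hgetall("customer:42"))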
Redis replication and clustering
Redis supports primary/subordinate replication to help ensure availability and maintain throughput. Write
operations to a Redis primary node are replicated to one or more subordinate nodes. Read operations can be
served by the primary or any of the subordinates.
In the event of a network partition, subordinates can continue to serve data and then transparently resynchronize
with the primary when the connection is reestablished. For further details, visit the Replication page on the Redis
website.
Redis also provides clustering, which enables you to transparently partition data into shards across servers and
spread the load. This feature improves scalability, because new Redis servers can be added and the data
repartitioned as the size of the cache increases.
Furthermore, each server in the cluster can be replicated by using primary/subordinate replication. This ensures
availability across each node in the cluster. For more information about clustering and sharding, visit the Redis
cluster tutorial page on the Redis website.
Redis memory use
A Redis cache has a finite size that depends on the resources available on the host computer. When you configure
a Redis server, you can specify the maximum amount of memory it can use. You can also configure a key in a
Redis cache to have an expiration time, after which it is automatically removed from the cache. This feature can
help prevent the in-memory cache from filling with old or stale data.
As memory fills up, Redis can automatically evict keys and their values by following a number of policies. The
default is LRU (least recently used), but you can also select other policies such as evicting keys at random or
turning off eviction altogether (in which case, attempts to add items to the cache fail if it is full). The page Using
Redis as an LRU cache provides more information.
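As an illustration, a self-hosted Redis server might be given a memory limit and eviction policy through redis.conf. The values shown are placeholders; Azure Cache for Redis exposes the eviction policy through the portal instead of this file:

# Cap the cache at 2 GB and evict the least recently used keys when the limit is reached
maxmemory 2gb
maxmemory-policy allkeys-lru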
Redis transactions and batches
Redis enables a client application to submit a series of operations that read and write data in the cache as an
atomic transaction. All the commands in the transaction are guaranteed to run sequentially, and no commands
issued by other concurrent clients will be interwoven between them.
However, these are not true transactions as a relational database would perform them. Transaction processing
consists of two stages--the first is when the commands are queued, and the second is when the commands are
run. During the command queuing stage, the commands that comprise the transaction are submitted by the
client. If some sort of error occurs at this point (such as a syntax error, or the wrong number of parameters) then
Redis refuses to process the entire transaction and discards it.
During the run phase, Redis performs each queued command in sequence. If a command fails during this phase,
Redis continues with the next queued command and does not roll back the effects of any commands that have
already been run. This simplified form of transaction helps to maintain performance and avoid performance
problems that are caused by contention.
Redis does implement a form of optimistic locking to assist in maintaining consistency. For detailed information
about transactions and locking with Redis, visit the Transactions page on the Redis website.
Redis also supports nontransactional batching of requests. The Redis protocol that clients use to send commands
to a Redis server enables a client to send a series of operations as part of the same request. This can help to
reduce packet fragmentation on the network. When the batch is processed, each command is performed. If any of
these commands are malformed, they will be rejected (which doesn't happen with a transaction), but the
remaining commands will be performed. There is also no guarantee about the order in which the commands in
the batch will be processed.
Redis security
Redis is focused purely on providing fast access to data, and is designed to run inside a trusted environment that
can be accessed only by trusted clients. Redis supports a limited security model based on password
authentication. (It is possible to remove authentication completely, although we don't recommend this.)
All authenticated clients share the same global password and have access to the same resources. If you need
more comprehensive sign-in security, you must implement your own security layer in front of the Redis server,
and all client requests should pass through this additional layer. Redis should not be directly exposed to untrusted
or unauthenticated clients.
You can restrict access to commands by disabling them or renaming them (and by providing only privileged
clients with the new names).
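For example, on a self-hosted server you might disable or rename administrative commands in redis.conf. The renamed value below is an arbitrary placeholder:

# Disable FLUSHALL entirely and hide CONFIG behind a name known only to administrators
rename-command FLUSHALL ""
rename-command CONFIG "admin-config-4f2a9b"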
Redis does not directly support any form of data encryption, so all encoding must be performed by client
applications. Additionally, Redis does not provide any form of transport security. If you need to protect data as it
flows across the network, we recommend implementing an SSL proxy.
For more information, visit the Redis security page on the Redis website.

NOTE
Azure Cache for Redis provides its own security layer through which clients connect. The underlying Redis servers are not
exposed to the public network.

Azure Redis cache


Azure Cache for Redis provides access to Redis servers that are hosted at an Azure datacenter. It acts as a façade
that provides access control and security. You can provision a cache by using the Azure portal.
The portal provides a number of predefined configurations. These range from a 53 GB cache running as a
dedicated service that supports SSL communications (for privacy) and primary/subordinate replication with an
SLA of 99.9% availability, down to a 250 MB cache without replication (no availability guarantees) running on
shared hardware.
Using the Azure portal, you can also configure the eviction policy of the cache, and control access to the cache by
adding users to the roles provided. These roles, which define the operations that members can perform, include
Owner, Contributor, and Reader. For example, members of the Owner role have complete control over the cache
(including security) and its contents, members of the Contributor role can read and write information in the cache,
and members of the Reader role can only retrieve data from the cache.
Most administrative tasks are performed through the Azure portal. For this reason, many of the administrative
commands that are available in the standard version of Redis are not available, including the ability to modify the
configuration programmatically, shut down the Redis server, configure additional subordinates, or forcibly save
data to disk.
The Azure portal includes a convenient graphical display that enables you to monitor the performance of the
cache. For example, you can view the number of connections being made, the number of requests being
performed, the volume of reads and writes, and the number of cache hits versus cache misses. Using this
information, you can determine the effectiveness of the cache and if necessary, switch to a different configuration
or change the eviction policy.
Additionally, you can create alerts that send email messages to an administrator if one or more critical metrics fall
outside of an expected range. For example, you might want to alert an administrator if the number of cache
misses exceeds a specified value in the last hour, because it means the cache might be too small or data might be
being evicted too quickly.
You can also monitor the CPU, memory, and network usage for the cache.
For further information and examples showing how to create and configure an Azure Cache for Redis, visit the
page Lap around Azure Cache for Redis on the Azure blog.

Caching session state and HTML output


If you're building ASP.NET web applications that run by using Azure web roles, you can save session state
information and HTML output in an Azure Cache for Redis. The session state provider for Azure Cache for Redis
enables you to share session information between different instances of an ASP.NET web application, and is very
useful in web farm situations where client-server affinity is not available and caching session data in-memory
would not be appropriate.
Using the session state provider with Azure Cache for Redis delivers several benefits, including:
Sharing session state with a large number of instances of ASP.NET web applications.
Providing improved scalability.
Supporting controlled, concurrent access to the same session state data for multiple readers and a single
writer.
Using compression to save memory and improve network performance.
For more information, see ASP.NET session state provider for Azure Cache for Redis.

NOTE
Do not use the session state provider for Azure Cache for Redis with ASP.NET applications that run outside of the Azure
environment. The latency of accessing the cache from outside of Azure can eliminate the performance benefits of caching
data.

Similarly, the output cache provider for Azure Cache for Redis enables you to save the HTTP responses generated
by an ASP.NET web application. Using the output cache provider with Azure Cache for Redis can improve the
response times of applications that render complex HTML output. Application instances that generate similar
responses can use the shared output fragments in the cache rather than generating this HTML output afresh. For
more information, see ASP.NET output cache provider for Azure Cache for Redis.

Building a custom Redis cache


Azure Cache for Redis acts as a façade to the underlying Redis servers. If you require an advanced configuration
that is not covered by the Azure Redis cache (such as a cache bigger than 53 GB) you can build and host your own
Redis servers by using Azure virtual machines.
This is a potentially complex process because you might need to create several VMs to act as primary and
subordinate nodes if you want to implement replication. Furthermore, if you wish to create a cluster, then you
need multiple primaries and subordinate servers. A minimal clustered replication topology that provides a high
degree of availability and scalability comprises at least six VMs organized as three pairs of primary/subordinate
servers (a cluster must contain at least three primary nodes).
Each primary/subordinate pair should be located close together to minimize latency. However, each set of pairs
can be running in different Azure datacenters located in different regions, if you wish to locate cached data close
to the applications that are most likely to use it. For an example of building and configuring a Redis node running
as an Azure VM, see Running Redis on a CentOS Linux VM in Azure.

NOTE
If you implement your own Redis cache in this way, you are responsible for monitoring, managing, and securing the service.

Partitioning a Redis cache


Partitioning the cache involves splitting the cache across multiple computers. This structure gives you several
advantages over using a single cache server, including:
Creating a cache that is much bigger than can be stored on a single server.
Distributing data across servers, improving availability. If one server fails or becomes inaccessible, the data
that it holds is unavailable, but the data on the remaining servers can still be accessed. For a cache, this is not
crucial because the cached data is only a transient copy of the data that's held in a database. Cached data on a
server that becomes inaccessible can be cached on a different server instead.
Spreading the load across servers, thereby improving performance and scalability.
Geolocating data close to the users that access it, thus reducing latency.
For a cache, the most common form of partitioning is sharding. In this strategy, each partition (or shard) is a Redis
cache in its own right. Data is directed to a specific partition by using sharding logic, which can use a variety of
approaches to distribute the data. The Sharding pattern provides more information about implementing sharding.
To implement partitioning in a Redis cache, you can take one of the following approaches:
Server-side query routing. In this technique, a client application sends a request to any of the Redis servers
that comprise the cache (probably the closest server). Each Redis server stores metadata that describes the
partition that it holds, and also contains information about which partitions are located on other servers. The
Redis server examines the client request. If it can be resolved locally, it will perform the requested operation.
Otherwise it will forward the request on to the appropriate server. This model is implemented by Redis
clustering, and is described in more detail on the Redis cluster tutorial page on the Redis website. Redis
clustering is transparent to client applications, and additional Redis servers can be added to the cluster (and
the data re-partitioned) without requiring that you reconfigure the clients.
Client-side partitioning. In this model, the client application contains logic (possibly in the form of a library)
that routes requests to the appropriate Redis server. This approach can be used with Azure Cache for Redis.
Create multiple Azure Cache for Redis (one for each data partition) and implement the client-side logic that
routes the requests to the correct cache. If the partitioning scheme changes (if additional Azure Cache for
Redis are created, for example), client applications might need to be reconfigured.
Proxy-assisted partitioning. In this scheme, client applications send requests to an intermediary proxy service
which understands how the data is partitioned and then routes the request to the appropriate Redis server.
This approach can also be used with Azure Cache for Redis; the proxy service can be implemented as an Azure
cloud service. This approach requires an additional level of complexity to implement the service, and requests
might take longer to perform than using client-side partitioning.
The page Partitioning: how to split data among multiple Redis instances on the Redis website provides further
information about implementing partitioning with Redis.
Implement Redis cache client applications
Redis supports client applications written in numerous programming languages. If you are building new
applications by using the .NET Framework, the recommended approach is to use the StackExchange.Redis client
library. This library provides a .NET Framework object model that abstracts the details for connecting to a Redis
server, sending commands, and receiving responses. It is available in Visual Studio as a NuGet package. You can
use this same library to connect to an Azure Cache for Redis, or a custom Redis cache hosted on a VM.
To connect to a Redis server you use the static Connect method of the ConnectionMultiplexer class. The
connection that this method creates is designed to be used throughout the lifetime of the client application, and
the same connection can be used by multiple concurrent threads. Do not reconnect and disconnect each time you
perform a Redis operation because this can degrade performance.
You can specify the connection parameters, such as the address of the Redis host and the password. If you are
using Azure Cache for Redis, the password is either the primary or secondary key that is generated for Azure
Cache for Redis by using the Azure portal.
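The following sketch shows one common way to share a single connection for the lifetime of the application. The host name and access key are placeholders, and the ssl and abortConnect options shown are typical settings when connecting to Azure Cache for Redis:

// Create the multiplexer once and reuse it; Lazy<T> provides thread-safe initialization
private static readonly Lazy<ConnectionMultiplexer> lazyConnection =
    new Lazy<ConnectionMultiplexer>(() => ConnectionMultiplexer.Connect(
        "<your DNS name>.redis.cache.windows.net:6380,password=<access key>,ssl=True,abortConnect=False"));

public static ConnectionMultiplexer Connection => lazyConnection.Value;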
After you have connected to the Redis server, you can obtain a handle on the Redis database that acts as the
cache. The Redis connection provides the GetDatabase method to do this. You can then retrieve items from the
cache and store data in the cache by using the StringGet and StringSet methods. These methods expect a key
as a parameter, and return the item either in the cache that has a matching value ( StringGet ) or add the item to
the cache with this key ( StringSet ).
Depending on the location of the Redis server, many operations might incur some latency while a request is
transmitted to the server and a response is returned to the client. The StackExchange library provides
asynchronous versions of many of the methods that it exposes to help client applications remain responsive.
These methods support the Task-based Asynchronous pattern in the .NET Framework.
The following code snippet shows a method named RetrieveItem . It illustrates an implementation of the cache-
aside pattern based on Redis and the StackExchange library. The method takes a string key value and attempts to
retrieve the corresponding item from the Redis cache by calling the StringGetAsync method (the asynchronous
version of StringGet ).
If the item is not found, it is fetched from the underlying data source using the GetItemFromDataSourceAsync
method (which is a local method and not part of the StackExchange library). It's then added to the cache by using
the StringSetAsync method so it can be retrieved more quickly next time.

// Connect to the Azure Redis cache
ConfigurationOptions config = new ConfigurationOptions();
config.EndPoints.Add("<your DNS name>.redis.cache.windows.net");
config.Password = "<Redis cache key from management portal>";
ConnectionMultiplexer redisHostConnection = ConnectionMultiplexer.Connect(config);
IDatabase cache = redisHostConnection.GetDatabase();
...
private async Task<string> RetrieveItem(string itemKey)
{
    // Attempt to retrieve the item from the Redis cache
    string itemValue = await cache.StringGetAsync(itemKey);

    // If the value returned is null, the item was not found in the cache
    // So retrieve the item from the data source and add it to the cache
    if (itemValue == null)
    {
        itemValue = await GetItemFromDataSourceAsync(itemKey);
        await cache.StringSetAsync(itemKey, itemValue);
    }

    // Return the item
    return itemValue;
}

The StringGet and StringSet methods are not restricted to retrieving or storing string values. They can take any
item that is serialized as an array of bytes. If you need to save a .NET object, you can serialize it as a byte stream
and use the StringSet method to write it to the cache.
Similarly, you can read an object from the cache by using the StringGet method and deserializing it as a .NET
object. The following code shows a set of extension methods for the IDatabase interface (the GetDatabase method
of a Redis connection returns an IDatabase object), and some sample code that uses these methods to read and
write a BlogPost object to the cache:
public static class RedisCacheExtensions
{
    public static async Task<T> GetAsync<T>(this IDatabase cache, string key)
    {
        return Deserialize<T>(await cache.StringGetAsync(key));
    }

    public static async Task<object> GetAsync(this IDatabase cache, string key)
    {
        return Deserialize<object>(await cache.StringGetAsync(key));
    }

    public static async Task SetAsync(this IDatabase cache, string key, object value)
    {
        await cache.StringSetAsync(key, Serialize(value));
    }

    static byte[] Serialize(object o)
    {
        byte[] objectDataAsStream = null;

        if (o != null)
        {
            BinaryFormatter binaryFormatter = new BinaryFormatter();
            using (MemoryStream memoryStream = new MemoryStream())
            {
                binaryFormatter.Serialize(memoryStream, o);
                objectDataAsStream = memoryStream.ToArray();
            }
        }

        return objectDataAsStream;
    }

    static T Deserialize<T>(byte[] stream)
    {
        T result = default(T);

        if (stream != null)
        {
            BinaryFormatter binaryFormatter = new BinaryFormatter();
            using (MemoryStream memoryStream = new MemoryStream(stream))
            {
                result = (T)binaryFormatter.Deserialize(memoryStream);
            }
        }

        return result;
    }
}

The following code illustrates a method named RetrieveBlogPost that uses these extension methods to read and
write a serializable BlogPost object to the cache following the cache-aside pattern:
// The BlogPost type
[Serializable]
public class BlogPost
{
    private HashSet<string> tags;

    public BlogPost(int id, string title, int score, IEnumerable<string> tags)
    {
        this.Id = id;
        this.Title = title;
        this.Score = score;
        this.tags = new HashSet<string>(tags);
    }

    public int Id { get; set; }
    public string Title { get; set; }
    public int Score { get; set; }
    public ICollection<string> Tags => this.tags;
}
...
private async Task<BlogPost> RetrieveBlogPost(string blogPostKey)
{
    BlogPost blogPost = await cache.GetAsync<BlogPost>(blogPostKey);
    if (blogPost == null)
    {
        blogPost = await GetBlogPostFromDataSourceAsync(blogPostKey);
        await cache.SetAsync(blogPostKey, blogPost);
    }

    return blogPost;
}

Redis supports command pipelining if a client application sends multiple asynchronous requests. Redis can
multiplex the requests using the same connection rather than receiving and responding to commands in a strict
sequence.
This approach helps to reduce latency by making more efficient use of the network. The following code snippet
shows an example that retrieves the details of two customers concurrently. The code submits two requests and
then performs some other processing (not shown) before waiting to receive the results. The Wait method of the
cache object is similar to the .NET Framework Task.Wait method:

ConnectionMultiplexer redisHostConnection = ...;
IDatabase cache = redisHostConnection.GetDatabase();
...
var task1 = cache.StringGetAsync("customer:1");
var task2 = cache.StringGetAsync("customer:2");
...
var customer1 = cache.Wait(task1);
var customer2 = cache.Wait(task2);

For additional information on writing client applications that can use Azure Cache for Redis, see the Azure Cache
for Redis documentation. More information is also available at StackExchange.Redis.
The page Pipelines and multiplexers on the same website provides more information about asynchronous
operations and pipelining with Redis and the StackExchange library.

Using Redis caching


The simplest use of Redis for caching is to store key-value pairs where the value is an uninterpreted string of
arbitrary length that can contain any binary data. (It is essentially an array of bytes that can be treated as a string.)
This scenario was illustrated in the section Implement Redis Cache client applications earlier in this article.
Note that keys also contain uninterpreted data, so you can use any binary information as the key. The longer the
key is, however, the more space it will take to store, and the longer it will take to perform lookup operations. For
usability and ease of maintenance, design your keyspace carefully and use meaningful (but not verbose) keys.
For example, use structured keys such as "customer:100" to represent the key for the customer with ID 100 rather
than simply "100". This scheme enables you to easily distinguish between values that store different data types.
For example, you could also use the key "orders:100" to represent the key for the order with ID 100.
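A trivial sketch of this convention follows; the helper methods are purely illustrative:

// Build structured keys so that customer and order entries never collide
static string CustomerKey(int customerId) => $"customer:{customerId}";
static string OrderKey(int orderId) => $"order:{orderId}";
...
string customerData = await cache.StringGetAsync(CustomerKey(100));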
Apart from one-dimensional binary strings, a value in a Redis key-value pair can also hold more structured
information, including lists, sets (sorted and unsorted), and hashes. Redis provides a comprehensive command
set that can manipulate these types, and many of these commands are available to .NET Framework applications
through a client library such as StackExchange. The page An introduction to Redis data types and abstractions on
the Redis website provides a more detailed overview of these types and the commands that you can use to
manipulate them.
This section summarizes some common use cases for these data types and commands.
Perform atomic and batch operations
Redis supports a series of atomic get-and-set operations on string values. These operations remove the possible
race hazards that might occur when using separate GET and SET commands. The operations that are available
include:
INCR , INCRBY , DECR , and DECRBY , which perform atomic increment and decrement operations on integer
numeric data values. The StackExchange library provides overloaded versions of the
IDatabase.StringIncrementAsync and IDatabase.StringDecrementAsync methods to perform these
operations and return the resulting value that is stored in the cache. The following code snippet illustrates
how to use these methods:

ConnectionMultiplexer redisHostConnection = ...;
IDatabase cache = redisHostConnection.GetDatabase();
...
await cache.StringSetAsync("data:counter", 99);
...
long oldValue = await cache.StringIncrementAsync("data:counter");
// Increment by 1 (the default)
// oldValue should be 100

long newValue = await cache.StringDecrementAsync("data:counter", 50);
// Decrement by 50
// newValue should be 50

GETSET , which retrieves the value that's associated with a key and changes it to a new value. The
StackExchange library makes this operation available through the IDatabase.StringGetSetAsync method.
The code snippet below shows an example of this method. This code returns the current value that's
associated with the key "data:counter" from the previous example. Then it resets the value for this key back
to zero, all as part of the same operation:

ConnectionMultiplexer redisHostConnection = ...;
IDatabase cache = redisHostConnection.GetDatabase();
...
string oldValue = await cache.StringGetSetAsync("data:counter", 0);

MGET and MSET, which can return or change a set of string values as a single operation. The
IDatabase.StringGetAsync and IDatabase.StringSetAsync methods are overloaded to support this
functionality, as shown in the following example:
ConnectionMultiplexer redisHostConnection = ...;
IDatabase cache = redisHostConnection.GetDatabase();
...
// Create a list of key-value pairs
var keysAndValues =
    new List<KeyValuePair<RedisKey, RedisValue>>()
    {
        new KeyValuePair<RedisKey, RedisValue>("data:key1", "value1"),
        new KeyValuePair<RedisKey, RedisValue>("data:key99", "value2"),
        new KeyValuePair<RedisKey, RedisValue>("data:key322", "value3")
    };

// Store the list of key-value pairs in the cache
cache.StringSet(keysAndValues.ToArray());
...
// Find all values that match a list of keys
RedisKey[] keys = { "data:key1", "data:key99", "data:key322" };
// values should contain { "value1", "value2", "value3" }
RedisValue[] values = cache.StringGet(keys);

You can also combine multiple operations into a single Redis transaction as described in the Redis transactions
and batches section earlier in this article. The StackExchange library provides support for transactions through the
ITransaction interface.

You create an ITransaction object by using the IDatabase.CreateTransaction method. You invoke commands to
the transaction by using the methods provided by the ITransaction object.
The ITransaction interface provides access to a set of methods that's similar to those accessed by the IDatabase
interface, except that all the methods are asynchronous. This means that they are only performed when the
ITransaction.Execute method is invoked. The value that's returned by the ITransaction.Execute method
indicates whether the transaction ran successfully (true) or whether it failed (false).
The following code snippet shows an example that increments and decrements two counters as part of the same
transaction:

ConnectionMultiplexer redisHostConnection = ...;
IDatabase cache = redisHostConnection.GetDatabase();
...
ITransaction transaction = cache.CreateTransaction();
var tx1 = transaction.StringIncrementAsync("data:counter1");
var tx2 = transaction.StringDecrementAsync("data:counter2");
bool result = transaction.Execute();
Console.WriteLine("Transaction {0}", result ? "succeeded" : "failed");
Console.WriteLine("Result of increment: {0}", tx1.Result);
Console.WriteLine("Result of decrement: {0}", tx2.Result);

Remember that Redis transactions are unlike transactions in relational databases. The Execute method simply
queues all the commands that comprise the transaction to be run, and if any of them is malformed then the
transaction is stopped. If all the commands have been queued successfully, each command runs asynchronously.
If any command fails, the others still continue processing. If you need to verify that a command has completed
successfully, you must fetch the results of the command by using the Result property of the corresponding task,
as shown in the example above. Reading the Result property will block the calling thread until the task has
completed.
For more information, see Transactions in Redis.
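The StackExchange library exposes the form of optimistic locking mentioned earlier through conditions attached to a transaction. The following sketch is illustrative rather than prescriptive: the transaction only commits if the watched value is unchanged when Execute runs, so the caller can detect a conflicting update and retry:

ITransaction transaction = cache.CreateTransaction();
// Only commit if no other client has modified the counter since we read the value 100
transaction.AddCondition(Condition.StringEqual("data:counter1", 100));
var setTask = transaction.StringSetAsync("data:counter1", 200);
bool committed = await transaction.ExecuteAsync();
if (!committed)
{
    // The condition failed; re-read the value and retry if appropriate
}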
When performing batch operations, you can use the IBatch interface of the StackExchange library. This interface
provides access to a set of methods similar to those accessed by the IDatabase interface, except that all the
methods are asynchronous.
You create an IBatch object by using the IDatabase.CreateBatch method, and then run the batch by using the
IBatch.Execute method, as shown in the following example. This code simply sets a string value, increments and
decrements the same counters used in the previous example, and displays the results:

ConnectionMultiplexer redisHostConnection = ...;
IDatabase cache = redisHostConnection.GetDatabase();
...
IBatch batch = cache.CreateBatch();
batch.StringSetAsync("data:key1", 11);
var t1 = batch.StringIncrementAsync("data:counter1");
var t2 = batch.StringDecrementAsync("data:counter2");
batch.Execute();
Console.WriteLine("{0}", t1.Result);
Console.WriteLine("{0}", t2.Result);

It is important to understand that unlike a transaction, if a command in a batch fails because it is malformed, the
other commands might still run. The IBatch.Execute method does not return any indication of success or failure.
Perform fire and forget cache operations
Redis supports fire and forget operations by using command flags. In this situation, the client simply initiates an
operation but has no interest in the result and does not wait for the command to be completed. The example
below shows how to perform the INCR command as a fire and forget operation:

ConnectionMultiplexer redisHostConnection = ...;
IDatabase cache = redisHostConnection.GetDatabase();
...
await cache.StringSetAsync("data:key1", 99);
...
cache.StringIncrement("data:key1", flags: CommandFlags.FireAndForget);

Specify automatically expiring keys


When you store an item in a Redis cache, you can specify a timeout after which the item will be automatically
removed from the cache. You can also query how much more time a key has before it expires by using the TTL
command. This command is available to StackExchange applications by using the IDatabase.KeyTimeToLive
method.
The following code snippet shows how to set an expiration time of 20 seconds on a key, and query the remaining
lifetime of the key:

ConnectionMultiplexer redisHostConnection = ...;
IDatabase cache = redisHostConnection.GetDatabase();
...
// Add a key with an expiration time of 20 seconds
await cache.StringSetAsync("data:key1", 99, TimeSpan.FromSeconds(20));
...
// Query how much time a key has left to live
// If the key has already expired, the KeyTimeToLive function returns a null
TimeSpan? expiry = cache.KeyTimeToLive("data:key1");

You can also set the expiration time to a specific date and time by using the EXPIREAT command, which is
available in the StackExchange library as the KeyExpireAsync method:
ConnectionMultiplexer redisHostConnection = ...;
IDatabase cache = redisHostConnection.GetDatabase();
...
// Add a key with an expiration date of midnight on 1st January 2015
await cache.StringSetAsync("data:key1", 99);
await cache.KeyExpireAsync("data:key1",
    new DateTime(2015, 1, 1, 0, 0, 0, DateTimeKind.Utc));
...

TIP
You can manually remove an item from the cache by using the DEL command, which is available through the StackExchange
library as the IDatabase.KeyDeleteAsync method.

Use tags to cross-correlate cached items


A Redis set is a collection of multiple items that share a single key. You can create a set by using the SADD
command. You can retrieve the items in a set by using the SMEMBERS command. The StackExchange library
implements the SADD command with the IDatabase.SetAddAsync method, and the SMEMBERS command with
the IDatabase.SetMembersAsync method.
You can also combine existing sets to create new sets by using the SDIFF (set difference), SINTER (set intersection),
and SUNION (set union) commands. The StackExchange library unifies these operations in the
IDatabase.SetCombineAsync method. The first parameter to this method specifies the set operation to perform.

The following code snippets show how sets can be useful for quickly storing and retrieving collections of related
items. This code uses the BlogPost type that was described in the section Implement Redis Cache Client
Applications earlier in this article.
A BlogPost object contains four fields—an ID, a title, a ranking score, and a collection of tags. The first code
snippet below shows the sample data that's used for populating a C# list of BlogPost objects:
List<string[]> tags = new List<string[]>
{
    new[] { "iot","csharp" },
    new[] { "iot","azure","csharp" },
    new[] { "csharp","git","big data" },
    new[] { "iot","git","database" },
    new[] { "database","git" },
    new[] { "csharp","database" },
    new[] { "iot" },
    new[] { "iot","database","git" },
    new[] { "azure","database","big data","git","csharp" },
    new[] { "azure" }
};

List<BlogPost> posts = new List<BlogPost>();
int blogKey = 1;
int numberOfPosts = 20;
Random random = new Random();
for (int i = 0; i < numberOfPosts; i++)
{
    blogKey++;
    posts.Add(new BlogPost(
        blogKey,                 // Blog post ID
        string.Format(CultureInfo.InvariantCulture, "Blog Post #{0}",
            blogKey),            // Blog post title
        random.Next(100, 10000), // Ranking score
        tags[i % tags.Count]));  // Tags--assigned from a collection in the tags list
}

You can store the tags for each BlogPost object as a set in a Redis cache and associate each set with the ID of the
BlogPost . This enables an application to quickly find all the tags that belong to a specific blog post. To enable
searching in the opposite direction and find all blog posts that share a specific tag, you can create another set that
holds the blog posts referencing the tag ID in the key:

ConnectionMultiplexer redisHostConnection = ...;
IDatabase cache = redisHostConnection.GetDatabase();
...
// Tags are easily represented as Redis Sets
foreach (BlogPost post in posts)
{
    string redisKey = string.Format(CultureInfo.InvariantCulture,
        "blog:posts:{0}:tags", post.Id);
    // Add tags to the blog post in Redis
    await cache.SetAddAsync(
        redisKey, post.Tags.Select(s => (RedisValue)s).ToArray());

    // Now do the inverse so we can figure out which blog posts have a given tag
    foreach (var tag in post.Tags)
    {
        await cache.SetAddAsync(string.Format(CultureInfo.InvariantCulture,
            "tag:{0}:blog:posts", tag), post.Id);
    }
}

These structures enable you to perform many common queries very efficiently. For example, you can find and
display all of the tags for blog post 1 like this:
// Show the tags for blog post #1
foreach (var value in await cache.SetMembersAsync("blog:posts:1:tags"))
{
    Console.WriteLine(value);
}

You can find all tags that are common to blog post 1 and blog post 2 by performing a set intersection operation,
as follows:

// Show the tags in common for blog posts #1 and #2
foreach (var value in await cache.SetCombineAsync(SetOperation.Intersect,
    new RedisKey[] { "blog:posts:1:tags", "blog:posts:2:tags" }))
{
    Console.WriteLine(value);
}

And you can find all blog posts that contain a specific tag:

// Show the ids of the blog posts that have the tag "iot".
foreach (var value in await cache.SetMembersAsync("tag:iot:blog:posts"))
{
    Console.WriteLine(value);
}

Find recently accessed items


A common task required of many applications is to find the most recently accessed items. For example, a
blogging site might want to display information about the most recently read blog posts.
You can implement this functionality by using a Redis list. A Redis list contains multiple items that share the same
key. The list acts as a double-ended queue. You can push items to either end of the list by using the LPUSH (left
push) and RPUSH (right push) commands. You can retrieve items from either end of the list by using the LPOP
and RPOP commands. You can also return a range of elements by using the LRANGE command.
The code snippets below show how you can perform these operations by using the StackExchange library. This
code uses the BlogPost type from the previous examples. As a blog post is read by a user, the
IDatabase.ListLeftPushAsync method pushes the title of the blog post onto a list that's associated with the key
"blog:recent_posts" in the Redis cache.

ConnectionMultiplexer redisHostConnection = ...;
IDatabase cache = redisHostConnection.GetDatabase();
...
string redisKey = "blog:recent_posts";
BlogPost blogPost = ...; // Reference to the blog post that has just been read
await cache.ListLeftPushAsync(
    redisKey, blogPost.Title); // Push the blog post onto the list

As more blog posts are read, their titles are pushed onto the same list. The list is ordered by the sequence in
which the titles have been added. The most recently read blog posts are toward the left end of the list. (If the same
blog post is read more than once, it will have multiple entries in the list.)
You can display the titles of the most recently read posts by using the IDatabase.ListRange method. This method
takes the key that contains the list, a starting point, and an ending point. The following code retrieves the titles of
the 10 blog posts (items from 0 to 9) at the left-most end of the list:
// Show latest ten posts
foreach (string postTitle in await cache.ListRangeAsync(redisKey, 0, 9))
{
    Console.WriteLine(postTitle);
}

Note that the ListRangeAsync method does not remove items from the list. To do this, you can use the
IDatabase.ListLeftPopAsync and IDatabase.ListRightPopAsync methods.
To prevent the list from growing indefinitely, you can periodically cull items by trimming the list. The code snippet
below shows you how to remove all but the five left-most items from the list:

await cache.ListTrimAsync(redisKey, 0, 4); // Keep elements 0 through 4 (the five most recently pushed titles)

Implement a leader board


By default, the items in a set are not held in any specific order. You can create an ordered set by using the ZADD
command (the IDatabase.SortedSetAdd method in the StackExchange library). The items are ordered by using a
numeric value called a score, which is provided as a parameter to the command.
The following code snippet adds the title of a blog post to an ordered list. In this example, each blog post also has
a score field that contains the ranking of the blog post.

ConnectionMultiplexer redisHostConnection = ...;
IDatabase cache = redisHostConnection.GetDatabase();
...
string redisKey = "blog:post_rankings";
BlogPost blogPost = ...; // Reference to a blog post that has just been rated
await cache.SortedSetAddAsync(redisKey, blogPost.Title, blogPost.Score);

You can retrieve the blog post titles and scores in ascending score order by using the
IDatabase.SortedSetRangeByRankWithScores method:

foreach (var post in await cache.SortedSetRangeByRankWithScoresAsync(redisKey))
{
    Console.WriteLine(post);
}

NOTE
The StackExchange library also provides the IDatabase.SortedSetRangeByRankAsync method, which returns the data in
score order, but does not return the scores.

You can also retrieve items in descending order of scores, and limit the number of items that are returned by
providing additional parameters to the IDatabase.SortedSetRangeByRankWithScoresAsync method. The next example
displays the titles and scores of the top 10 ranked blog posts:

foreach (var post in await cache.SortedSetRangeByRankWithScoresAsync(
    redisKey, 0, 9, Order.Descending))
{
    Console.WriteLine(post);
}

The next example uses the IDatabase.SortedSetRangeByScoreWithScoresAsync method, which you can use to limit
the items that are returned to those that fall within a given score range:

// Blog posts with scores between 5000 and 100000
foreach (var post in await cache.SortedSetRangeByScoreWithScoresAsync(
    redisKey, 5000, 100000))
{
    Console.WriteLine(post);
}

Message by using channels


Apart from acting as a data cache, a Redis server provides messaging through a high-performance
publisher/subscriber mechanism. Client applications can subscribe to a channel, and other applications or
services can publish messages to the channel. Subscribing applications will then receive these messages and can
process them.
Redis provides the SUBSCRIBE command for client applications to use to subscribe to channels. This command
expects the name of one or more channels on which the application will accept messages. The StackExchange
library includes the ISubscription interface, which enables a .NET Framework application to subscribe and
publish to channels.
You create an ISubscription object by using the GetSubscriber method of the connection to the Redis server.
Then you listen for messages on a channel by using the SubscribeAsync method of this object. The following code
example shows how to subscribe to a channel named "messages:blogPosts":

ConnectionMultiplexer redisHostConnection = ...;
ISubscriber subscriber = redisHostConnection.GetSubscriber();
...
await subscriber.SubscribeAsync("messages:blogPosts",
    (channel, message) => Console.WriteLine("Title is: {0}", message));

The first parameter to the Subscribe method is the name of the channel. This name follows the same
conventions that are used by keys in the cache. The name can contain any binary data, although it is advisable to
use relatively short, meaningful strings to help ensure good performance and maintainability.
Note also that the namespace used by channels is separate from that used by keys. This means you can have
channels and keys that have the same name, although this may make your application code more difficult to
maintain.
The second parameter is an Action delegate. This delegate runs asynchronously whenever a new message
appears on the channel. This example simply displays the message on the console (the message will contain the
title of a blog post).
To publish to a channel, an application can use the Redis PUBLISH command. The StackExchange library provides
the ISubscriber.PublishAsync method to perform this operation. The next code snippet shows how to publish a
message to the "messages:blogPosts" channel:

ConnectionMultiplexer redisHostConnection = ...;
ISubscriber subscriber = redisHostConnection.GetSubscriber();
...
BlogPost blogPost = ...;
await subscriber.PublishAsync("messages:blogPosts", blogPost.Title);

There are several points you should understand about the publish/subscribe mechanism:
Multiple subscribers can subscribe to the same channel, and they will all receive the messages that are
published to that channel.
Subscribers only receive messages that have been published after they have subscribed. Channels are not
buffered, and once a message is published, the Redis infrastructure pushes the message to each subscriber
and then removes it.
By default, messages are received by subscribers in the order in which they are sent. In a highly active system
with a large number of messages and many subscribers and publishers, guaranteed sequential delivery of
messages can slow performance of the system. If each message is independent and the order is unimportant,
you can enable concurrent processing by the Redis system, which can help to improve responsiveness. You can
achieve this in a StackExchange client by setting the PreserveAsyncOrder of the connection used by the
subscriber to false:

ConnectionMultiplexer redisHostConnection = ...;
redisHostConnection.PreserveAsyncOrder = false;
ISubscriber subscriber = redisHostConnection.GetSubscriber();

Serialization considerations
When you choose a serialization format, consider tradeoffs between performance, interoperability, versioning,
compatibility with existing systems, data compression, and memory overhead. When you are evaluating
performance, remember that benchmarks are highly dependent on context. They may not reflect your actual
workload, and may not consider newer libraries or versions. There is no single "fastest" serializer for all scenarios.
Some options to consider include:
Protocol Buffers (also called protobuf) is a serialization format developed by Google for serializing
structured data efficiently. It uses strongly typed definition files to define message structures. These
definition files are then compiled to language-specific code for serializing and deserializing messages.
Protobuf can be used over existing RPC mechanisms, or it can generate an RPC service.
Apache Thrift uses a similar approach, with strongly typed definition files and a compilation step to
generate the serialization code and RPC services.
Apache Avro provides similar functionality to Protocol Buffers and Thrift, but there is no compilation step.
Instead, serialized data always includes a schema that describes the structure.
JSON is an open standard that uses human-readable text fields. It has broad cross-platform support. JSON
does not use message schemas. Being a text-based format, it is not very efficient over the wire. In some
cases, however, you may be returning cached items directly to a client via HTTP, in which case storing JSON
could save the cost of deserializing from another format and then serializing to JSON.
BSON is a binary serialization format that uses a structure similar to JSON. BSON was designed to be
lightweight, easy to scan, and fast to serialize and deserialize, relative to JSON. Payloads are comparable in
size to JSON. Depending on the data, a BSON payload may be smaller or larger than a JSON payload.
BSON has some additional data types that are not available in JSON, notably BinData (for byte arrays) and
Date.
MessagePack is a binary serialization format that is designed to be compact for transmission over the wire.
There are no message schemas or message type checking.
Bond is a cross-platform framework for working with schematized data. It supports cross-language
serialization and deserialization. Notable differences from other systems listed here are support for
inheritance, type aliases, and generics.
gRPC is an open-source RPC system developed by Google. By default, it uses Protocol Buffers as its
definition language and underlying message interchange format.
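As one concrete illustration of these trade-offs, the extension methods below store cached objects as UTF-8 JSON by using System.Text.Json instead of the BinaryFormatter shown earlier. This is a sketch rather than a recommendation, and it assumes the cached types are JSON-serializable:

// Requires: using System.Text.Json; using StackExchange.Redis;
public static class RedisJsonCacheExtensions
{
    public static async Task<T> GetJsonAsync<T>(this IDatabase cache, string key)
    {
        RedisValue data = await cache.StringGetAsync(key);
        return data.IsNullOrEmpty ? default(T) : JsonSerializer.Deserialize<T>((byte[])data);
    }

    public static Task SetJsonAsync<T>(this IDatabase cache, string key, T value)
    {
        return cache.StringSetAsync(key, JsonSerializer.SerializeToUtf8Bytes(value));
    }
}

Because the payload is plain JSON, it can also be returned directly to HTTP clients or read by non-.NET consumers without a further conversion step.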

Related patterns and guidance


The following patterns might also be relevant to your scenario when you implement caching in your applications:
Cache-aside pattern: This pattern describes how to load data on demand into a cache from a data store.
This pattern also helps to maintain consistency between data that's held in the cache and the data in the
original data store.
The Sharding pattern provides information about implementing horizontal partitioning to help improve
scalability when storing and accessing large volumes of data.

More information
Azure Cache for Redis documentation
Azure Cache for Redis FAQ
Task-based Asynchronous pattern
Redis documentation
StackExchange.Redis
Data partitioning guide
Best practices for using content delivery networks
(CDNs)

A content delivery network (CDN) is a distributed network of servers that can efficiently deliver web content to
users. CDNs store cached content on edge servers that are close to end users to minimize latency.
CDNs are typically used to deliver static content such as images, style sheets, documents, client-side scripts, and
HTML pages. The major advantages of using a CDN are lower latency and faster delivery of content to users,
regardless of their geographical location in relation to the datacenter where the application is hosted. CDNs can
also help to reduce load on a web application, because the application does not have to service requests for the
content that is hosted in the CDN.

In Azure, the Azure Content Delivery Network is a global CDN solution for delivering high-bandwidth content that
is hosted in Azure or any other location. Using Azure CDN, you can cache publicly available objects loaded from
Azure blob storage, a web application, a virtual machine, or any publicly accessible web server.
This topic describes some general best practices and considerations when using a CDN. For more information, see
Azure CDN.

How and why a CDN is used


Typical uses for a CDN include:
Delivering static resources for client applications, often from a website. These resources can be images,
style sheets, documents, files, client-side scripts, HTML pages, HTML fragments, or any other content that
the server does not need to modify for each request. The application can create items at runtime and make
them available to the CDN (for example, by creating a list of current news headlines), but it does not do so
for each request.
Delivering public static and shared content to devices such as mobile phones and tablet computers. The
application itself is a web service that offers an API to clients running on the various devices. The CDN can
also deliver static datasets (via the web service) for the clients to use, perhaps to generate the client UI. For
example, the CDN could be used to distribute JSON or XML documents.
Serving entire websites that consist of only public static content to clients, without requiring any dedicated
compute resources.
Streaming video files to the client on demand. Video benefits from the low latency and reliable connectivity
available from the globally located datacenters that offer CDN connections. Microsoft Azure Media Services
(AMS) integrates with Azure CDN to deliver content directly to the CDN for further distribution. For more
information, see Streaming endpoints overview.
Generally improving the experience for users, especially those located far from the datacenter hosting the
application. These users might otherwise suffer higher latency. A large proportion of the total size of the
content in a web application is often static, and using the CDN can help to maintain performance and
overall user experience while eliminating the requirement to deploy the application to multiple datacenters.
For a list of Azure CDN node locations, see Azure CDN POP Locations.
Supporting IoT (Internet of Things) solutions. The huge numbers of devices and appliances involved in an
IoT solution could easily overwhelm an application if it had to distribute firmware updates directly to each
device.
Coping with peaks and surges in demand without requiring the application to scale, avoiding the
consequent increase in running costs. For example, when an update to an operating system is released for
a hardware device such as a specific model of router, or for a consumer device such as a smart TV, there will
be a huge peak in demand as it is downloaded by millions of users and devices over a short period.

Challenges
There are several challenges to take into account when planning to use a CDN.
Deployment . Decide the origin from which the CDN fetches the content, and whether you need to deploy
the content in more than one storage system. Take into account the process for deploying static content
and resources. For example, you may need to implement a separate step to load content into Azure blob
storage.
Versioning and cache-control . Consider how you will update static content and deploy new versions.
Understand how the CDN performs caching and time-to-live (TTL). For Azure CDN, see How caching works.
Testing . It can be difficult to perform local testing of your CDN settings when developing and testing an
application locally or in a staging environment.
Search engine optimization (SEO) . Content such as images and documents are served from a different
domain when you use the CDN. This can have an effect on SEO for this content.
Content security . Not all CDNs offer any form of access control for the content. Some CDN services,
including Azure CDN, support token-based authentication to protect CDN content. For more information,
see Securing Azure Content Delivery Network assets with token authentication.
Client security . Clients might connect from an environment that does not allow access to resources on
the CDN. This could be a security-constrained environment that limits access to only a set of known
sources, or one that prevents loading of resources from anything other than the page origin. A fallback
implementation is required to handle these cases.
Resilience . The CDN is a potential single point of failure for an application.
Scenarios where a CDN may be less useful include:
If the content has a low hit rate, it might be accessed only a few times while it is valid (determined by its
time-to-live setting).
If the data is private, such as for large enterprises or supply chain ecosystems.

General guidelines and good practices


Using a CDN is a good way to minimize the load on your application, and maximize availability and performance.
Consider adopting this strategy for all of the appropriate content and resources your application uses. Consider
the points in the following sections when designing your strategy to use a CDN.
Deployment
Static content may need to be provisioned and deployed independently from the application if you do not include
it in the application deployment package or process. Consider how this will affect the versioning approach you use
to manage both the application components and the static resource content.
Consider using bundling and minification techniques to reduce load times for clients. Bundling combines multiple
files into a single file. Minification removes unnecessary characters from scripts and CSS files without altering
functionality.
If you need to deploy the content to an additional location, this will be an extra step in the deployment process. If
the application updates the content for the CDN, perhaps at regular intervals or in response to an event, it must
store the updated content in any additional locations as well as the endpoint for the CDN.
Consider how you will handle local development and testing when some static content is expected to be served
from a CDN. For example, you could predeploy the content to the CDN as part of your build script. Alternatively,
use compile directives or flags to control how the application loads the resources. For example, in debug mode,
the application could load static resources from a local folder. In release mode, the application would use the CDN.
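For example, a small helper along the following lines could switch between local content and the CDN. The endpoint name is a placeholder, and the compile-directive approach is only one of several options:

public static class ContentUrl
{
#if DEBUG
    // In debug builds, serve static resources from the local application folder
    private const string BaseUrl = "/";
#else
    // In release builds, serve static resources from the CDN endpoint
    private const string BaseUrl = "https://<your endpoint>.azureedge.net/";
#endif

    public static string Resolve(string relativePath) => BaseUrl + relativePath.TrimStart('/');
}

Pages would then reference resources through ContentUrl.Resolve("css/site.css") rather than hard-coding either location.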
Consider the options for file compression, such as gzip (GNU zip). Compression may be performed on the origin
server by the web application hosting or directly on the edge servers by the CDN. For more information, see
Improve performance by compressing files in Azure CDN.
Routing and versioning
You may need to use different CDN instances at various times. For example, when you deploy a new version of the
application you may want to use a new CDN and retain the old CDN (holding content in an older format) for
previous versions. If you use Azure blob storage as the content origin, you can create a separate storage account
or a separate container and point the CDN endpoint to it.
Do not use the query string to denote different versions of the application in links to resources on the CDN
because, when retrieving content from Azure blob storage, the query string is part of the resource name (the blob
name). This approach can also affect how the client caches resources.
Deploying new versions of static content when you update an application can be a challenge if the previous
resources are cached on the CDN. For more information, see the section on cache control, below.
Consider restricting the CDN content access by country/region. Azure CDN allows you to filter requests based on
the country or region of origin and restrict the content delivered. For more information, see Restrict access to your
content by country/region.
Cache control
Consider how to manage caching within the system. For example, in Azure CDN, you can set global caching rules,
and then set custom caching for particular origin endpoints. You can also control how caching is performed in a
CDN by sending cache-directive headers at the origin.
For more information, see How caching works.
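For example, if the origin is an ASP.NET Core application, cache-directive headers can be emitted when static files are served. The seven-day lifetime below is an arbitrary example and should match your versioning strategy:

app.UseStaticFiles(new StaticFileOptions
{
    OnPrepareResponse = ctx =>
    {
        // Tell the CDN (and browsers) how long the content may be cached
        ctx.Context.Response.Headers["Cache-Control"] = "public,max-age=604800";
    }
});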
To prevent objects from being available on the CDN, you can delete them from the origin, remove or delete the
CDN endpoint, or in the case of blob storage, make the container or blob private. However, items are not removed
from the CDN until the time-to-live expires. You can also manually purge a CDN endpoint.
Security
The CDN can deliver content over HTTPS (SSL), by using the certificate provided by the CDN, as well as over
standard HTTP. To avoid browser warnings about mixed content, you might need to use HTTPS to request static
content that is displayed in pages loaded through HTTPS.
If you deliver static assets such as font files by using the CDN, you might encounter same-origin policy issues if
you use an XMLHttpRequest call to request these resources from a different domain. Many web browsers prevent
cross-origin resource sharing (CORS) unless the web server is configured to set the appropriate response
headers. You can configure the CDN to support CORS by using one of the following methods:
Configure the CDN to add CORS headers to the responses. For more information, see Using Azure CDN
with CORS.
If the origin is Azure blob storage, add CORS rules to the storage endpoint. For more information, see
Cross-Origin Resource Sharing (CORS) Support for the Azure Storage Services.
Configure the application to set the CORS headers. For example, see Enabling Cross-Origin Requests
(CORS) in the ASP.NET Core documentation.
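As a sketch of the last option, an ASP.NET Core origin application might register a CORS policy as follows; the site origin shown is a placeholder:

builder.Services.AddCors(options =>
{
    options.AddPolicy("AllowCdnClients", policy =>
        policy.WithOrigins("https://www.contoso.com") // the domain that pages are served from
              .WithMethods("GET", "HEAD"));
});
...
app.UseCors("AllowCdnClients");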
CDN fallback
Consider how your application will cope with a failure or temporary unavailability of the CDN. Client applications
may be able to use copies of the resources that were cached locally (on the client) during previous requests, or
you can include code that detects failure and instead requests resources from the origin (the application folder or
Azure blob container that holds the resources) if the CDN is unavailable.
Horizontal, vertical, and functional data partitioning

In many large-scale solutions, data is divided into partitions that can be managed and accessed separately.
Partitioning can improve scalability, reduce contention, and optimize performance. It can also provide a
mechanism for dividing data by usage pattern. For example, you can archive older data in cheaper data storage.
However, the partitioning strategy must be chosen carefully to maximize the benefits while minimizing adverse
effects.

NOTE
In this article, the term partitioning means the process of physically dividing data into separate data stores. It is not the
same as SQL Server table partitioning.

Why partition data?


Improve scalability. When you scale up a single database system, it will eventually reach a physical
hardware limit. If you divide data across multiple partitions, each hosted on a separate server, you can
scale out the system almost indefinitely.
Improve performance. Data access operations on each partition take place over a smaller volume of
data. Correctly done, partitioning can make your system more efficient. Operations that affect more than
one partition can run in parallel.
Improve security. In some cases, you can separate sensitive and nonsensitive data into different
partitions and apply different security controls to the sensitive data.
Provide operational flexibility. Partitioning offers many opportunities for fine-tuning operations,
maximizing administrative efficiency, and minimizing cost. For example, you can define different strategies
for management, monitoring, backup and restore, and other administrative tasks based on the
importance of the data in each partition.
Match the data store to the pattern of use. Partitioning allows each partition to be deployed on a
different type of data store, based on cost and the built-in features that data store offers. For example,
large binary data can be stored in blob storage, while more structured data can be held in a document
database. See Choose the right data store.
Improve availability. Separating data across multiple servers avoids a single point of failure. If one
instance fails, only the data in that partition is unavailable. Operations on other partitions can continue.
For managed PaaS data stores, this consideration is less relevant, because these services are designed
with built-in redundancy.

Designing partitions
There are three typical strategies for partitioning data:
Horizontal partitioning (often called sharding). In this strategy, each partition is a separate data store,
but all partitions have the same schema. Each partition is known as a shard and holds a specific subset of
the data, such as all the orders for a specific set of customers.
Vertical partitioning. In this strategy, each partition holds a subset of the fields for items in the data
store. The fields are divided according to their pattern of use. For example, frequently accessed fields
might be placed in one vertical partition and less frequently accessed fields in another.
Functional partitioning. In this strategy, data is aggregated according to how it is used by each
bounded context in the system. For example, an e-commerce system might store invoice data in one
partition and product inventory data in another.
These strategies can be combined, and we recommend that you consider them all when you design a
partitioning scheme. For example, you might divide data into shards and then use vertical partitioning to further
subdivide the data in each shard.
Horizontal partitioning (sharding)
Figure 1 shows horizontal partitioning or sharding. In this example, product inventory data is divided into shards
based on the product key. Each shard holds the data for a contiguous range of shard keys (A-G and H-Z),
organized alphabetically. Sharding spreads the load over more computers, which reduces contention and
improves performance.

Figure 1 - Horizontally partitioning (sharding) data based on a partition key.


The most important factor is the choice of a sharding key. It can be difficult to change the key after the system is
in operation. The key must ensure that data is partitioned to spread the workload as evenly as possible across
the shards.
The shards don't have to be the same size. It's more important to balance the number of requests. Some shards
might be very large, but each item has a low number of access operations. Other shards might be smaller, but
each item is accessed much more frequently. It's also important to ensure that a single shard does not exceed the
scale limits (in terms of capacity and processing resources) of the data store.
Avoid creating "hot" partitions that can affect performance and availability. For example, using the first letter of a
customer's name causes an unbalanced distribution, because some letters are more common. Instead, use a hash
of a customer identifier to distribute data more evenly across partitions.
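A minimal sketch of this technique (plain Python; the shard count is hypothetical) hashes a stable customer identifier to choose a shard, so the distribution does not depend on skewed natural values such as names:

```python
import hashlib

SHARD_COUNT = 8  # hypothetical number of shards

def shard_for_customer(customer_id: str) -> int:
    """Map a customer ID to a shard index by hashing it."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SHARD_COUNT

# Customers whose names start with the same letter still spread across shards:
# shard_for_customer("customer-1001") and shard_for_customer("customer-1002")
# usually return different values.
```

Note that a simple modulo scheme like this requires rehashing data if the number of shards changes, which is one reason to choose a key and scheme that minimize future splits and merges.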
Choose a sharding key that minimizes any future requirements to split large shards, coalesce small shards into
larger partitions, or change the schema. These operations can be very time consuming, and might require taking
one or more shards offline while they are performed.
If shards are replicated, it might be possible to keep some of the replicas online while others are split, merged, or
reconfigured. However, the system might need to limit the operations that can be performed during the
reconfiguration. For example, the data in the replicas might be marked as read-only to prevent data
inconsistencies.
For more information about horizontal partitioning, see sharding pattern.
Vertical partitioning
The most common use for vertical partitioning is to reduce the I/O and performance costs associated with
fetching items that are frequently accessed. Figure 2 shows an example of vertical partitioning. In this example,
different properties of an item are stored in different partitions. One partition holds data that is accessed more
frequently, including product name, description, and price. Another partition holds inventory data: the stock
count and last-ordered date.

Figure 2 - Vertically partitioning data by its pattern of use.


In this example, the application regularly queries the product name, description, and price when displaying the
product details to customers. Stock count and last-ordered date are held in a separate partition because these
two items are commonly used together.
Other advantages of vertical partitioning:
Relatively slow-moving data (product name, description, and price) can be separated from the more
dynamic data (stock level and last-ordered date). Slow-moving data is a good candidate for an application
to cache in memory.
Sensitive data can be stored in a separate partition with additional security controls.
Vertical partitioning can reduce the amount of concurrent access that's needed.
Vertical partitioning operates at the entity level within a data store, partially normalizing an entity to break it
down from a wide item to a set of narrow items. It is ideally suited for column-oriented data stores such as
HBase and Cassandra. If the data in a collection of columns is unlikely to change, you can also consider using
column stores in SQL Server.
Functional partitioning
When it's possible to identify a bounded context for each distinct business area in an application, functional
partitioning is a way to improve isolation and data access performance. Another common use for functional
partitioning is to separate read-write data from read-only data. Figure 3 shows an overview of functional
partitioning where inventory data is separated from customer data.
Figure 3 - Functionally partitioning data by bounded context or subdomain.
This partitioning strategy can help reduce data access contention across different parts of a system.

Designing partitions for scalability


It's vital to consider size and workload for each partition and balance them so that data is distributed to achieve
maximum scalability. However, you must also partition the data so that it does not exceed the scaling limits of a
single partition store.
Follow these steps when designing partitions for scalability:
1. Analyze the application to understand the data access patterns, such as the size of the result set returned by
each query, the frequency of access, the inherent latency, and the server-side compute processing
requirements. In many cases, a few major entities will demand most of the processing resources.
2. Use this analysis to determine the current and future scalability targets, such as data size and workload. Then
distribute the data across the partitions to meet the scalability target. For horizontal partitioning, choosing the
right shard key is important to make sure distribution is even. For more information, see the sharding pattern.
3. Make sure each partition has enough resources to handle the scalability requirements, in terms of data size
and throughput. Depending on the data store, there might be a limit on the amount of storage space,
processing power, or network bandwidth per partition. If the requirements are likely to exceed these limits,
you may need to refine your partitioning strategy or split data out further, possibly combining two or more
strategies.
4. Monitor the system to verify that data is distributed as expected and that the partitions can handle the load.
Actual usage does not always match what an analysis predicts. If it doesn't, it might be possible to rebalance the
partitions, or else redesign some parts of the system to gain the required balance.
Some cloud environments allocate resources in terms of infrastructure boundaries. Ensure that the limits of your
selected boundary provide enough room for any anticipated growth in the volume of data, in terms of data
storage, processing power, and bandwidth.
For example, if you use Azure table storage, there is a limit to the volume of requests that can be handled by a
single partition in a particular period of time. (For more information, see Azure storage scalability and
performance targets.) A busy shard might require more resources than a single partition can handle. If so, the
shard might need to be repartitioned to spread the load. If the total size or throughput of these tables exceeds
the capacity of a storage account, you might need to create additional storage accounts and spread the tables
across these accounts.
Designing partitions for query performance
Query performance can often be boosted by using smaller data sets and by running parallel queries. Each
partition should contain a small proportion of the entire data set. This reduction in volume can improve the
performance of queries. However, partitioning is not a substitute for designing and configuring a database
appropriately. For example, make sure that you have the necessary indexes in place.
Follow these steps when designing partitions for query performance:
1. Examine the application requirements and performance:
Use business requirements to determine the critical queries that must always perform quickly.
Monitor the system to identify any queries that perform slowly.
Find which queries are performed most frequently. Even if a single query has a minimal cost, the
cumulative resource consumption could be significant.
2. Partition the data that is causing slow performance:
Limit the size of each partition so that the query response time is within target.
If you use horizontal partitioning, design the shard key so that the application can easily select the right
partition. This prevents the query from having to scan through every partition.
Consider the location of a partition. If possible, try to keep data in partitions that are geographically
close to the applications and users that access it.
3. If an entity has throughput and query performance requirements, use functional partitioning based on
that entity. If this still doesn't satisfy the requirements, apply horizontal partitioning as well. In most cases,
a single partitioning strategy will suffice, but in some cases it is more efficient to combine both strategies.
4. Consider running queries in parallel across partitions to improve performance.

Designing partitions for availability


Partitioning data can improve the availability of applications by ensuring that the entire dataset does not
constitute a single point of failure and that individual subsets of the dataset can be managed independently.
Consider the following factors that affect availability:
How critical the data is to business operations. Identify which data is critical business information, such as
transactions, and which data is less critical operational data, such as log files.
Consider storing critical data in highly available partitions with an appropriate backup plan.
Establish separate management and monitoring procedures for the different datasets.
Place data that has the same level of criticality in the same partition so that it can be backed up together at
an appropriate frequency. For example, partitions that hold transaction data might need to be backed up
more frequently than partitions that hold logging or trace information.
How individual partitions can be managed. Designing partitions to support independent management and
maintenance provides several advantages. For example:
If a partition fails, it can be recovered independently without affecting applications that access data in other
partitions.
Partitioning data by geographical area allows scheduled maintenance tasks to occur at off-peak hours for
each location. Ensure that partitions are not so large that planned maintenance cannot be completed
during this period.
Whether to replicate critical data across partitions. This strategy can improve availability and
performance, but can also introduce consistency issues. It takes time to synchronize changes with every replica.
During this period, different partitions will contain different data values.

Application design considerations


Partitioning adds complexity to the design and development of your system. Consider partitioning as a
fundamental part of system design even if the system initially only contains a single partition. If you address
partitioning as an afterthought, it will be more challenging because you already have a live system to maintain:
Data access logic will need to be modified.
Large quantities of existing data may need to be migrated, to distribute it across partitions.
Users expect to be able to continue using the system during the migration.
In some cases, partitioning is not considered important because the initial dataset is small and can be easily
handled by a single server. This might be true for some workloads, but many commercial systems need to
expand as the number of users increases.
Moreover, it's not only large data stores that benefit from partitioning. For example, a small data store might be
heavily accessed by hundreds of concurrent clients. Partitioning the data in this situation can help to reduce
contention and improve throughput.
Consider the following points when you design a data partitioning scheme:
Minimize cross-partition data access operations. Where possible, keep data for the most common
database operations together in each partition to minimize cross-partition data access operations. Querying
across partitions can be more time-consuming than querying within a single partition, but optimizing partitions
for one set of queries might adversely affect other sets of queries. If you must query across partitions, minimize
query time by running parallel queries and aggregating the results within the application. (This approach might
not be possible in some cases, such as when the result from one query is used in the next query.)
Consider replicating static reference data. If queries use relatively static reference data, such as postal code
tables or product lists, consider replicating this data in all of the partitions to reduce separate lookup operations
in different partitions. This approach can also reduce the likelihood of the reference data becoming a "hot"
dataset, with heavy traffic from across the entire system. However, there is an additional cost associated with
synchronizing any changes to the reference data.
Minimize cross-partition joins. Where possible, minimize requirements for referential integrity across
vertical and functional partitions. In these schemes, the application is responsible for maintaining referential
integrity across partitions. Queries that join data across multiple partitions are inefficient because the application
typically needs to perform consecutive queries based on a key and then a foreign key. Instead, consider
replicating or de-normalizing the relevant data. If cross-partition joins are necessary, run parallel queries over
the partitions and join the data within the application.
Embrace eventual consistency. Evaluate whether strong consistency is actually a requirement. A common
approach in distributed systems is to implement eventual consistency. The data in each partition is updated
separately, and the application logic ensures that the updates are all completed successfully. It also handles the
inconsistencies that can arise from querying data while an eventually consistent operation is running.
Consider how queries locate the correct partition. If a query must scan all partitions to locate the required
data, there is a significant impact on performance, even when multiple parallel queries are running. With vertical
and functional partitioning, queries can naturally specify the partition. Horizontal partitioning, on the other hand,
can make locating an item difficult, because every shard has the same schema. A typical solution is to maintain a
map that is used to look up the shard location for specific items. This map can be implemented in the sharding
logic of the application, or maintained by the data store if it supports transparent sharding.
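A minimal sketch of such a shard map (plain Python; the key ranges and shard names are hypothetical) that routes a shard key to the shard holding it:

```python
import bisect

# Hypothetical range shard map: (upper bound of the key range, shard name).
# Keys starting with A-G map to shard 1; H-Z map to shard 2.
SHARD_MAP = [("G~", "sqldb-shard-1"), ("Z~", "sqldb-shard-2")]
UPPER_BOUNDS = [upper for upper, _ in SHARD_MAP]

def locate_shard(shard_key: str) -> str:
    """Return the shard that holds the given key, so a query can target a
    single shard instead of scanning every partition."""
    index = bisect.bisect_left(UPPER_BOUNDS, shard_key.upper())
    if index == len(SHARD_MAP):
        raise KeyError(f"No shard covers key {shard_key!r}")
    return SHARD_MAP[index][1]

# locate_shard("Fabrikam") -> "sqldb-shard-1"
# locate_shard("Northwind") -> "sqldb-shard-2"
```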
Consider periodically rebalancing shards. With horizontal partitioning, rebalancing shards can help
distribute the data evenly by size and by workload to minimize hotspots, maximize query performance, and work
around physical storage limitations. However, this is a complex task that often requires the use of a custom tool
or process.
Replicate partitions. If you replicate each partition, it provides additional protection against failure. If a single
replica fails, queries can be directed toward a working copy.
If you reach the physical limits of a partitioning strategy, you might need to extend the scalability
to a different level. For example, if partitioning is at the database level, you might need to locate or replicate
partitions in multiple databases. If partitioning is already at the database level, and physical limitations are an
issue, it might mean that you need to locate or replicate partitions in multiple hosting accounts.
Avoid transactions that access data in multiple par titions . Some data stores implement transactional
consistency and integrity for operations that modify data, but only when the data is located in a single partition.
If you need transactional support across multiple partitions, you will probably need to implement this as part of
your application logic because most partitioning systems do not provide native support.
All data stores require some operational management and monitoring activity. These tasks include loading
data, backing up and restoring data, reorganizing data, and ensuring that the system is performing correctly and
efficiently.
Consider the following factors that affect operational management:
How to implement appropriate management and operational tasks when the data is
partitioned. These tasks might include backup and restore, archiving data, monitoring the system, and
other administrative tasks. For example, maintaining logical consistency during backup and restore
operations can be a challenge.
How to load the data into multiple partitions and add new data that's arriving from other
sources. Some tools and utilities might not support sharded data operations such as loading data into
the correct partition.
How to archive and delete the data on a regular basis. To prevent the excessive growth of
partitions, you need to archive and delete data on a regular basis (such as monthly). It might be necessary
to transform the data to match a different archive schema.
How to locate data integrity issues. Consider running a periodic process to locate any data integrity
issues, such as data in one partition that references missing information in another. The process can either
attempt to fix these issues automatically or generate a report for manual review.

Rebalancing partitions
As a system matures, you might have to adjust the partitioning scheme. For example, individual partitions might
start getting a disproportionate volume of traffic and become hot, leading to excessive contention. Or you might
have underestimated the volume of data in some partitions, causing some partitions to approach capacity limits.
Some data stores, such as Cosmos DB, can automatically rebalance partitions. In other cases, rebalancing is an
administrative task that consists of two stages:
1. Determine a new partitioning strategy.
Which partitions need to be split (or possibly combined)?
What is the new partition key?
2. Migrate data from the old partitioning scheme to the new set of partitions.
Depending on the data store, you might be able to migrate data between partitions while they are in use. This is
called online migration. If that's not possible, you might need to make partitions unavailable while the data is
relocated (offline migration).
Offline migration
Offline migration is typically simpler because it reduces the chances of contention occurring. Conceptually, offline
migration works as follows:
1. Mark the partition offline.
2. Split-merge and move the data to the new partitions.
3. Verify the data.
4. Bring the new partitions online.
5. Remove the old partition.
Optionally, you can mark a partition as read-only in step 1, so that applications can still read the data while it is
being moved.

Online migration
Online migration is more complex to perform but less disruptive. The process is similar to offline migration,
except the original partition is not marked offline. Depending on the granularity of the migration process (for
example, item by item versus shard by shard), the data access code in the client applications might have to
handle reading and writing data that's held in two locations, the original partition and the new partition.

Related patterns
The following design patterns might be relevant to your scenario:
The sharding pattern describes some common strategies for sharding data.
The index table pattern shows how to create secondary indexes over data. An application can quickly
retrieve data with this approach, by using queries that do not reference the primary key of a collection.
The materialized view pattern describes how to generate prepopulated views that summarize data to
support fast query operations. This approach can be useful in a partitioned data store if the partitions that
contain the data being summarized are distributed across multiple sites.

Next steps
Learn about partitioning strategies for specific Azure services. See Data partitioning strategies
Data partitioning strategies

This article describes some strategies for partitioning data in various Azure data stores. For general guidance about
when to partition data and best practices, see Data partitioning.

Partitioning Azure SQL Database


A single SQL database has a limit to the volume of data that it can contain. Throughput is constrained by
architectural factors and the number of concurrent connections that it supports.
Elastic pools support horizontal scaling for a SQL database. Using elastic pools, you can partition your data into
shards that are spread across multiple SQL databases. You can also add or remove shards as the volume of data
that you need to handle grows and shrinks. Elastic pools can also help reduce contention by distributing the load
across databases.
Each shard is implemented as a SQL database. A shard can hold more than one dataset (called a shardlet). Each
database maintains metadata that describes the shardlets that it contains. A shardlet can be a single data item, or a
group of items that share the same shardlet key. For example, in a multitenant application, the shardlet key can be
the tenant ID, and all data for a tenant can be held in the same shardlet.
Client applications are responsible for associating a dataset with a shardlet key. A separate SQL database acts as a
global shard map manager. This database has a list of all the shards and shardlets in the system. The application
connects to the shard map manager database to obtain a copy of the shard map. It caches the shard map locally,
and uses the map to route data requests to the appropriate shard. This functionality is hidden behind a series of
APIs that are contained in the Elastic Database client library, which is available for Java and .NET.
For more information about elastic pools, see Scaling out with Azure SQL Database.
To reduce latency and improve availability, you can replicate the global shard map manager database. With the
Premium pricing tiers, you can configure active geo-replication to continuously copy data to databases in different
regions.
Alternatively, use Azure SQL Data Sync or Azure Data Factory to replicate the shard map manager database across
regions. This form of replication runs periodically and is more suitable if the shard map changes infrequently, and
does not require Premium tier.
Elastic Database provides two schemes for mapping data to shardlets and storing them in shards:
A list shard map associates a single key to a shardlet. For example, in a multitenant system, the data for
each tenant can be associated with a unique key and stored in its own shardlet. To guarantee isolation, each
shardlet can be held within its own shard.
A range shard map associates a set of contiguous key values to a shardlet. For example, you can group the
data for a set of tenants (each with their own key) within the same shardlet. This scheme is less expensive
than the first, because tenants share data storage, but has less isolation.

A single shard can contain the data for several shardlets. For example, you can use list shardlets to store data for
different non-contiguous tenants in the same shard. You can also mix range shardlets and list shardlets in the same
shard, although they will be addressed through different maps. The following diagram shows this approach:
Elastic pools make it possible to add and remove shards as the volume of data shrinks and grows. Client
applications can create and delete shards dynamically, and transparently update the shard map manager. However,
removing a shard is a destructive operation that also requires deleting all the data in that shard.
If an application needs to split a shard into two separate shards or combine shards, use the split-merge tool. This
tool runs as an Azure web service, and migrates data safely between shards.
The partitioning scheme can significantly affect the performance of your system. It can also affect the rate at which
shards have to be added or removed, or that data must be repartitioned across shards. Consider the following
points:
Group data that is used together in the same shard, and avoid operations that access data from multiple
shards. A shard is a SQL database in its own right, and cross-database joins must be performed on the client
side.
Although SQL Database does not support cross-database joins, you can use the Elastic Database tools to
perform multi-shard queries. A multi-shard query sends individual queries to each database and merges the
results.
Don't design a system that has dependencies between shards. Referential integrity constraints, triggers, and
stored procedures in one database cannot reference objects in another.
If you have reference data that is frequently used by queries, consider replicating this data across shards.
This approach can remove the need to join data across databases. Ideally, such data should be static or slow-
moving, to minimize the replication effort and reduce the chances of it becoming stale.
Shardlets that belong to the same shard map should have the same schema. This rule is not enforced by
SQL Database, but data management and querying becomes very complex if each shardlet has a different
schema. Instead, create separate shard maps for each schema. Remember that data belonging to different
shardlets can be stored in the same shard.
Transactional operations are only supported for data within a shard, and not across shards. Transactions can
span shardlets as long as they are part of the same shard. Therefore, if your business logic needs to perform
transactions, either store the data in the same shard or implement eventual consistency.
Place shards close to the users that access the data in those shards. This strategy helps reduce latency.
Avoid having a mixture of highly active and relatively inactive shards. Try to spread the load evenly across
shards. This might require hashing the sharding keys. If you are geo-locating shards, make sure that the
hashed keys map to shardlets held in shards stored close to the users that access that data.

Partitioning Azure table storage


Azure table storage is a key-value store that's designed around partitioning. All entities are stored in a partition,
and partitions are managed internally by Azure table storage. Each entity stored in a table must provide a two-part
key that includes:
The partition key. This is a string value that determines the partition where Azure table storage will place
the entity. All entities with the same partition key are stored in the same partition.
The row key. This is a string value that identifies the entity within the partition. All entities within a partition
are sorted lexically, in ascending order, by this key. The partition key/row key combination must be unique
for each entity and cannot exceed 1 KB in length.
If an entity is added to a table with a previously unused partition key, Azure table storage creates a new partition
for this entity. Other entities with the same partition key will be stored in the same partition.
This mechanism effectively implements an automatic scale-out strategy. All entities in a partition are stored together
on the same server in an Azure datacenter, which helps ensure that queries that retrieve data from a single partition run quickly.
Microsoft has published scalability targets for Azure Storage. If your system is likely to exceed these limits, consider
splitting entities into multiple tables. Use vertical partitioning to divide the fields into the groups that are most
likely to be accessed together.
The following diagram shows the logical structure of an example storage account. The storage account contains
three tables: Customer Info, Product Info, and Order Info.
Each table has multiple partitions.
In the Customer Info table, the data is partitioned according to the city where the customer is located. The row
key contains the customer ID.
In the Product Info table, products are partitioned by product category, and the row key contains the product
number.
In the Order Info table, the orders are partitioned by order date, and the row key specifies the time the order
was received. All data is ordered by the row key in each partition.
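For illustration, a minimal sketch (assuming the azure-data-tables Python package, a hypothetical connection string, and the Customer Info table above) of inserting and reading an entity with the city as the partition key and the customer ID as the row key:

```python
from azure.data.tables import TableServiceClient

# Hypothetical connection string; the table is assumed to already exist.
service = TableServiceClient.from_connection_string("<storage-connection-string>")
table = service.get_table_client(table_name="CustomerInfo")

table.create_entity(entity={
    "PartitionKey": "Seattle",   # city determines the partition
    "RowKey": "customer-1001",   # customer ID is unique within the partition
    "Name": "Contoso Coffee",
    "Email": "orders@contoso.example",
})

# A point read that specifies both keys touches exactly one partition and one
# entity, which is the most efficient query Azure table storage can perform.
customer = table.get_entity(partition_key="Seattle", row_key="customer-1001")
```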
Consider the following points when you design your entities for Azure table storage:
Select a partition key and row key by how the data is accessed. Choose a partition key/row key combination
that supports the majority of your queries. The most efficient queries retrieve data by specifying the
partition key and the row key. Queries that specify a partition key and a range of row keys can be completed
by scanning a single partition. This is relatively fast because the data is held in row key order. If queries don't
specify which partition to scan, every partition must be scanned.
If an entity has one natural key, then use it as the partition key and specify an empty string as the row key. If
an entity has a composite key consisting of two properties, select the slowest changing property as the
partition key and the other as the row key. If an entity has more than two key properties, use a
concatenation of properties to provide the partition and row keys.
If you regularly perform queries that look up data by using fields other than the partition and row keys,
consider implementing the Index Table pattern, or consider using a different data store that supports
indexing, such as Cosmos DB.
If you generate partition keys by using a monotonic sequence (such as "0001", "0002", "0003") and each
partition only contains a limited amount of data, Azure table storage can physically group these partitions
together on the same server. Azure Storage assumes that the application is most likely to perform queries
across a contiguous range of partitions (range queries) and is optimized for this case. However, this
approach can lead to hotspots, because all insertions of new entities are likely to be concentrated at one end
of the contiguous range. It can also reduce scalability. To spread the load more evenly, consider hashing the
partition key.
Azure table storage supports transactional operations for entities that belong to the same partition. An
application can perform multiple insert, update, delete, replace, or merge operations as an atomic unit, as
long as the transaction doesn't include more than 100 entities and the payload of the request doesn't exceed
4 MB. Operations that span multiple partitions are not transactional, and might require you to implement
eventual consistency. For more information about table storage and transactions, see Performing entity
group transactions.
Consider the granularity of the partition key:
Using the same partition key for every entity results in a single partition that's held on one server.
This prevents the partition from scaling out and focuses the load on a single server. As a result, this
approach is only suitable for storing a small number of entities. However, it does ensure that all
entities can participate in entity group transactions.
Using a unique partition key for every entity causes the table storage service to create a separate
partition for each entity, possibly resulting in a large number of small partitions. This approach is
more scalable than using a single partition key, but entity group transactions are not possible. Also,
queries that fetch more than one entity might involve reading from more than one server. However, if
the application performs range queries, then using a monotonic sequence for the partition keys
might help to optimize these queries.
Sharing the partition key across a subset of entities makes it possible to group related entities in the
same partition. Operations that involve related entities can be performed by using entity group
transactions, and queries that fetch a set of related entities can be satisfied by accessing a single
server.
For more information, see Azure storage table design guide and Scalable partitioning strategy.

Partitioning Azure blob storage


Azure blob storage makes it possible to hold large binary objects. Use block blobs in scenarios when you need to
upload or download large volumes of data quickly. Use page blobs for applications that require random rather than
serial access to parts of the data.
Each blob (either block or page) is held in a container in an Azure storage account. You can use containers to group
related blobs that have the same security requirements. This grouping is logical rather than physical. Inside a
container, each blob has a unique name.
The partition key for a blob is account name + container name + blob name. The partition key is used to partition
data into ranges and these ranges are load-balanced across the system. Blobs can be distributed across many
servers in order to scale out access to them, but a single blob can only be served by a single server.
If your naming scheme uses timestamps or numerical identifiers, it can lead to excessive traffic going to one
partition, limiting the system from effectively load balancing. For instance, if you have daily operations that use a
blob object with a timestamp such as yyyy-mm-dd, all the traffic for that operation would go to a single partition
server. Instead, consider prefixing the name with a three-digit hash. For more information, see Partition Naming
Convention.
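A minimal sketch of this naming approach (Python with the azure-storage-blob package; the connection string, container, and hash width are illustrative):

```python
import hashlib
from azure.storage.blob import BlobServiceClient

def hashed_blob_name(name: str) -> str:
    """Prefix the blob name with a three-digit hash so that names that would
    otherwise sort together (such as yyyy-mm-dd timestamps) are spread across
    different partition ranges."""
    prefix = int(hashlib.sha256(name.encode("utf-8")).hexdigest(), 16) % 1000
    return f"{prefix:03d}-{name}"

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("daily-exports")

blob_name = hashed_blob_name("2020-11-02-orders.csv")  # for example "417-2020-11-02-orders.csv"
container.upload_blob(name=blob_name, data=b"...csv content...")
```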
The actions of writing a single block or page are atomic, but operations that span blocks, pages, or blobs are not. If
you need to ensure consistency when performing write operations across blocks, pages, and blobs, take out a write
lock by using a blob lease.
Partitioning Azure storage queues
Azure storage queues enable you to implement asynchronous messaging between processes. An Azure storage
account can contain any number of queues, and each queue can contain any number of messages. The only
limitation is the space that's available in the storage account. The maximum size of an individual message is 64 KB.
If you require messages bigger than this, then consider using Azure Service Bus queues instead.
Each storage queue has a unique name within the storage account that contains it. Azure partitions queues based
on the name. All messages for the same queue are stored in the same partition, which is controlled by a single
server. Different queues can be managed by different servers to help balance the load. The allocation of queues to
servers is transparent to applications and users.
In a large-scale application, don't use the same storage queue for all instances of the application because this
approach might cause the server that's hosting the queue to become a hot spot. Instead, use different queues for
different functional areas of the application. Azure storage queues do not support transactions, so directing
messages to different queues should have little effect on messaging consistency.
An Azure storage queue can handle up to 2,000 messages per second. If you need to process messages at a greater
rate than this, consider creating multiple queues. For example, in a global application, create separate storage
queues in separate storage accounts to handle application instances that are running in each region.

Partitioning Azure Service Bus


Azure Service Bus uses a message broker to handle messages that are sent to a Service Bus queue or topic. By
default, all messages that are sent to a queue or topic are handled by the same message broker process. This
architecture can place a limitation on the overall throughput of the message queue. However, you can also partition
a queue or topic when it is created. You do this by setting the EnablePartitioning property of the queue or topic
description to true.
A partitioned queue or topic is divided into multiple fragments, each of which is backed by a separate message
store and message broker. Service Bus takes responsibility for creating and managing these fragments. When an
application posts a message to a partitioned queue or topic, Service Bus assigns the message to a fragment for that
queue or topic. When an application receives a message from a queue or subscription, Service Bus checks each
fragment for the next available message and then passes it to the application for processing.
This structure helps distribute the load across message brokers and message stores, increasing scalability and
improving availability. If the message broker or message store for one fragment is temporarily unavailable, Service
Bus can retrieve messages from one of the remaining available fragments.
Service Bus assigns a message to a fragment as follows:
If the message belongs to a session, all messages with the same value for the SessionId property are sent to
the same fragment.
If the message does not belong to a session, but the sender has specified a value for the PartitionKey
property, then all messages with the same PartitionKey value are sent to the same fragment.

NOTE
If the SessionId and PartitionKey properties are both specified, then they must be set to the same value or the
message will be rejected.

If the SessionId and PartitionKey properties for a message are not specified, but duplicate detection is
enabled, the MessageId property will be used. All messages with the same MessageId will be directed to the
same fragment.
If messages do not include a SessionId, PartitionKey, or MessageId property, then Service Bus assigns
messages to fragments sequentially. If a fragment is unavailable, Service Bus will move on to the next. This
means that a temporary fault in the messaging infrastructure does not cause the message-send operation to
fail.
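A minimal sketch (assuming the azure-servicebus Python package, a hypothetical connection string, and a partitioned queue named orders) of sending messages so that related messages are assigned to the same fragment:

```python
from azure.servicebus import ServiceBusClient, ServiceBusMessage

client = ServiceBusClient.from_connection_string("<servicebus-connection-string>")

with client.get_queue_sender(queue_name="orders") as sender:
    # Messages with the same partition key are assigned to the same fragment.
    sender.send_messages(
        ServiceBusMessage(b'{"orderId": 1001}', partition_key="tenant-42")
    )

    # For session-aware queues, the session ID plays the same role; if both
    # SessionId and PartitionKey are set, they must have the same value.
    sender.send_messages(
        ServiceBusMessage(b'{"orderId": 1002}', session_id="tenant-42")
    )
```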
Consider the following points when deciding if or how to partition a Service Bus message queue or topic:
Service Bus queues and topics are created within the scope of a Service Bus namespace. Service Bus
currently allows up to 100 partitioned queues or topics per namespace.
Each Service Bus namespace imposes quotas on the available resources, such as the number of
subscriptions per topic, the number of concurrent send and receive requests per second, and the maximum
number of concurrent connections that can be established. These quotas are documented in Service Bus
quotas. If you expect to exceed these values, then create additional namespaces with their own queues and
topics, and spread the work across these namespaces. For example, in a global application, create separate
namespaces in each region and configure application instances to use the queues and topics in the nearest
namespace.
Messages that are sent as part of a transaction must specify a partition key. This can be a SessionId,
PartitionKey, or MessageId property. All messages that are sent as part of the same transaction must specify
the same partition key because they must be handled by the same message broker process. You cannot
send messages to different queues or topics within the same transaction.
Partitioned queues and topics can't be configured to be automatically deleted when they become idle.
Partitioned queues and topics can't currently be used with the Advanced Message Queuing Protocol (AMQP)
if you are building cross-platform or hybrid solutions.

Partitioning Cosmos DB
Azure Cosmos DB is a NoSQL database that can store JSON documents using the Azure Cosmos DB SQL API. A
document in a Cosmos DB database is a JSON-serialized representation of an object or other piece of data. No
fixed schemas are enforced except that every document must contain a unique ID.
Documents are organized into collections. You can group related documents together in a collection. For example,
in a system that maintains blog postings, you can store the contents of each blog post as a document in a
collection. You can also create collections for each subject type. Alternatively, in a multitenant application, such as a
system where different authors control and manage their own blog posts, you can partition blogs by author and
create separate collections for each author. The storage space that's allocated to collections is elastic and can shrink
or grow as needed.
Cosmos DB supports automatic partitioning of data based on an application-defined partition key. A logical
partition is a partition that stores all the data for a single partition key value. All documents that share the same
value for the partition key are placed within the same logical partition. Cosmos DB distributes values according to
the hash of the partition key. A logical partition has a maximum size of 10 GB. Therefore, the choice of the partition key
is an important decision at design time. Choose a property with a wide range of values and even access patterns.
For more information, see Partition and scale in Azure Cosmos DB.
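A minimal sketch (assuming the azure-cosmos Python package; the account URL, key, and names are hypothetical) of creating a container whose partition key has a wide range of values:

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<account-key>")
database = client.create_database_if_not_exists("commerce")

# A customer or tenant identifier usually has many distinct values and fairly
# even access patterns, which makes it a reasonable partition key.
container = database.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/customerId"),
)

# Documents that share the same /customerId value land in the same logical partition.
container.create_item({"id": "order-1001", "customerId": "customer-42", "total": 129.95})
```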

NOTE
Each Cosmos DB database has a performance level that determines the amount of resources it gets. A performance level is
associated with a request unit (RU) rate limit. The RU rate limit specifies the volume of resources that's reserved and available
for exclusive use by that collection. The cost of a collection depends on the performance level that's selected for that
collection. The higher the performance level (and RU rate limit) the higher the charge. You can adjust the performance level of
a collection by using the Azure portal. For more information, see Request Units in Azure Cosmos DB.
If the partitioning mechanism that Cosmos DB provides is not sufficient, you may need to shard the data at the
application level. Document collections provide a natural mechanism for partitioning data within a single database.
The simplest way to implement sharding is to create a collection for each shard. Containers are logical resources
and can span one or more servers. Fixed-size containers have a maximum limit of 10 GB and 10,000 RU/s
throughput. Unlimited containers do not have a maximum storage size, but must specify a partition key. With
application sharding, the client application must direct requests to the appropriate shard, usually by implementing
its own mapping mechanism based on some attributes of the data that define the shard key.
All databases are created in the context of a Cosmos DB database account. A single account can contain several
databases, and it specifies in which regions the databases are created. Each account also enforces its own access
control. You can use Cosmos DB accounts to geo-locate shards (collections within databases) close to the users
who need to access them, and enforce restrictions so that only those users can connect to them.
Consider the following points when deciding how to partition data with the Cosmos DB SQL API:
The resources available to a Cosmos DB database are subject to the quota limitations of the
account. Each database can hold a number of collections, and each collection is associated with a
performance level that governs the RU rate limit (reserved throughput) for that collection. For more
information, see Azure subscription and service limits, quotas, and constraints.
Each document must have an attribute that can be used to uniquely identify that document
within the collection in which it is held. This attribute is different from the shard key, which defines
which collection holds the document. A collection can contain a large number of documents. In theory, it's
limited only by the maximum length of the document ID. The document ID can be up to 255 characters.
All operations against a document are performed within the context of a transaction.
Transactions are scoped to the collection in which the document is contained. If an operation fails,
the work that it has performed is rolled back. While a document is subject to an operation, any changes that
are made are subject to snapshot-level isolation. This mechanism guarantees that if, for example, a request
to create a new document fails, another user who's querying the database simultaneously will not see a
partial document that is then removed.
Database queries are also scoped to the collection level. A single query can retrieve data from only
one collection. If you need to retrieve data from multiple collections, you must query each collection
individually and merge the results in your application code.
Cosmos DB supports programmable items that can all be stored in a collection alongside
documents. These include stored procedures, user-defined functions, and triggers (written in JavaScript).
These items can access any document within the same collection. Furthermore, these items run either inside
the scope of the ambient transaction (in the case of a trigger that fires as the result of a create, delete, or
replace operation performed against a document), or by starting a new transaction (in the case of a stored
procedure that is run as the result of an explicit client request). If the code in a programmable item throws
an exception, the transaction is rolled back. You can use stored procedures and triggers to maintain integrity
and consistency between documents, but these documents must all be part of the same collection.
The collections that you intend to hold in the databases should be unlikely to exceed the
throughput limits defined by the performance levels of the collections. For more information, see
Request Units in Azure Cosmos DB. If you anticipate reaching these limits, consider splitting collections
across databases in different accounts to reduce the load per collection.

Partitioning Azure Search


The ability to search for data is often the primary method of navigation and exploration that's provided by many
web applications. It helps users find resources quickly (for example, products in an e-commerce application) based
on combinations of search criteria. The Azure Search service provides full-text search capabilities over web content,
and includes features such as type-ahead, suggested queries based on near matches, and faceted navigation. For
more information, see What is Azure Search?.
Azure Search stores searchable content as JSON documents in a database. You define indexes that specify the
searchable fields in these documents and provide these definitions to Azure Search. When a user submits a search
request, Azure Search uses the appropriate indexes to find matching items.
To reduce contention, the storage that's used by Azure Search can be divided into 1, 2, 3, 4, 6, or 12 partitions, and
each partition can be replicated up to 6 times. The product of the number of partitions multiplied by the number of
replicas is called the search unit (SU). A single instance of Azure Search can contain a maximum of 36 SUs (a
database with 12 partitions only supports a maximum of 3 replicas).
You are billed for each SU that is allocated to your service. As the volume of searchable content increases or the
rate of search requests grows, you can add SUs to an existing instance of Azure Search to handle the extra load.
Azure Search itself distributes the documents evenly across the partitions. No manual partitioning strategies are
currently supported.
Each partition can contain a maximum of 15 million documents or occupy 300 GB of storage space (whichever is
smaller). You can create up to 50 indexes. The performance of the service varies and depends on the complexity of
the documents, the available indexes, and the effects of network latency. On average, a single replica (1 SU) should
be able to handle 15 queries per second (QPS), although we recommend performing benchmarking with your own
data to obtain a more precise measure of throughput. For more information, see Service limits in Azure Search.

NOTE
You can store a limited set of data types in searchable documents, including strings, Booleans, numeric data, datetime data,
and some geographical data. For more information, see the page Supported data types (Azure Search) on the Microsoft
website.

You have limited control over how Azure Search partitions data for each instance of the service. However, in a
global environment you might be able to improve performance and reduce latency and contention further by
partitioning the service itself using either of the following strategies:
Create an instance of Azure Search in each geographic region, and ensure that client applications are
directed toward the nearest available instance. This strategy requires that any updates to searchable content
are replicated in a timely manner across all instances of the service.
Create two tiers of Azure Search:
A local service in each region that contains the data that's most frequently accessed by users in that
region. Users can direct requests here for fast but limited results.
A global service that encompasses all the data. Users can direct requests here for slower but more
complete results.
This approach is most suitable when there is a significant regional variation in the data that's being searched.

Partitioning Azure Cache for Redis


Azure Cache for Redis provides a shared caching service in the cloud that's based on the Redis key-value data store.
As its name implies, Azure Cache for Redis is intended as a caching solution. Use it only for holding transient data
and not as a permanent data store. Applications that use Azure Cache for Redis should be able to continue
functioning if the cache is unavailable. Azure Cache for Redis supports primary/secondary replication to provide
high availability, but currently limits the maximum cache size to 53 GB. If you need more space than this, you must
create additional caches. For more information, see Azure Cache for Redis.
Partitioning a Redis data store involves splitting the data across instances of the Redis service. Each instance
constitutes a single partition. Azure Cache for Redis abstracts the Redis services behind a façade and does not
expose them directly. The simplest way to implement partitioning is to create multiple Azure Cache for Redis
instances and spread the data across them.
You can associate each data item with an identifier (a partition key) that specifies which cache stores the data item.
The client application logic can then use this identifier to route requests to the appropriate partition. This scheme is
very simple, but if the partitioning scheme changes (for example, if additional Azure Cache for Redis instances are
created), client applications might need to be reconfigured.
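A minimal sketch of client-side partitioning (assuming the redis-py package and two hypothetical Azure Cache for Redis instances) that hashes each key to choose a cache:

```python
import hashlib
import redis

# Hypothetical cache instances; Azure Cache for Redis uses TLS on port 6380.
CACHES = [
    redis.Redis(host="mycache-0.redis.cache.windows.net", port=6380,
                password="<access-key-0>", ssl=True),
    redis.Redis(host="mycache-1.redis.cache.windows.net", port=6380,
                password="<access-key-1>", ssl=True),
]

def cache_for_key(key: str) -> redis.Redis:
    """Route a key to one of the cache instances (partitions) by hashing it."""
    digest = int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)
    return CACHES[digest % len(CACHES)]

cache_for_key("customer:99").set("customer:99", '{"name": "Contoso"}', ex=3600)
value = cache_for_key("customer:99").get("customer:99")
```

If the number of caches changes, most keys hash to a different instance, which is the reconfiguration cost noted above.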
Native Redis (not Azure Cache for Redis) supports server-side partitioning based on Redis clustering. In this
approach, you can divide the data evenly across servers by using a hashing mechanism. Each Redis server stores
metadata that describes the range of hash keys that the partition holds, and also contains information about which
hash keys are located in the partitions on other servers.
Client applications simply send requests to any of the participating Redis servers (probably the closest one). The
Redis server examines the client request. If it can be resolved locally, it performs the requested operation.
Otherwise it forwards the request on to the appropriate server.
This model is implemented by using Redis clustering, and is described in more detail on the Redis cluster tutorial
page on the Redis website. Redis clustering is transparent to client applications. Additional Redis servers can be
added to the cluster (and the data can be repartitioned) without requiring that you reconfigure the clients.

IMPORTANT
Azure Cache for Redis currently supports Redis clustering in premium tier only.

The page Partitioning: how to split data among multiple Redis instances on the Redis website provides more
information about implementing partitioning with Redis. The remainder of this section assumes that you are
implementing client-side or proxy-assisted partitioning.
Consider the following points when deciding how to partition data with Azure Cache for Redis:
Azure Cache for Redis is not intended to act as a permanent data store, so whatever partitioning scheme you
implement, your application code must be able to retrieve data from a location that's not the cache.
Data that is frequently accessed together should be kept in the same partition. Redis is a powerful key-value
store that provides several highly optimized mechanisms for structuring data. These mechanisms can be
one of the following:
Simple strings (binary data up to 512 MB in length)
Aggregate types such as lists (which can act as queues and stacks)
Sets (ordered and unordered)
Hashes (which can group related fields together, such as the items that represent the fields in an object)
The aggregate types enable you to associate many related values with the same key. A Redis key identifies a
list, set, or hash rather than the data items that it contains. These types are all available with Azure Cache for
Redis and are described by the Data types page on the Redis website. For example, in part of an e-commerce
system that tracks the orders that are placed by customers, the details of each customer can be stored in a
Redis hash that is keyed by using the customer ID. Each hash can hold a collection of order IDs for the
customer. A separate Redis set can hold the orders, again structured as hashes, and keyed by using the order
ID. Figure 8 shows this structure. Note that Redis does not implement any form of referential integrity, so it
is the developer's responsibility to maintain the relationships between customers and orders.
Figure 8. Suggested structure in Redis storage for recording customer orders and their details.
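As an illustration, a minimal sketch of this structure using the redis-py client follows; the endpoint, customer ID, and order values are hypothetical.

import redis

r = redis.Redis()  # assumes a reachable Redis or Azure Cache for Redis endpoint

# Customer details stored in a hash keyed by customer ID.
r.hset("customer:99", mapping={"name": "Contoso", "city": "Seattle"})

# The customer's order IDs kept in a set under the same customer ID.
r.sadd("customer:99:orders", "1001", "1002")

# Each order stored as its own hash, keyed by order ID.
r.hset("order:1001", mapping={"total": "59.99", "status": "shipped"})
r.hset("order:1002", mapping={"total": "12.50", "status": "pending"})

# Redis enforces no referential integrity, so the application must keep the
# order set and the order hashes consistent itself.
order_ids = r.smembers("customer:99:orders")
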

NOTE
In Redis, all keys are binary data values (like Redis strings) and can contain up to 512 MB of data. In theory, a key can contain
almost any information. However, we recommend adopting a consistent naming convention for keys that is descriptive of the
type of data and that identifies the entity, but is not excessively long. A common approach is to use keys of the form
"entity_type:ID". For example, you can use "customer:99" to indicate the key for a customer with the ID 99.

You can implement vertical partitioning by storing related information in different aggregations in the same
database. For example, in an e-commerce application, you can store commonly accessed information about
products in one Redis hash and less frequently used detailed information in another. Both hashes can use
the same product ID as part of the key. For example, you can use "product:nn" (where nn is the product ID) for the product information and "product_details:nn" for the detailed data. This strategy can help reduce the
volume of data that most queries are likely to retrieve.
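A minimal sketch of this vertical split, again assuming the redis-py client and illustrative field names, follows.

import redis

r = redis.Redis()
product_id = 42

# Frequently accessed summary fields.
r.hset(f"product:{product_id}", mapping={"name": "Widget", "price": "9.99"})

# Rarely needed detail fields in a second hash, so common queries retrieve
# only the smaller summary hash.
r.hset(f"product_details:{product_id}",
       mapping={"description": "Long marketing copy...", "weight_kg": "1.2"})

summary = r.hgetall(f"product:{product_id}")  # the cheap, common case
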
You can repartition a Redis data store, but keep in mind that it's a complex and time-consuming task. Redis
clustering can repartition data automatically, but this capability is not available with Azure Cache for Redis.
Therefore, when you design your partitioning scheme, try to leave sufficient free space in each partition to
allow for expected data growth over time. However, remember that Azure Cache for Redis is intended to
cache data temporarily, and that data held in the cache can have a limited lifetime specified as a time-to-live
(TTL) value. For relatively volatile data, the TTL can be short, but for static data the TTL can be a lot longer.
Avoid storing large amounts of long-lived data in the cache if the volume of this data is likely to fill the
cache. You can specify an eviction policy that causes Azure Cache for Redis to remove data if space is at a
premium.
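For example, a minimal sketch of assigning TTL values with the redis-py client might look like the following; the keys and lifetimes are illustrative.

import redis

r = redis.Redis()

# Volatile data: short TTL (5 minutes).
r.set("session:abc123", "cart-state", ex=300)

# Relatively static data: longer TTL (1 day).
r.set("product:42:summary", "Widget;9.99", ex=86400)

print(r.ttl("session:abc123"))  # remaining lifetime in seconds
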

NOTE
When you use Azure Cache for Redis, you specify the maximum size of the cache (from 250 MB to 53 GB) by
selecting the appropriate pricing tier. However, after an Azure Cache for Redis instance has been created, you cannot increase (or decrease) its size.
Redis batches and transactions cannot span multiple connections, so all data that is affected by a batch or
transaction should be held in the same database (shard).

NOTE
A sequence of operations in a Redis transaction is not necessarily atomic. The commands that compose a transaction
are verified and queued before they run. If an error occurs during this phase, the entire queue is discarded. However,
after the transaction has been successfully submitted, the queued commands run in sequence. If any command fails,
only that command stops running. All previous and subsequent commands in the queue are performed. For more
information, go to the Transactions page on the Redis website.

Redis supports a limited number of atomic operations. The only operations of this type that support
multiple keys and values are MGET and MSET operations. MGET operations return a collection of values for
a specified list of keys, and MSET operations store a collection of values for a specified list of keys. If you
need to use these operations, the key-value pairs that are referenced by the MSET and MGET commands
must be stored within the same database.
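As a minimal illustration, assuming the redis-py client, the multi-key calls look like this; note that both keys must resolve to the same shard.

import redis

r = redis.Redis()

# Store several key-value pairs in a single atomic operation.
r.mset({"customer:99:name": "Contoso", "customer:99:city": "Seattle"})

# Retrieve the values for a list of keys in one round trip.
values = r.mget(["customer:99:name", "customer:99:city"])
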

Partitioning Azure Service Fabric


Azure Service Fabric is a microservices platform that provides a runtime for distributed applications in the cloud.
Service Fabric supports .NET guest executables, stateful and stateless services, and containers. Stateful services provide reliable collections that persistently store data as key-value pairs within the Service Fabric cluster. For
more information about strategies for partitioning keys in a reliable collection, see guidelines and
recommendations for reliable collections in Azure Service Fabric.
More information
Overview of Azure Service Fabric is an introduction to Azure Service Fabric.
Partition Service Fabric reliable services provides more information about reliable services in Azure Service
Fabric.

Partitioning Azure Event Hubs


Azure Event Hubs is designed for data streaming at massive scale, and partitioning is built into the service to
enable horizontal scaling. Each consumer only reads a specific partition of the message stream.
The event publisher is only aware of its partition key, not the partition to which the events are published. This
decoupling of key and partition insulates the sender from needing to know too much about the downstream
processing. (It's also possible to send events directly to a given partition, but generally that's not recommended.)
Consider long-term scale when you select the partition count. After an event hub is created, you can't change the
number of partitions.
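For illustration only, a minimal publishing sketch using the azure-eventhub Python SDK (v5) with a placeholder connection string and event hub name might look like this.

from azure.eventhub import EventHubProducerClient, EventData

connection_string = "<event-hubs-namespace-connection-string>"  # placeholder

producer = EventHubProducerClient.from_connection_string(
    connection_string, eventhub_name="orders")

with producer:
    # All events in this batch share a partition key, so they land in the
    # same partition and can be read in order by a single consumer.
    batch = producer.create_batch(partition_key="customer-99")
    batch.add(EventData("order placed"))
    batch.add(EventData("order paid"))
    producer.send_batch(batch)
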
For more information about using partitions in Event Hubs, see What is Event Hubs?.
For considerations about trade-offs between availability and consistency, see Availability and consistency in Event
Hubs.
Message encoding considerations

Many cloud applications use asynchronous messages to exchange information between components of the system.
An important aspect of messaging is the format used to encode the payload data. After you choose a messaging
technology, the next step is to define how the messages will be encoded. There are many options available, but the
right choice depends on your use case.
This article describes some of the considerations.

Overview
A message exchange between a producer and a consumer needs:
A shape or structure that defines the payload of the message.
An encoding format to represent the payload.
Serialization libraries to read and write the encoded payload.
The producer of the message defines the message shape based on the business logic and the information it wants
to send to the consumer(s). To structure the shape, divide the information into discrete or related subjects (fields).
Decide the characteristics of the values for those fields. Consider: What is the most efficient datatype? Will the
payload always have certain fields? Will the payload have a single record or a repeated set of values?
Then, choose an encoding format depending on your need. Certain factors include the ability to create highly
structured data if you need it, time taken to encode and transfer the message, and the ability to parse the payload.
Depending on the encoding format, choose a serialization library that is well supported.
A consumer of the message must be aware of those decisions so that it knows how to read incoming messages.
To transfer messages, the producer serializes the message to an encoding format. At the receiving end, the
consumer deserializes the payload to use the data. This way both entities share the model and as long as the shape
doesn't change, messaging continues without issues. When the contract changes, the encoding format should be
capable of handling the change without breaking the consumer.
Some encoding formats such as JSON are self-describing, meaning they can be parsed without referencing a
schema. However, such formats tend to yield larger messages. With other formats, the data may not be parsed as
easily but the messages are compact. This article highlights some factors that can help you choose a format.

Encoding format considerations


The encoding format defines how a set of structured data is represented as bytes. The type of message can
influence the format choice. Messages related to business transactions most likely will contain highly structured
data. Also, you may want to retrieve it later for auditing purposes. For a stream of events, you might want to read a
sequence of records as quickly as possible and store it for statistical analysis.
Here are some points to consider when choosing an encoding format.
Human readability
Message encoding can be broadly divided into text-based and binary formats.
With text-based encoding, the message payload is in plain text and therefore can be inspected by a person without
using any code libraries. Human readable formats are suitable for archival data. Also, because a human can read
the payload, text-based formats are easier to debug and send to logs for troubleshooting errors.
The downside is that the payload tends to be larger. A common text-based format is JSON.
Encoding size
Message size impacts network I/O performance across the wire. Binary formats are more compact than text-based
formats. Binary formats require serialization/deserialization libraries. The payload can't be read unless it's decoded.
Use a binary format if you want to reduce wire footprint and transfer messages faster. This category of format is
recommended in scenarios where storage or network bandwidth is a concern. Options for binary formats include
Apache Avro, Google Protocol Buffers (protobuf), MessagePack, and Concise Binary Object Representation (CBOR).
The pros and cons of those formats are described in this section.
The disadvantage is that the payload isn't human readable. Most binary formats use complex systems that can be
costly to maintain. Also, they need specialized libraries to decode, which may not be supported if you want to
retrieve archival data.
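As a rough, non-authoritative comparison, the following sketch encodes the same record with the standard json module and the msgpack package and compares the resulting sizes; the record contents are illustrative.

import json
import msgpack

record = {"orderId": 1001, "customerId": 99, "total": 59.99, "status": "shipped"}

as_json = json.dumps(record).encode("utf-8")
as_msgpack = msgpack.packb(record)

print(len(as_json), len(as_msgpack))  # the binary encoding is typically smaller
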
Understanding the payload
A message payload arrives as a sequence of bytes. To parse this sequence, the consumer must have access to
metadata that describes the data fields in the payload. There are two main approaches for storing and distributing
metadata:
Tagged metadata. In some encodings, notably JSON, fields are tagged with the data type and identifier within the body of the message. These formats are self-describing because they can be parsed into a dictionary of values
without referring to a schema. One way for the consumer to understand the fields is to query for expected values.
For example, the producer sends a payload in JSON. The consumer parses the JSON into a dictionary and checks
the existence of fields to understand the payload. Another way is for the consumer to apply a data model shared by
the producer. For example, if you are using a statically typed language, many JSON serialization libraries can parse
a JSON string into a typed class.
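For example, a consumer handling a tagged (self-describing) JSON payload might probe for expected fields as in this sketch; the field names are illustrative.

import json

raw = '{"orderId": 1001, "customerId": 99, "total": 59.99}'
payload = json.loads(raw)

# Probe for the expected fields rather than assuming a fixed schema.
if "orderId" in payload and "total" in payload:
    total = float(payload["total"])
else:
    # Missing field: log it, dead-letter the message, or apply a default.
    raise ValueError("payload does not contain the expected order fields")
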
Schema. A schema formally defines the structure and data fields of a message. In this model, the producer and
consumer have a contract through a well-defined schema. The schema can define the data types, required/optional
fields, version information, and the structure of the payload. The producer sends the payload as per the writer
schema. The consumer receives the payload by applying a reader schema. The message is serialized/deserialized by
using the encoding-specific libraries. There are two ways to distribute schemas:
Store the schema as a preamble or header in the message but separate from the payload.
Store the schema externally.
Some encoding formats define the schema and use tools that generate classes from the schema. The producer and
consumer use those classes and libraries to serialize and deserialize the payload. The libraries also provide
compatibility checks between the writer and reader schema. Both protobuf and Apache Avro follow that approach.
The key difference is that protobuf has a language-agnostic schema definition but Avro uses compact JSON.
Another difference is in the way both formats provide compatibility checks between reader and writer schemas.
Another way is to store the schema externally, in a schema registry. The message contains a reference to the schema and the payload. The producer sends the schema identifier in the message, and the consumer retrieves the schema from the external store by specifying that identifier. Both parties use a format-specific library to read and write messages. Apart from storing the schema, a registry can provide compatibility checks to make sure the contract between the producer and consumer isn't broken as the schema evolves.
Before choosing an approach, decide what is more important: the transfer data size or the ability to parse the
archived data later.
Storing the schema along with the payload yields a larger encoding size, so it is preferred for intermittent messages. Storing the schema externally keeps messages small and is a better choice when transferring fewer bytes is crucial or when you expect a long sequence of records, but the cost of maintaining an external schema store can be high.
However, if on-demand decoding of the payload is more important than size, including the schema with the payload (or using the tagged metadata approach) guarantees that the data can be decoded later, at the cost of a significant increase in message size and possibly higher storage costs.
Schema versioning
As business requirements change, the shape is expected to change, and the schema will evolve. Versioning allows
the producer to indicate schema updates that might include new features. There are two aspects to versioning:
The consumer should be aware of the changes.
One way is for the consumer to check all fields to determine whether the schema has changed. Another way
is for the producer to publish a schema version number with the message. When the schema evolves, the
producer increments the version.
Changes must not affect or break the business logic of consumers.
Suppose a field is added to an existing schema. If consumers using the new version get a payload as per the
old version, their logic might break if they are not able to overlook the lack of the new field. Considering the
reverse case, suppose a field is removed in the new schema. Consumers using the old schema might not be
able to read the data.
Encoding formats such as Avro offer the ability to define default values. In the preceding example, if the field
is added with a default value, the missing field will be populated with the default value. Other formats such
as protobuf provide similar functionality through required and optional fields.
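As an illustration of default values during schema evolution, here is a minimal sketch using the fastavro package; the schemas and field names are hypothetical.

import io
from fastavro import schemaless_writer, schemaless_reader

writer_schema = {
    "type": "record", "name": "Order",
    "fields": [{"name": "orderId", "type": "int"}],
}

# Newer reader schema adds a field with a default value.
reader_schema = {
    "type": "record", "name": "Order",
    "fields": [
        {"name": "orderId", "type": "int"},
        {"name": "status", "type": "string", "default": "unknown"},
    ],
}

buf = io.BytesIO()
schemaless_writer(buf, writer_schema, {"orderId": 1001})
buf.seek(0)

# The missing field is filled in from the reader schema's default value.
record = schemaless_reader(buf, writer_schema, reader_schema)
print(record)  # {'orderId': 1001, 'status': 'unknown'}
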
Payload structure
Consider the way data is arranged in the payload. Is it a sequence of records or a discrete single payload? The
payload structure can be categorized into one of these models:
Array/dictionary/value: Defines entries that hold values in one-dimensional or multi-dimensional arrays. Entries have unique key-value pairs. The model can be extended to represent complex structures. Some examples include JSON, Apache Avro, and MessagePack.
This layout is suitable if messages are individually encoded with different schemas. If you have multiple records, the repeated field names make the payload redundant and cause it to bloat.
Tabular data: Information is divided into rows and columns. Each column indicates a field, or the subject of the information, and each row contains values for those fields. This layout is efficient for a repeating set of information, such as time-series data.
CSV is one of the simplest text-based formats. It presents data as a sequence of records with a common header. For binary encoding, Apache Avro has a preamble that is similar to a CSV header but generates a more compact encoding.
Library support
Consider using well-known formats over a proprietary model.
Well-known formats are supported through libraries that are universally supported by the community. With
specialized formats, you need specific libraries. Your business logic might have to work around some of the API
design choices provided by the libraries.
For a schema-based format, choose an encoding library that performs compatibility checks between the reader and
writer schema. Certain encoding libraries, such as Apache Avro, expect the consumer to specify both writer and the
reader schema before deserializing the message. This check ensures that the consumer is aware of the schema
versions.
Interoperability
Your choice of formats might depend on the particular workload or technology ecosystem.
For example:
Azure Stream Analytics has native support for JSON, CSV, and Avro. When using Stream Analytics, it makes
sense to choose one of these formats if possible. If not, you can provide a custom deserializer, but this adds
some additional complexity to your solution.
JSON is a standard interchange format for HTTP REST APIs. If your application receives JSON payloads from
clients and then places these onto a message queue for asynchronous processing, it might make sense to
use JSON for the messaging, rather than re-encode into a different format.
These are just two examples of interoperability considerations. In general, standardized formats will be more
interoperable than custom formats. In text-based options, JSON is one of the most interoperable.

Choices for encoding formats


Here are some popular encoding formats. Factor in the considerations before you choose a format.
JSON
JSON is an open standard (IETF RFC8259). It's a text-based format that follows the array/dictionary/value model.
JSON can be used for tagging metadata and you can parse the payload without a schema. JSON supports the
option to specify optional fields, which helps with forward and backward compatibility.
The biggest advantage is that it's universally available. It's the most interoperable format and the default encoding format for many messaging services.
Being a text-based format, it isn't efficient over the wire, and it's not an ideal choice in cases where storage is a concern.
If you're returning cached items directly to a client via HTTP, storing JSON could save the cost of deserializing from
another format and then serializing to JSON.
Use JSON for single-record messages or for a sequence of messages in which each message has a different schema.
Avoid using JSON for a sequence of records, such as time-series data.
There are other variations of JSON such as BSON, which is a binary encoding aligned to work with MongoDB.
Comma-Separated Values (CSV)
CSV is a text-based tabular format. The header of the table indicates the fields. It's a preferred choice where the
message contains a set of records.
The disadvantage is lack of standardization. There are many ways of expressing separators, headers, and empty
fields.
Protocol Buffers (protobuf)
Google Protocol Buffers (or protobuf) is a serialization format that uses strongly typed definition files to define
schemas in key/value pairs. These definition files are then compiled to language-specific classes that are used for
serializing and deserializing messages.
The message is a small, compressed binary payload, which results in faster transfer. The downside is that the payload isn't human readable. Also, because the schema is external, protobuf is not recommended for cases where you have to retrieve archived data.
Apache Avro
Apache Avro is a binary serialization format that uses a definition file similar to protobuf's, but there is no compilation step. Instead, serialized data always includes a schema preamble.
The preamble can hold the header or a schema identifier. Because of the smaller encoding size, Avro is
recommended for streaming data. Also, because it has a header that applies to a set of records, it's a good choice
for tabular data.
MessagePack
MessagePack is a binary serialization format that is designed to be compact for transmission over the wire. There
are no message schemas or message type checking. This format isn't recommended for bulk storage.
CBOR
Concise Binary Object Representation (CBOR) is a binary format that offers a small encoding size. The advantage of CBOR over MessagePack is that it's standardized by the IETF in RFC 7049.

Next steps
Understand messaging design patterns for cloud applications.
Best practices for monitoring cloud applications

Distributed applications and services running in the cloud are, by their nature, complex pieces of software that
comprise many moving parts. In a production environment, it's important to be able to track the way in which
users use your system, trace resource utilization, and generally monitor the health and performance of your
system. You can use this information as a diagnostic aid to detect and correct issues, and also to help spot
potential problems and prevent them from occurring.

Monitoring and diagnostics scenarios


You can use monitoring to gain an insight into how well a system is functioning. Monitoring is a crucial part of
maintaining quality-of-service targets. Common scenarios for collecting monitoring data include:
Ensuring that the system remains healthy.
Tracking the availability of the system and its component elements.
Maintaining performance to ensure that the throughput of the system does not degrade unexpectedly as the
volume of work increases.
Guaranteeing that the system meets any service-level agreements (SLAs) established with customers.
Protecting the privacy and security of the system, users, and their data.
Tracking the operations that are performed for auditing or regulatory purposes.
Monitoring the day-to-day usage of the system and spotting trends that might lead to problems if they're not
addressed.
Tracking issues that occur, from initial report through to analysis of possible causes, rectification, consequent
software updates, and deployment.
Tracing operations and debugging software releases.

NOTE
This list is not intended to be comprehensive. This document focuses on these scenarios as the most common situations for
performing monitoring. There might be others that are less common or are specific to your environment.

The following sections describe these scenarios in more detail. The information for each scenario is discussed in
the following format:
1. A brief overview of the scenario.
2. The typical requirements of this scenario.
3. The raw instrumentation data that's required to support the scenario, and possible sources of this information.
4. How this raw data can be analyzed and combined to generate meaningful diagnostic information.

Health monitoring
A system is healthy if it is running and capable of processing requests. The purpose of health monitoring is to
generate a snapshot of the current health of the system so that you can verify that all components of the system
are functioning as expected.
Requirements for health monitoring
An operator should be alerted quickly (within a matter of seconds) if any part of the system is deemed to be
unhealthy. The operator should be able to ascertain which parts of the system are functioning normally, and
which parts are experiencing problems. System health can be highlighted through a traffic-light system:
Red for unhealthy (the system has stopped)
Yellow for partially healthy (the system is running with reduced functionality)
Green for completely healthy
A comprehensive health-monitoring system enables an operator to drill down through the system to view the
health status of subsystems and components. For example, if the overall system is depicted as partially healthy,
the operator should be able to zoom in and determine which functionality is currently unavailable.
Data sources, instrumentation, and data-collection requirements
The raw data that's required to support health monitoring can be generated as a result of:
Tracing execution of user requests. This information can be used to determine which requests have succeeded,
which have failed, and how long each request takes.
Synthetic user monitoring. This process simulates the steps performed by a user and follows a predefined
series of steps. The results of each step should be captured.
Logging exceptions, faults, and warnings. This information can be captured as a result of trace statements
embedded into the application code, as well as retrieving information from the event logs of any services that
the system references.
Monitoring the health of any third-party services that the system uses. This monitoring might require
retrieving and parsing health data that these services supply. This information might take a variety of formats.
Endpoint monitoring. This mechanism is described in more detail in the "Availability monitoring" section.
Collecting ambient performance information, such as background CPU utilization or I/O (including network)
activity.
Analyzing health data
The primary focus of health monitoring is to quickly indicate whether the system is running. Hot analysis of the
immediate data can trigger an alert if a critical component is detected as unhealthy. (It fails to respond to a
consecutive series of pings, for example.) The operator can then take the appropriate corrective action.
A more advanced system might include a predictive element that performs a cold analysis over recent and current
workloads. A cold analysis can spot trends and determine whether the system is likely to remain healthy or
whether the system will need additional resources. This predictive element should be based on critical
performance metrics, such as:
The rate of requests directed at each service or subsystem.
The response times of these requests.
The volume of data flowing into and out of each service.
If the value of any metric exceeds a defined threshold, the system can raise an alert to enable an operator or
autoscaling (if available) to take the preventative actions necessary to maintain system health. These actions
might involve adding resources, restarting one or more services that are failing, or applying throttling to lower-
priority requests.

Availability monitoring
A truly healthy system requires that the components and subsystems that compose the system are available.
Availability monitoring is closely related to health monitoring. But whereas health monitoring provides an
immediate view of the current health of the system, availability monitoring is concerned with tracking the
availability of the system and its components to generate statistics about the uptime of the system.
In many systems, some components (such as a database) are configured with built-in redundancy to permit rapid
failover in the event of a serious fault or loss of connectivity. Ideally, users should not be aware that such a failure
has occurred. But from an availability monitoring perspective, it's necessary to gather as much information as
possible about such failures to determine the cause and take corrective actions to prevent them from recurring.
The data that's required to track availability might depend on a number of lower-level factors. Many of these
factors might be specific to the application, system, and environment. An effective monitoring system captures the
availability data that corresponds to these low-level factors and then aggregates them to give an overall picture of
the system. For example, in an e-commerce system, the business functionality that enables a customer to place
orders might depend on the repository where order details are stored and the payment system that handles the
monetary transactions for paying for these orders. The availability of the order-placement part of the system is
therefore a function of the availability of the repository and the payment subsystem.
Requirements for availability monitoring
An operator should also be able to view the historical availability of each system and subsystem, and use this
information to spot any trends that might cause one or more subsystems to periodically fail. (Do services start to
fail at a particular time of day that corresponds to peak processing hours?)
A monitoring solution should provide an immediate and historical view of the availability or unavailability of each
subsystem. It should also be capable of quickly alerting an operator when one or more services fail or when users
can't connect to services. This is a matter of not only monitoring each service, but also examining the actions that
each user performs if these actions fail when they attempt to communicate with a service. To some extent, a
degree of connectivity failure is normal and might be due to transient errors. But it might be useful to allow the
system to raise an alert for the number of connectivity failures to a specified subsystem that occur during a
specific period.
Data sources, instrumentation, and data-collection requirements
As with health monitoring, the raw data that's required to support availability monitoring can be generated as a
result of synthetic user monitoring and logging any exceptions, faults, and warnings that might occur. In addition,
availability data can be obtained from performing endpoint monitoring. The application can expose one or more
health endpoints, each testing access to a functional area within the system. The monitoring system can ping each
endpoint by following a defined schedule and collect the results (success or fail).
All timeouts, network connectivity failures, and connection retry attempts must be recorded. All data should be
timestamped.
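A minimal endpoint-monitoring sketch, assuming the Python requests package and a hypothetical health URL and polling interval, might look like the following.

import time
from datetime import datetime, timezone

import requests

HEALTH_URL = "https://contoso.example.com/health"   # hypothetical endpoint
POLL_INTERVAL_SECONDS = 60

def check_once():
    timestamp = datetime.now(timezone.utc).isoformat()
    try:
        response = requests.get(HEALTH_URL, timeout=5)
        healthy = response.status_code == 200
    except requests.RequestException:
        # Timeouts and connectivity failures are recorded as failures.
        healthy = False
    return {"timestamp": timestamp, "endpoint": HEALTH_URL, "healthy": healthy}

while True:
    result = check_once()
    print(result)                     # in practice, forward to the monitoring store
    time.sleep(POLL_INTERVAL_SECONDS)
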
Analyzing availability data
The instrumentation data must be aggregated and correlated to support the following types of analysis:
The immediate availability of the system and subsystems.
The availability failure rates of the system and subsystems. Ideally, an operator should be able to correlate
failures with specific activities: what was happening when the system failed?
A historical view of failure rates of the system or any subsystems across any specified period, and the load on
the system (number of user requests, for example) when a failure occurred.
The reasons for unavailability of the system or any subsystems. For example, the reasons might be service not
running, connectivity lost, connected but timing out, and connected but returning errors.
You can calculate the percentage availability of a service over a period of time by using the following formula:

%Availability = ((Total Time – Total Downtime) / Total Time) * 100

This is useful for SLA purposes. (SLA monitoring is described in more detail later in this guidance.) The definition
of downtime depends on the service. For example, Visual Studio Team Services Build Service defines downtime as
the period (total accumulated minutes) during which Build Service is unavailable. A minute is considered
unavailable if all continuous HTTP requests to Build Service to perform customer-initiated operations throughout
the minute either result in an error code or do not return a response.
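The formula translates directly into code; the figures below are sample values only.

def availability_percent(total_minutes, downtime_minutes):
    return ((total_minutes - downtime_minutes) / total_minutes) * 100

# Example: a 30-day month (43,200 minutes) with 45 minutes of downtime.
print(round(availability_percent(43_200, 45), 3))  # 99.896
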
Performance monitoring
As the system is placed under more and more stress (by increasing the volume of users), the size of the datasets
that these users access grows and the possibility of failure of one or more components becomes more likely.
Frequently, component failure is preceded by a decrease in performance. If you're able to detect such a decrease, you
can take proactive steps to remedy the situation.
System performance depends on a number of factors. Each factor is typically measured through key performance
indicators (KPIs), such as the number of database transactions per second or the volume of network requests that
are successfully serviced in a specified time frame. Some of these KPIs might be available as specific performance
measures, whereas others might be derived from a combination of metrics.

NOTE
Determining poor or good performance requires that you understand the level of performance at which the system should
be capable of running. This requires observing the system while it's functioning under a typical load and capturing the data
for each KPI over a period of time. This might involve running the system under a simulated load in a test environment and
gathering the appropriate data before deploying the system to a production environment.
You should also ensure that monitoring for performance purposes does not become a burden on the system. You might be
able to dynamically adjust the level of detail for the data that the performance monitoring process gathers.

Requirements for performance monitoring


To examine system performance, an operator typically needs to see information that includes:
The response rates for user requests.
The number of concurrent user requests.
The volume of network traffic.
The rates at which business transactions are being completed.
The average processing time for requests.
It can also be helpful to provide tools that enable an operator to help spot correlations, such as:
The number of concurrent users versus request latency times (how long it takes to start processing a request
after the user has sent it).
The number of concurrent users versus the average response time (how long it takes to complete a request
after it has started processing).
The volume of requests versus the number of processing errors.
Along with this high-level functional information, an operator should be able to obtain a detailed view of the
performance for each component in the system. This data is typically provided through low-level performance
counters that track information such as:
Memory utilization.
Number of threads.
CPU processing time.
Request queue length.
Disk or network I/O rates and errors.
Number of bytes written or read.
Middleware indicators, such as queue length.
All visualizations should allow an operator to specify a time period. The displayed data might be a snapshot of the
current situation and/or a historical view of the performance.
An operator should be able to raise an alert based on any performance measure for any specified value during
any specified time interval.
Data sources, instrumentation, and data-collection requirements
You can gather high-level performance data (throughput, number of concurrent users, number of business
transactions, error rates, and so on) by monitoring the progress of users' requests as they arrive and pass through
the system. This involves incorporating tracing statements at key points in the application code, together with
timing information. All faults, exceptions, and warnings should be captured with sufficient data for correlating
them with the requests that caused them. The Internet Information Services (IIS) log is another useful source.
If possible, you should also capture performance data for any external systems that the application uses. These
external systems might provide their own performance counters or other features for requesting performance
data. If this is not possible, record information such as the start time and end time of each request made to an
external system, together with the status (success, fail, or warning) of the operation. For example, you can use a
stopwatch approach to time requests: start a timer when the request starts and then stop the timer when the
request finishes.
Low-level performance data for individual components in a system might be available through features and
services such as Windows performance counters and Azure Diagnostics.
Analyzing performance data
Much of the analysis work consists of aggregating performance data by user request type and/or the subsystem
or service to which each request is sent. An example of a user request is adding an item to a shopping cart or
performing the checkout process in an e-commerce system.
Another common requirement is summarizing performance data in selected percentiles. For example, an operator
might determine the response times for 99 percent of requests, 95 percent of requests, and 70 percent of
requests. There might be SLA targets or other goals set for each percentile. The ongoing results should be
reported in near real time to help detect immediate issues. The results should also be aggregated over the longer
time for statistical purposes.
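As a small illustration, percentile summaries can be computed from collected latencies with the standard library; the sample values below are placeholders for real telemetry.

from statistics import quantiles

# Sample request latencies in milliseconds for a given time window.
latencies_ms = [120, 95, 230, 310, 150, 180, 90, 400, 125, 160]

# quantiles(n=100) returns the 1st through 99th percentile cut points.
cuts = quantiles(latencies_ms, n=100)
p70, p95, p99 = cuts[69], cuts[94], cuts[98]
print(f"p70={p70:.0f} ms  p95={p95:.0f} ms  p99={p99:.0f} ms")
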
In the case of latency issues affecting performance, an operator should be able to quickly identify the cause of the
bottleneck by examining the latency of each step that each request performs. The performance data must
therefore provide a means of correlating performance measures for each step to tie them to a specific request.
Depending on the visualization requirements, it might be useful to generate and store a data cube that contains
views of the raw data. This data cube can allow complex ad hoc querying and analysis of the performance
information.

Security monitoring
All commercial systems that include sensitive data must implement a security structure. The complexity of the
security mechanism is usually a function of the sensitivity of the data. In a system that requires users to be
authenticated, you should record:
All sign-in attempts, whether they fail or succeed.
All operations performed by—and the details of all resources accessed by—an authenticated user.
When a user ends a session and signs out.
Monitoring might be able to help detect attacks on the system. For example, a large number of failed sign-in
attempts might indicate a brute-force attack. An unexpected surge in requests might be the result of a distributed
denial-of-service (DDoS) attack. You must be prepared to monitor all requests to all resources regardless of the
source of these requests. A system that has a sign-in vulnerability might accidentally expose resources to the
outside world without requiring a user to actually sign in.
Requirements for security monitoring
The most critical aspects of security monitoring should enable an operator to quickly:
Detect attempted intrusions by an unauthenticated entity.
Identify attempts by entities to perform operations on data for which they have not been granted access.
Determine whether the system, or some part of the system, is under attack from outside or inside. (For
example, a malicious authenticated user might be attempting to bring the system down.)
To support these requirements, an operator should be notified if:
One account makes repeated failed sign-in attempts within a specified period.
One authenticated account repeatedly tries to access a prohibited resource during a specified period.
A large number of unauthenticated or unauthorized requests occur during a specified period.
The information that's provided to an operator should include the host address of the source for each request. If
security violations regularly arise from a particular range of addresses, these hosts might be blocked.
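As a minimal sketch (the window, threshold, and account name are assumptions), detection of repeated failed sign-ins within a sliding window could look like this.

from collections import defaultdict, deque
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)   # assumed detection window
THRESHOLD = 5                   # assumed number of failures that triggers an alert

failed_attempts = defaultdict(deque)

def record_failed_signin(account, timestamp):
    """Return True if this account should raise an alert."""
    attempts = failed_attempts[account]
    attempts.append(timestamp)
    # Discard attempts that have fallen outside the sliding window.
    while attempts and timestamp - attempts[0] > WINDOW:
        attempts.popleft()
    return len(attempts) >= THRESHOLD

# Example: the fifth failure inside five minutes triggers the alert.
now = datetime.now(timezone.utc)
for i in range(5):
    alert = record_failed_signin("user@contoso.com", now + timedelta(seconds=i))
print(alert)  # True
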
A key part in maintaining the security of a system is being able to quickly detect actions that deviate from the
usual pattern. Information such as the number of failed and/or successful sign-in requests can be displayed
visually to help detect whether there is a spike in activity at an unusual time. (An example of this activity is users
signing in at 3:00 AM and performing a large number of operations when their working day starts at 9:00 AM).
This information can also be used to help configure time-based autoscaling. For example, if an operator observes
that a large number of users regularly sign in at a particular time of day, the operator can arrange to start
additional authentication services to handle the volume of work, and then shut down these additional services
when the peak has passed.
Data sources, instrumentation, and data-collection requirements
Security is an all-encompassing aspect of most distributed systems. The pertinent data is likely to be generated at
multiple points throughout a system. You should consider adopting a Security Information and Event
Management (SIEM) approach to gather the security-related information that results from events raised by the
application, network equipment, servers, firewalls, antivirus software, and other intrusion-prevention elements.
Security monitoring can incorporate data from tools that are not part of your application. These tools can include
utilities that identify port-scanning activities by external agencies, or network filters that detect attempts to gain
unauthenticated access to your application and data.
In all cases, the gathered data must enable an administrator to determine the nature of any attack and take the
appropriate countermeasures.
Analyzing security data
A feature of security monitoring is the variety of sources from which the data arises. The different formats and
level of detail often require complex analysis of the captured data to tie it together into a coherent thread of
information. Apart from the simplest of cases (such as detecting a large number of failed sign-ins, or repeated
attempts to gain unauthorized access to critical resources), it might not be possible to perform any complex
automated processing of security data. Instead, it might be preferable to write this data, timestamped but
otherwise in its original form, to a secure repository to allow for expert manual analysis.

SLA monitoring
Many commercial systems that support paying customers make guarantees about the performance of the system
in the form of SLAs. Essentially, SLAs state that the system can handle a defined volume of work within an agreed
time frame and without losing critical information. SLA monitoring is concerned with ensuring that the system
can meet measurable SLAs.
NOTE
SLA monitoring is closely related to performance monitoring. But whereas performance monitoring is concerned with
ensuring that the system functions optimally, SLA monitoring is governed by a contractual obligation that defines what
optimally actually means.

SLAs are often defined in terms of:


Overall system availability. For example, an organization might guarantee that the system will be available for
99.9 percent of the time. This equates to no more than 9 hours of downtime per year, or approximately 10
minutes a week.
Operational throughput. This aspect is often expressed as one or more high-water marks, such as
guaranteeing that the system can support up to 100,000 concurrent user requests or handle 10,000
concurrent business transactions.
Operational response time. The system might also make guarantees for the rate at which requests are
processed. An example is that 99 percent of all business transactions will finish within 2 seconds, and no single
transaction will take longer than 10 seconds.

NOTE
Some contracts for commercial systems might also include SLAs for customer support. An example is that all help-desk
requests will elicit a response within five minutes, and that 99 percent of all problems will be fully addressed within 1
working day. Effective issue tracking (described later in this section) is key to meeting SLAs such as these.

Requirements for SLA monitoring


At the highest level, an operator should be able to determine at a glance whether the system is meeting the
agreed SLAs or not. And if not, the operator should be able to drill down and examine the underlying factors to
determine the reasons for substandard performance.
Typical high-level indicators that can be depicted visually include:
The percentage of service uptime.
The application throughput (measured in terms of successful transactions and/or operations per second).
The number of successful/failing application requests.
The number of application and system faults, exceptions, and warnings.
All of these indicators should be capable of being filtered by a specified period of time.
A cloud application will likely comprise a number of subsystems and components. An operator should be able to
select a high-level indicator and see how it's composed from the health of the underlying elements. For example,
if the uptime of the overall system falls below an acceptable value, an operator should be able to zoom in and
determine which elements are contributing to this failure.

NOTE
System uptime needs to be defined carefully. In a system that uses redundancy to ensure maximum availability, individual
instances of elements might fail, but the system can remain functional. System uptime as presented by health monitoring
should indicate the aggregate uptime of each element and not necessarily whether the system has actually halted.
Additionally, failures might be isolated. So even if a specific system is unavailable, the remainder of the system might remain
available, although with decreased functionality. (In an e-commerce system, a failure in the system might prevent a
customer from placing orders, but the customer might still be able to browse the product catalog.)

For alerting purposes, the system should be able to raise an event if any of the high-level indicators exceed a
specified threshold. The lower-level details of the various factors that compose the high-level indicator should be
available as contextual data to the alerting system.
Data sources, instrumentation, and data-collection requirements
The raw data that's required to support SLA monitoring is similar to the raw data that's required for performance
monitoring, together with some aspects of health and availability monitoring. (See those sections for more
details.) You can capture this data by:
Performing endpoint monitoring.
Logging exceptions, faults, and warnings.
Tracing the execution of user requests.
Monitoring the availability of any third-party services that the system uses.
Using performance metrics and counters.
All data must be timed and timestamped.
Analyzing SLA data
The instrumentation data must be aggregated to generate a picture of the overall performance of the system.
Aggregated data must also support drill-down to enable examination of the performance of the underlying
subsystems. For example, you should be able to:
Calculate the total number of user requests during a specified period and determine the success and failure
rate of these requests.
Combine the response times of user requests to generate an overall view of system response times.
Analyze the progress of user requests to break down the overall response time of a request into the response
times of the individual work items in that request.
Determine the overall availability of the system as a percentage of uptime for any specific period.
Analyze the percentage time availability of the individual components and services in the system. This might
involve parsing logs that third-party services have generated.
Many commercial systems are required to report real performance figures against agreed SLAs for a specified
period, typically a month. This information can be used to calculate credits or other forms of repayments for
customers if the SLAs are not met during that period. You can calculate availability for a service by using the
technique described in the section Analyzing availability data.
For internal purposes, an organization might also track the number and nature of incidents that caused services to
fail. Learning how to resolve these issues quickly, or eliminate them completely, will help to reduce downtime and
meet SLAs.

Auditing
Depending on the nature of the application, there might be statutory or other legal regulations that specify
requirements for auditing users' operations and recording all data access. Auditing can provide evidence that links
customers to specific requests. Nonrepudiation is an important factor in many e-business systems to help
maintain trust between a customer and the organization that's responsible for the application or service.
Requirements for auditing
An analyst must be able to trace the sequence of business operations that users are performing so that you can
reconstruct users' actions. This might be necessary simply as a matter of record, or as part of a forensic
investigation.
Audit information is highly sensitive. It will likely include data that identifies the users of the system, together with
the tasks that they're performing. For this reason, audit information will most likely take the form of reports that
are available only to trusted analysts rather than as an interactive system that supports drill-down of graphical
operations. An analyst should be able to generate a range of reports. For example, reports might list all users'
activities occurring during a specified time frame, detail the chronology of activity for a single user, or list the
sequence of operations performed against one or more resources.
Data sources, instrumentation, and data-collection requirements
The primary sources of information for auditing can include:
The security system that manages user authentication.
Trace logs that record user activity.
Security logs that track all identifiable and unidentifiable network requests.
The format of the audit data and the way in which it's stored might be driven by regulatory requirements. For
example, it might not be possible to clean the data in any way. (It must be recorded in its original format.) Access
to the repository where it's held must be protected to prevent tampering.
Analyzing audit data
An analyst must be able to access the raw data in its entirety, in its original form. Aside from the requirement to
generate common audit reports, the tools for analyzing this data are likely to be specialized and kept external to
the system.

Usage monitoring
Usage monitoring tracks how the features and components of an application are used. An operator can use the
gathered data to:
Determine which features are heavily used and determine any potential hotspots in the system. High-traffic
elements might benefit from functional partitioning or even replication to spread the load more evenly. An
operator can also use this information to ascertain which features are infrequently used and are possible
candidates for retirement or replacement in a future version of the system.
Obtain information about the operational events of the system under normal use. For example, in an e-
commerce site, you can record the statistical information about the number of transactions and the volume
of customers that are responsible for them. This information can be used for capacity planning as the
number of customers grows.
Detect (possibly indirectly) user satisfaction with the performance or functionality of the system. For
example, if a large number of customers in an e-commerce system regularly abandon their shopping carts,
this might be due to a problem with the checkout functionality.
Generate billing information. A commercial application or multitenant service might charge customers for
the resources that they use.
Enforce quotas. If a user in a multitenant system exceeds their paid quota of processing time or resource
usage during a specified period, their access can be limited or processing can be throttled.
Requirements for usage monitoring
To examine system usage, an operator typically needs to see information that includes:
The number of requests that are processed by each subsystem and directed to each resource.
The work that each user is performing.
The volume of data storage that each user occupies.
The resources that each user is accessing.
An operator should also be able to generate graphs. For example, a graph might display the most resource-
hungry users, or the most frequently accessed resources or system features.
Data sources, instrumentation, and data-collection requirements
Usage tracking can be performed at a relatively high level. It can note the start and end times of each request and
the nature of the request (read, write, and so on, depending on the resource in question). You can obtain this
information by:
Tracing user activity.
Capturing performance counters that measure the utilization for each resource.
Monitoring the resource consumption by each user.
For metering purposes, you also need to be able to identify which users are responsible for performing which
operations, and the resources that these operations use. The gathered information should be detailed enough to
enable accurate billing.

Issue tracking
Customers and other users might report issues if unexpected events or behavior occurs in the system. Issue
tracking is concerned with managing these issues, associating them with efforts to resolve any underlying
problems in the system, and informing customers of possible resolutions.
Requirements for issue tracking
Operators often perform issue tracking by using a separate system that enables them to record and report the
details of problems that users report. These details can include the tasks that the user was trying to perform,
symptoms of the problem, the sequence of events, and any error or warning messages that were issued.
Data sources, instrumentation, and data-collection requirements
The initial data source for issue-tracking data is the user who reported the issue in the first place. The user might
be able to provide additional data such as:
A crash dump (if the application includes a component that runs on the user's desktop).
A screen snapshot.
The date and time when the error occurred, together with any other environmental information such as the
user's location.
This information can be used to help the debugging effort and help construct a backlog for future releases of the
software.
Analyzing issue-tracking data
Different users might report the same problem. The issue-tracking system should associate common reports.
The progress of the debugging effort should be recorded against each issue report. When the problem is
resolved, the customer can be informed of the solution.
If a user reports an issue that has a known solution in the issue-tracking system, the operator should be able to
inform the user of the solution immediately.

Tracing operations and debugging software releases


When a user reports an issue, the user is often only aware of the immediate effect that it has on their operations.
The user can only report the results of their own experience back to an operator who is responsible for
maintaining the system. These experiences are usually just a visible symptom of one or more fundamental
problems. In many cases, an analyst will need to dig through the chronology of the underlying operations to
establish the root cause of the problem. This process is called root cause analysis.
NOTE
Root cause analysis might uncover inefficiencies in the design of an application. In these situations, it might be possible to
rework the affected elements and deploy them as part of a subsequent release. This process requires careful control, and
the updated components should be monitored closely.

Requirements for tracing and debugging


For tracing unexpected events and other problems, it's vital that the monitoring data provides enough information
to enable an analyst to trace back to the origins of these issues and reconstruct the sequence of events that
occurred. This information must be sufficient to enable an analyst to diagnose the root cause of any problems. A
developer can then make the necessary modifications to prevent them from recurring.
Data sources, instrumentation, and data-collection requirements
Troubleshooting can involve tracing all the methods (and their parameters) invoked as part of an operation to
build up a tree that depicts the logical flow through the system when a customer makes a specific request.
Exceptions and warnings that the system generates as a result of this flow need to be captured and logged.
To support debugging, the system can provide hooks that enable an operator to capture state information at
crucial points in the system. Or, the system can deliver detailed step-by-step information as selected operations
progress. Capturing data at this level of detail can impose an additional load on the system and should be a
temporary process. An operator uses this process mainly when a highly unusual series of events occurs and is
difficult to replicate, or when a new release of one or more elements into a system requires careful monitoring to
ensure that the elements function as expected.

The monitoring and diagnostics pipeline


Monitoring a large-scale distributed system poses a significant challenge. Each of the scenarios described in the
previous section should not necessarily be considered in isolation. There is likely to be a significant overlap in the
monitoring and diagnostic data that's required for each situation, although this data might need to be processed
and presented in different ways. For these reasons, you should take a holistic view of monitoring and diagnostics.
You can envisage the entire monitoring and diagnostics process as a pipeline that comprises the stages shown in
Figure 1.

Figure 1 - The stages in the monitoring and diagnostics pipeline.


Figure 1 highlights how the data for monitoring and diagnostics can come from a variety of data sources. The
instrumentation and collection stages are concerned with identifying the sources from where the data needs to be
captured, determining which data to capture, how to capture it, and how to format this data so that it can be easily
examined. The analysis/diagnosis stage takes the raw data and uses it to generate meaningful information that an
operator can use to determine the state of the system. The operator can use this information to make decisions
about possible actions to take, and then feed the results back into the instrumentation and collection stages. The
visualization/alerting stage presents a consumable view of the system state. It can display information in
near real time by using a series of dashboards. And it can generate reports, graphs, and charts to provide a
historical view of the data that can help identify long-term trends. If information indicates that a KPI is likely to
exceed acceptable bounds, this stage can also trigger an alert to an operator. In some cases, an alert can also be
used to trigger an automated process that attempts to take corrective actions, such as autoscaling.
Note that these steps constitute a continuous-flow process where the stages are happening in parallel. Ideally, all
the phases should be dynamically configurable. At some points, especially when a system has been newly
deployed or is experiencing problems, it might be necessary to gather extended data on a more frequent basis. At
other times, it should be possible to revert to capturing a base level of essential information to verify that the
system is functioning properly.
Additionally, the entire monitoring process should be considered a live, ongoing solution that's subject to fine-
tuning and improvements as a result of feedback. For example, you might start with measuring many factors to
determine system health. Analysis over time might lead to a refinement as you discard measures that aren't
relevant, enabling you to more precisely focus on the data that you need while minimizing background noise.

Sources of monitoring and diagnostic data


The information that the monitoring process uses can come from several sources, as illustrated in Figure 1. At the
application level, information comes from trace logs incorporated into the code of the system. Developers should
follow a standard approach for tracking the flow of control through their code. For example, an entry to a method
can emit a trace message that specifies the name of the method, the current time, the value of each parameter,
and any other pertinent information. Recording the entry and exit times can also prove useful.
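For example, a small helper can emit these entry and exit traces consistently. The following minimal sketch (the TraceScope and OrderService types are illustrative and not part of any Azure library) writes entry and exit records, including the elapsed time, through System.Diagnostics.Trace:

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

// Minimal tracing scope: logs method entry on creation and exit (with elapsed time) on disposal.
public sealed class TraceScope : IDisposable
{
    private readonly string _method;
    private readonly Stopwatch _stopwatch = Stopwatch.StartNew();

    public TraceScope(string parameterInfo = "", [CallerMemberName] string method = "")
    {
        _method = method;
        Trace.TraceInformation("ENTER {0} at {1:o} {2}", _method, DateTime.UtcNow, parameterInfo);
    }

    public void Dispose()
    {
        _stopwatch.Stop();
        Trace.TraceInformation("EXIT {0} at {1:o} after {2} ms", _method, DateTime.UtcNow, _stopwatch.ElapsedMilliseconds);
    }
}

public class OrderService
{
    public void PlaceOrder(string orderId)
    {
        // Parameter values are included in the entry trace so an analyst can reconstruct the call later.
        using (new TraceScope($"orderId={orderId}"))
        {
            // ... business logic ...
        }
    }
}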
You should log all exceptions and warnings, and ensure that you retain a full trace of any nested exceptions and
warnings. Ideally, you should also capture information that identifies the user who is running the code, together
with activity correlation information (to track requests as they pass through the system). And you should log
attempts to access all resources such as message queues, databases, files, and other dependent services. This
information can be used for metering and auditing purposes.
Many applications use libraries and frameworks to perform common tasks such as accessing a data store or
communicating over a network. These frameworks might be configurable to provide their own trace messages
and raw diagnostic information, such as transaction rates and data transmission successes and failures.

NOTE
Many modern frameworks automatically publish performance and trace events. Capturing this information is simply a
matter of providing a means to retrieve and store it where it can be processed and analyzed.

The operating system where the application is running can be a source of low-level system-wide information,
such as performance counters that indicate I/O rates, memory utilization, and CPU usage. Operating system
errors (such as the failure to open a file correctly) might also be reported.
You should also consider the underlying infrastructure and components on which your system runs. Virtual
machines, virtual networks, and storage services can all be sources of important infrastructure-level performance
counters and other diagnostic data.
If your application uses other external services, such as a web server or database management system, these
services might publish their own trace information, logs, and performance counters. Examples include SQL Server
Dynamic Management Views for tracking operations performed against a SQL Server database, and IIS trace logs
for recording requests made to a web server.
As the components of a system are modified and new versions are deployed, it's important to be able to attribute
issues, events, and metrics to each version. This information should be tied back to the release pipeline so that
problems with a specific version of a component can be tracked quickly and rectified.
Security issues might occur at any point in the system. For example, a user might attempt to sign in with an
invalid user ID or password. An authenticated user might try to obtain unauthorized access to a resource. Or a
user might provide an invalid or outdated key to access encrypted information. Security-related information for
successful and failing requests should always be logged.
The section Instrumenting an application contains more guidance on the information that you should capture. But
you can use a variety of strategies to gather this information:
Application/system monitoring. This strategy uses internal sources within the application, application
frameworks, operating system, and infrastructure. The application code can generate its own monitoring
data at notable points during the lifecycle of a client request. The application can include tracing statements
that might be selectively enabled or disabled as circumstances dictate. It might also be possible to inject
diagnostics dynamically by using a diagnostics framework. These frameworks typically provide plug-ins
that can attach to various instrumentation points in your code and capture trace data at these points.
Additionally, your code and/or the underlying infrastructure might raise events at critical points.
Monitoring agents that are configured to listen for these events can record the event information.
Real user monitoring. This approach records the interactions between a user and the application and
observes the flow of each request and response. This information can have a two-fold purpose: it can be
used for metering usage by each user, and it can be used to determine whether users are receiving a
suitable quality of service (for example, fast response times, low latency, and minimal errors). You can use
the captured data to identify areas of concern where failures occur most often. You can also use the data to
identify elements where the system slows down, possibly due to hotspots in the application or some other
form of bottleneck. If you implement this approach carefully, it might be possible to reconstruct users'
flows through the application for debugging and testing purposes.

IMPORTANT
You should consider the data that's captured by monitoring real users to be highly sensitive because it might include
confidential material. If you save captured data, store it securely. If you want to use the data for performance
monitoring or debugging purposes, strip out all personally identifiable information first.

Synthetic user monitoring. In this approach, you write your own test client that simulates a user and
performs a configurable but typical series of operations. You can track the performance of the test client to
help determine the state of the system. You can also use multiple instances of the test client as part of a
load-testing operation to establish how the system responds under stress, and what sort of monitoring
output is generated under these conditions.

NOTE
You can implement real and synthetic user monitoring by including code that traces and times the execution of
method calls and other critical parts of an application.

Profiling. This approach is primarily targeted at monitoring and improving application performance. Rather than operating at the functional level of real and synthetic user monitoring, it captures lower-level information as the application runs. You can implement profiling by using periodic sampling of the execution state of an application (determining which piece of code the application is running at a given point in time). You can also use instrumentation that inserts probes into the code at important junctures (such as the start and end of a method call) and records which methods were invoked, at what time, and how long each call took. You can then analyze this data to determine which parts of the application might cause performance problems.
Endpoint monitoring. This technique uses one or more diagnostic endpoints that the application exposes specifically to enable monitoring. An endpoint provides a pathway into the application code and can return information about the health of the system. Different endpoints can focus on various aspects of the functionality. You can write your own diagnostics client that sends periodic requests to these endpoints and assimilate the responses. For more information, see the Health Endpoint Monitoring pattern. (A minimal sketch of such an endpoint appears after this list.)
For maximum coverage, you should use a combination of these techniques.
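As an illustration of endpoint monitoring, the following sketch exposes a health endpoint by using the ASP.NET Core health checks middleware. The path and the decision to rely on the built-in middleware are assumptions for this example; a production endpoint would typically aggregate checks for its critical dependencies (databases, queues, downstream services):

// Program.cs for a minimal ASP.NET Core service (illustrative sketch).
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Register the built-in health check services. Custom checks (for example, a database
// connectivity probe) can be added here.
builder.Services.AddHealthChecks();

var app = builder.Build();

// Expose a diagnostic endpoint that an external monitoring client can poll periodically.
app.MapHealthChecks("/health");

app.Run();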

Instrumenting an application
Instrumentation is a critical part of the monitoring process. You can make meaningful decisions about the
performance and health of a system only if you first capture the data that enables you to make these decisions.
The information that you gather by using instrumentation should be sufficient to enable you to assess
performance, diagnose problems, and make decisions without requiring you to sign in to a remote production
server to perform tracing (and debugging) manually. Instrumentation data typically comprises metrics and
information that's written to trace logs.
The contents of a trace log can be the result of textual data that's written by the application or binary data that's
created as the result of a trace event (if the application is using Event Tracing for Windows--ETW). They can also be
generated from system logs that record events arising from parts of the infrastructure, such as a web server.
Textual log messages are often designed to be human-readable, but they should also be written in a format that
enables an automated system to parse them easily.
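For instance, emitting each record as a single line of structured JSON satisfies both requirements. The following sketch uses System.Text.Json; the field names are illustrative rather than a prescribed schema:

using System;
using System.Text.Json;

public static class StructuredLogExample
{
    public static string FormatEntry(string level, string message, string activityId, long durationMs)
    {
        // Anonymous object serialized to a single JSON line: readable by people, parseable by tools.
        var entry = new
        {
            timestamp = DateTime.UtcNow,   // always UTC; serialized in ISO 8601 format
            level,
            message,
            activityId,                    // correlation identifier propagated with the request
            durationMs
        };
        return JsonSerializer.Serialize(entry);
    }
}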
You should also categorize logs. Don't write all trace data to a single log, but use separate logs to record the trace
output from different operational aspects of the system. You can then quickly filter log messages by reading from
the appropriate log rather than having to process a single lengthy file. Never write information that has different
security requirements (such as audit information and debugging data) to the same log.

NOTE
A log might be implemented as a file on the file system, or it might be held in some other format, such as a blob in blob
storage. Log information might also be held in more structured storage, such as rows in a table.

Metrics will generally be a measure or count of some aspect or resource in the system at a specific time, with one
or more associated tags or dimensions (sometimes called a sample). A single instance of a metric is usually not
useful in isolation. Instead, metrics have to be captured over time. The key issue to consider is which metrics you
should record and how frequently. Generating data for metrics too often can impose a significant additional load
on the system, whereas capturing metrics infrequently might cause you to miss the circumstances that lead to a
significant event. The considerations will vary from metric to metric. For example, CPU utilization on a server
might vary significantly from second to second, but high utilization becomes an issue only if it's long-lived over a
number of minutes.
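One way to represent such a sample is shown below; the type and field names are assumptions for this sketch, although most telemetry SDKs define an equivalent shape that pairs a value and timestamp with a set of dimensions:

using System;
using System.Collections.Generic;

// A single metric sample: a named measurement at a point in time, with optional dimensions.
public record MetricSample(
    string Name,
    double Value,
    DateTimeOffset Timestamp,
    IReadOnlyDictionary<string, string> Dimensions);

public static class MetricCapture
{
    // Example: a CPU reading captured once per sampling interval on a specific node.
    public static MetricSample CaptureCpu(double percent) =>
        new("cpu-percent", percent, DateTimeOffset.UtcNow,
            new Dictionary<string, string> { ["node"] = Environment.MachineName, ["region"] = "westus" });
}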
Information for correlating data
You can easily monitor individual system-level performance counters, capture metrics for resources, and obtain
application trace information from various log files. But some forms of monitoring require the analysis and
diagnostics stage in the monitoring pipeline to correlate the data that's retrieved from several sources. This data
might take several forms in the raw data, and the analysis process must be provided with sufficient
instrumentation data to be able to map these different forms. For example, at the application framework level, a
task might be identified by a thread ID. Within an application, the same work might be associated with the user ID
for the user who is performing that task.
Also, there's unlikely to be a 1:1 mapping between threads and user requests, because asynchronous operations
might reuse the same threads to perform operations on behalf of more than one user. To complicate matters
further, a single request might be handled by more than one thread as execution flows through the system. If
possible, associate each request with a unique activity ID that's propagated through the system as part of the
request context. (The technique for generating and including activity IDs in trace information depends on the
technology that's used to capture the trace data.)
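One technique on .NET is the System.Diagnostics.Activity type, sketched below; frameworks such as ASP.NET Core start an activity per request automatically, in which case you only need to read Activity.Current and include its ID in your trace records:

using System.Diagnostics;

public static class OrderProcessor
{
    public static void HandleRequest(string orderId)
    {
        // Starting an Activity generates a unique ID and links to any parent activity.
        var activity = new Activity("ProcessOrder");
        activity.Start();
        try
        {
            // Include the activity ID in every trace record written while handling this request,
            // and pass it to downstream services (for example, in a request header).
            Trace.TraceInformation("ActivityId={0} Processing order {1}", Activity.Current?.Id, orderId);
        }
        finally
        {
            activity.Stop();
        }
    }
}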
All monitoring data should be timestamped in the same way. For consistency, record all dates and times by using
Coordinated Universal Time. This will help you more easily trace sequences of events.

NOTE
Computers operating in different time zones and networks might not be synchronized. Don't depend on using timestamps
alone for correlating instrumentation data that spans multiple machines.

Information to include in the instrumentation data


Consider the following points when you're deciding which instrumentation data you need to collect:
Make sure that information captured by trace events is machine and human readable. Adopt well-defined
schemas for this information to facilitate automated processing of log data across systems, and to provide
consistency to operations and engineering staff reading the logs. Include environmental information, such
as the deployment environment, the machine on which the process is running, the details of the process,
and the call stack.
Enable profiling only when necessary because it can impose a significant overhead on the system. Profiling
by using instrumentation records an event (such as a method call) every time it occurs, whereas sampling
records only selected events. The selection can be time-based (once every n seconds), or frequency-based
(once every n requests). If events occur very frequently, profiling by instrumentation might cause too much
of a burden and itself affect overall performance. In this case, the sampling approach might be preferable.
However, if the frequency of events is low, sampling might miss them. In this case, instrumentation might
be the better approach.
Provide sufficient context to enable a developer or administrator to determine the source of each request.
This might include some form of activity ID that identifies a specific instance of a request. It might also
include information that can be used to correlate this activity with the computational work performed and
the resources used. Note that this work might cross process and machine boundaries. For metering, the
context should also include (either directly or indirectly via other correlated information) a reference to the
customer who caused the request to be made. This context provides valuable information about the
application state at the time that the monitoring data was captured.
Record all requests, and the locations or regions from which these requests are made. This information can
assist in determining whether there are any location-specific hotspots. This information can also be useful
in determining whether to repartition an application or the data that it uses.
Record and capture the details of exceptions carefully. Often, critical debug information is lost as a result of
poor exception handling. Capture the full details of exceptions that the application throws, including any
inner exceptions and other context information. Include the call stack if possible.
Be consistent in the data that the different elements of your application capture, because this can assist in
analyzing events and correlating them with user requests. Consider using a comprehensive and
configurable logging package to gather information, rather than depending on developers to adopt the
same approach as they implement different parts of the system. Gather data from key performance
counters, such as the volume of I/O being performed, network utilization, number of requests, memory
use, and CPU utilization. Some infrastructure services might provide their own specific performance
counters, such as the number of connections to a database, the rate at which transactions are being
performed, and the number of transactions that succeed or fail. Applications might also define their own
specific performance counters.
Log all calls made to external services, such as database systems, web services, or other system-level
services that are part of the infrastructure. Record information about the time taken to perform each call,
and the success or failure of the call. If possible, capture information about all retry attempts and failures
for any transient errors that occur.
Ensuring compatibility with telemetry systems
In many cases, the information that instrumentation produces is generated as a series of events and passed to a
separate telemetry system for processing and analysis. A telemetry system is typically independent of any specific
application or technology, but it expects information to follow a specific format that's usually defined by a schema.
The schema effectively specifies a contract that defines the data fields and types that the telemetry system can
ingest. The schema should be generalized to allow for data arriving from a range of platforms and devices.
A common schema should include fields that are common to all instrumentation events, such as the event name,
the event time, the IP address of the sender, and the details that are required for correlating with other events
(such as a user ID, a device ID, and an application ID). Remember that any number of devices might raise events,
so the schema should not depend on the device type. Additionally, various devices might raise events for the same
application; the application might support roaming or some other form of cross-device distribution.
The schema might also include domain fields that are relevant to a particular scenario that's common across
different applications. This might be information about exceptions, application start and end events, and success
and/or failure of web service API calls. All applications that use the same set of domain fields should emit the
same set of events, enabling a set of common reports and analytics to be built.
Finally, a schema might contain custom fields for capturing the details of application-specific events.
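The following sketch shows what such a schema might look like as a .NET type. The field names and the choice of a web service call as the domain scenario are assumptions for illustration, not a published schema:

using System;
using System.Collections.Generic;

public class TelemetryEvent
{
    // Common fields present on every event, regardless of device or application.
    public string EventName { get; set; }
    public DateTimeOffset EventTime { get; set; }
    public string IpAddress { get; set; }
    public string UserId { get; set; }
    public string DeviceId { get; set; }
    public string ApplicationId { get; set; }

    // Domain fields shared by every application that reports this scenario (here, a service call).
    public string OperationName { get; set; }
    public int? DurationMs { get; set; }
    public bool? Success { get; set; }

    // Custom fields specific to a single application.
    public IDictionary<string, string> CustomProperties { get; set; } = new Dictionary<string, string>();
}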
Best practices for instrumenting applications
The following list summarizes best practices for instrumenting a distributed application running in the cloud.
Make logs easy to read and easy to parse. Use structured logging where possible. Be concise and
descriptive in log messages.
In all logs, identify the source and provide context and timing information as each log record is written.
Use the same time zone and format for all timestamps. This will help to correlate events for operations that
span hardware and services running in different geographic regions.
Categorize logs and write messages to the appropriate log file.
Do not disclose sensitive information about the system or personal information about users. Scrub this
information before it's logged, but ensure that the relevant details are retained. For example, remove the ID
and password from any database connection strings, but write the remaining information to the log so that
an analyst can determine that the system is accessing the correct database. Log all critical exceptions, but
enable the administrator to turn logging on and off for lower levels of exceptions and warnings. Also,
capture and log all retry logic information. This data can be useful in monitoring the transient health of the
system.
Trace out of process calls, such as requests to external web services or databases.
Don't mix log messages with different security requirements in the same log file. For example, don't write
debug and audit information to the same log.
With the exception of auditing events, make sure that all logging calls are fire-and-forget operations that
do not block the progress of business operations. Auditing events are exceptional because they are critical
to the business and can be classified as a fundamental part of business operations.
Make sure that logging is extensible and does not have any direct dependencies on a concrete target. For example, rather than writing information by using System.Diagnostics.Trace, define an abstract interface (such as ILogger) that exposes logging methods and that can be implemented through any appropriate means. (A sketch of such an abstraction appears after this list.)
Make sure that all logging is fail-safe and never triggers any cascading errors. Logging must not throw any
exceptions.
Treat instrumentation as an ongoing iterative process and review logs regularly, not just when there is a
problem.
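The following sketch illustrates the abstraction mentioned above. The interface shape is deliberately simplified and is not the Microsoft.Extensions.Logging.ILogger API; the point is that application code depends only on the contract, and that the implementation is fail-safe:

using System;
using System.Diagnostics;

// An abstract logging contract; the concrete target (Trace, a file, a telemetry service) can be swapped out.
public interface ILogger
{
    void Info(string message);
    void Error(string message, Exception exception = null);
}

// One possible implementation that forwards to System.Diagnostics.Trace.
public class TraceLogger : ILogger
{
    public void Info(string message)
    {
        try { Trace.TraceInformation(message); } catch { /* logging must never fail the caller */ }
    }

    public void Error(string message, Exception exception = null)
    {
        try { Trace.TraceError("{0} {1}", message, exception); } catch { /* logging must never fail the caller */ }
    }
}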

Collecting and storing data


The collection stage of the monitoring process is concerned with retrieving the information that instrumentation
generates, formatting this data to make it easier for the analysis/diagnosis stage to consume, and saving the
transformed data in reliable storage. The instrumentation data that you gather from different parts of a
distributed system can be held in a variety of locations and with varying formats. For example, your application
code might generate trace log files and generate application event log data, whereas performance counters that
monitor key aspects of the infrastructure that your application uses can be captured through other technologies.
Any third-party components and services that your application uses might provide instrumentation information
in different formats, by using separate trace files, blob storage, or even a custom data store.
Data collection is often performed through a collection service that can run autonomously from the application
that generates the instrumentation data. Figure 2 illustrates an example of this architecture, highlighting the
instrumentation data-collection subsystem.

Figure 2 - Collecting instrumentation data.


Note that this is a simplified view. The collection service is not necessarily a single process and might comprise
many constituent parts running on different machines, as described in the following sections. Additionally, if the
analysis of some telemetry data must be performed quickly (hot analysis, as described in the section Supporting
hot, warm, and cold analysis later in this document), local components that operate outside the collection service
might perform the analysis tasks immediately. Figure 2 depicts this situation for selected events. After analytical
processing, the results can be sent directly to the visualization and alerting subsystem. Data that's subjected to
warm or cold analysis is held in storage while it awaits processing.
For Azure applications and services, Azure Diagnostics provides one possible solution for capturing data. Azure
Diagnostics gathers data from the following sources for each compute node, aggregates it, and then uploads it to
Azure Storage:
IIS logs
IIS Failed Request logs
Windows event logs
Performance counters
Crash dumps
Azure Diagnostics infrastructure logs
Custom error logs
.NET EventSource
Manifest-based ETW
For more information, see the article Azure: Telemetry Basics and Troubleshooting.
Strategies for collecting instrumentation data
Considering the elastic nature of the cloud, and to avoid the necessity of manually retrieving telemetry data from
every node in the system, you should arrange for the data to be transferred to a central location and consolidated.
In a system that spans multiple datacenters, it might be useful to first collect, consolidate, and store data on a
region-by-region basis, and then aggregate the regional data into a single central system.
To optimize the use of bandwidth, you can elect to transfer less urgent data in chunks, as batches. However, the
data must not be delayed indefinitely, especially if it contains time-sensitive information.
Pulling and pushing instrumentation data
The instrumentation data-collection subsystem can actively retrieve instrumentation data from the various logs
and other sources for each instance of the application (the pull model). Or, it can act as a passive receiver that
waits for the data to be sent from the components that constitute each instance of the application (the push
model).
One approach to implementing the pull model is to use monitoring agents that run locally with each instance of
the application. A monitoring agent is a separate process that periodically retrieves (pulls) telemetry data collected
at the local node and writes this information directly to centralized storage that all instances of the application
share. This is the mechanism that Azure Diagnostics implements. Each instance of an Azure web or worker role
can be configured to capture diagnostic and other trace information that's stored locally. The monitoring agent
that runs alongside each instance copies the specified data to Azure Storage. The article Enabling Diagnostics in
Azure Cloud Services and Virtual Machines provides more details on this process. Some elements, such as IIS
logs, crash dumps, and custom error logs, are written to blob storage. Data from the Windows event log, ETW
events, and performance counters is recorded in table storage. Figure 3 illustrates this mechanism.

Figure 3 - Using a monitoring agent to pull information and write to shared storage.
NOTE
Using a monitoring agent is ideally suited to capturing instrumentation data that's naturally pulled from a data source. An
example is information from SQL Server Dynamic Management Views or the length of an Azure Service Bus queue.

It's feasible to use the approach just described to store telemetry data for a small-scale application running on a
limited number of nodes in a single location. However, a complex, highly scalable, global cloud application might
generate huge volumes of data from hundreds of web and worker roles, database shards, and other services. This
flood of data can easily overwhelm the I/O bandwidth available with a single, central location. Therefore, your
telemetry solution must be scalable to prevent it from acting as a bottleneck as the system expands. Ideally, your
solution should incorporate a degree of redundancy to reduce the risks of losing important monitoring
information (such as auditing or billing data) if part of the system fails.
To address these issues, you can implement queuing, as shown in Figure 4. In this architecture, the local
monitoring agent (if it can be configured appropriately) or custom data-collection service (if not) posts data to a
queue. A separate process running asynchronously (the storage writing service in Figure 4) takes the data in this
queue and writes it to shared storage. A message queue is suitable for this scenario because it provides "at least
once" semantics that help ensure that queued data will not be lost after it's posted. You can implement the storage
writing service by using a separate worker role.

Figure 4 - Using a queue to buffer instrumentation data.


The local data-collection service can add data to a queue immediately after it's received. The queue acts as a
buffer, and the storage writing service can retrieve and write the data at its own pace. By default, a queue operates
on a first-in, first-out basis. But you can prioritize messages to accelerate them through the queue if they contain
data that must be handled more quickly. For more information, see the Priority Queue pattern. Alternatively, you
can use different channels (such as Service Bus topics) to direct data to different destinations depending on the
form of analytical processing that's required.
For scalability, you can run multiple instances of the storage writing service. If there is a high volume of events,
you can use an event hub to dispatch the data to different compute resources for processing and storage.
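As a concrete illustration of the queue-based approach, the following sketch posts serialized telemetry events to an Azure Storage queue by using the Azure.Storage.Queues client library. The queue name and the shape of the event are assumptions for the example, and Service Bus or Event Hubs could be substituted for higher volumes:

using System.Text.Json;
using System.Threading.Tasks;
using Azure.Storage.Queues;

public class TelemetryQueuePublisher
{
    private readonly QueueClient _queue;

    public TelemetryQueuePublisher(string connectionString)
    {
        // A dedicated queue buffers instrumentation data for the storage writing service.
        _queue = new QueueClient(connectionString, "instrumentation-events");
    }

    public async Task PublishAsync(object telemetryEvent)
    {
        await _queue.CreateIfNotExistsAsync();
        // At-least-once delivery: the storage writing service must tolerate occasional duplicates.
        await _queue.SendMessageAsync(JsonSerializer.Serialize(telemetryEvent));
    }
}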
Consolidating instrumentation data
The instrumentation data that the data-collection service retrieves from a single instance of an application gives a
localized view of the health and performance of that instance. To assess the overall health of the system, it's
necessary to consolidate some aspects of the data in the local views. You can perform this after the data has been
stored, but in some cases, you can also achieve it as the data is collected. Rather than being written directly to
shared storage, the instrumentation data can pass through a separate data consolidation service that combines
data and acts as a filter and cleanup process. For example, instrumentation data that includes the same correlation
information such as an activity ID can be amalgamated. (It's possible that a user starts performing a business
operation on one node and then gets transferred to another node in the event of node failure, or depending on
how load balancing is configured.) This process can also detect and remove any duplicated data (always a
possibility if the telemetry service uses message queues to push instrumentation data out to storage). Figure 5
illustrates an example of this structure.

Figure 5 - Using a separate service to consolidate and clean up instrumentation data.


Storing instrumentation data
The previous discussions have depicted a rather simplistic view of the way in which instrumentation data is
stored. In reality, it can make sense to store the different types of information by using technologies that are most
appropriate to the way in which each type is likely to be used.
For example, Azure blob and table storage have some similarities in the way in which they're accessed. But they
have limitations in the operations that you can perform by using them, and the granularity of the data that they
hold is quite different. If you need to perform more analytical operations or require full-text search capabilities on
the data, it might be more appropriate to use data storage that provides capabilities that are optimized for specific
types of queries and data access. For example:
Performance counter data can be stored in a SQL database to enable ad hoc analysis.
Trace logs might be better stored in Azure Cosmos DB.
Security information can be written to HDFS.
Information that requires full-text search can be stored through Elasticsearch (which can also speed searches
by using rich indexing).
You can implement an additional service that periodically retrieves the data from shared storage, partitions and
filters the data according to its purpose, and then writes it to an appropriate set of data stores as shown in Figure
6. An alternative approach is to include this functionality in the consolidation and cleanup process and write the
data directly to these stores as it's retrieved rather than saving it in an intermediate shared storage area. Each
approach has its advantages and disadvantages. Implementing a separate partitioning service lessens the load on
the consolidation and cleanup service, and it enables at least some of the partitioned data to be regenerated if
necessary (depending on how much data is retained in shared storage). However, it consumes additional
resources. Also, there might be a delay between the receipt of instrumentation data from each application
instance and the conversion of this data into actionable information.

Figure 6 - Partitioning data according to analytical and storage requirements.


The same instrumentation data might be required for more than one purpose. For example, performance counters
can be used to provide a historical view of system performance over time. This information might be combined
with other usage data to generate customer billing information. In these situations, the same data might be sent
to more than one destination, such as a document database that can act as a long-term store for holding billing
information, and a multidimensional store for handling complex performance analytics.
You should also consider how urgently the data is required. Data that provides information for alerting must be
accessed quickly, so it should be held in fast data storage and indexed or structured to optimize the queries that
the alerting system performs. In some cases, it might be necessary for the telemetry service that gathers the data
on each node to format and save data locally so that a local instance of the alerting system can quickly notify you
of any issues. The same data can be dispatched to the storage writing service shown in the previous diagrams and
stored centrally if it's also required for other purposes.
Information that's used for more considered analysis, for reporting, and for spotting historical trends is less
urgent and can be stored in a manner that supports data mining and ad hoc queries. For more information, see
the section Supporting hot, warm, and cold analysis later in this document.
Log rotation and data retention
Instrumentation can generate considerable volumes of data. This data can be held in several places, starting with
the raw log files, trace files, and other information captured at each node to the consolidated, cleaned, and
partitioned view of this data held in shared storage. In some cases, after the data has been processed and
transferred, the original raw source data can be removed from each node. In other cases, it might be necessary or
simply useful to save the raw information. For example, data that's generated for debugging purposes might be
best left available in its raw form but can then be discarded quickly after any bugs have been rectified.
Performance data often has a longer life so that it can be used for spotting performance trends and for capacity
planning. The consolidated view of this data is usually kept online for a finite period to enable fast access. After
that, it can be archived or discarded. Data gathered for metering and billing customers might need to be saved
indefinitely. Additionally, regulatory requirements might dictate that information collected for auditing and
security purposes also needs to be archived and saved. This data is also sensitive and might need to be encrypted
or otherwise protected to prevent tampering. You should never record users' passwords or other information that
might be used to commit identity fraud. Such details should be scrubbed from the data before it's stored.
Down-sampling
It's useful to store historical data so you can spot long-term trends. Rather than saving old data in its entirety, it
might be possible to down-sample the data to reduce its resolution and save storage costs. As an example, rather
than saving minute-by-minute performance indicators, you can consolidate data that's more than a month old to
form an hour-by-hour view.
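A minimal sketch of this kind of down-sampling is shown below; the MetricPoint type is illustrative, and the aggregation (an hourly average) is just one possible roll-up:

using System;
using System.Collections.Generic;
using System.Linq;

public record MetricPoint(DateTimeOffset Timestamp, double Value);

public static class DownSampler
{
    // Collapse minute-by-minute samples into hour-by-hour averages to reduce storage costs.
    public static IEnumerable<MetricPoint> ToHourly(IEnumerable<MetricPoint> perMinute) =>
        perMinute
            .GroupBy(p => new DateTimeOffset(p.Timestamp.Year, p.Timestamp.Month, p.Timestamp.Day,
                                             p.Timestamp.Hour, 0, 0, p.Timestamp.Offset))
            .Select(g => new MetricPoint(g.Key, g.Average(p => p.Value)));
}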
Best practices for collecting and storing logging information
The following list summarizes best practices for capturing and storing logging information:
The monitoring agent or data-collection service should run as an out-of-process service and should be
simple to deploy.
All output from the monitoring agent or data-collection service should be in an agnostic format that's
independent of the machine, operating system, or network protocol. For example, emit information in a
self-describing format such as JSON, MessagePack, or Protobuf rather than ETL/ETW. Using a standard
format enables the system to construct processing pipelines; components that read, transform, and send
data in the agreed format can be easily integrated.
The monitoring and data-collection process must be fail-safe and must not trigger any cascading error
conditions.
In the event of a transient failure in sending information to a data sink, the monitoring agent or data-
collection service should be prepared to reorder telemetry data so that the newest information is sent first.
(The monitoring agent/data-collection service might elect to drop the older data, or save it locally and
transmit it later to catch up, at its own discretion.)

Analyzing data and diagnosing issues


An important part of the monitoring and diagnostics process is analyzing the gathered data to obtain a picture of
the overall well-being of the system. You should have defined your own KPIs and performance metrics, and it's
important to understand how you can structure the data that has been gathered to meet your analysis
requirements. It's also important to understand how the data that's captured in different metrics and log files is
correlated, because this information can be key to tracking a sequence of events and help diagnose problems that
arise.
As described in the section Consolidating instrumentation data, the data for each part of the system is typically
captured locally, but it generally needs to be combined with data generated at other sites that participate in the
system. This information requires careful correlation to ensure that data is combined accurately. For example, the
usage data for an operation might span a node that hosts a website to which a user connects, a node that runs a
separate service accessed as part of this operation, and data storage held on another node. This information
needs to be tied together to provide an overall view of the resource and processing usage for the operation. Some
preprocessing and filtering of data might occur on the node on which the data is captured, whereas aggregation
and formatting are more likely to occur on a central node.
Supporting hot, warm, and cold analysis
Analyzing and reformatting data for visualization, reporting, and alerting purposes can be a complex process that
consumes its own set of resources. Some forms of monitoring are time-critical and require immediate analysis of
data to be effective. This is known as hot analysis. Examples include the analyses that are required for alerting and
some aspects of security monitoring (such as detecting an attack on the system). Data that's required for these
purposes must be quickly available and structured for efficient processing. In some cases, it might be necessary to
move the analysis processing to the individual nodes where the data is held.
Other forms of analysis are less time-critical and might require some computation and aggregation after the raw
data has been received. This is called warm analysis. Performance analysis often falls into this category. In this
case, an isolated, single performance event is unlikely to be statistically significant. (It might be caused by a
sudden spike or glitch.) The data from a series of events should provide a more reliable picture of system
performance.
Warm analysis can also be used to help diagnose health issues. A health event is typically processed through hot
analysis and can raise an alert immediately. An operator should be able to drill into the reasons for the health
event by examining the data from the warm path. This data should contain information about the events leading
up to the issue that caused the health event.
Some types of monitoring generate more long-term data. This analysis can be performed at a later date, possibly
according to a predefined schedule. In some cases, the analysis might need to perform complex filtering of large
volumes of data captured over a period of time. This is called cold analysis. The key requirement is that the data is
stored safely after it has been captured. For example, usage monitoring and auditing require an accurate picture
of the state of the system at regular points in time, but this state information does not have to be available for
processing immediately after it has been gathered.
An operator can also use cold analysis to provide the data for predictive health analysis. The operator can gather
historical information over a specified period and use it in conjunction with the current health data (retrieved from
the hot path) to spot trends that might soon cause health issues. In these cases, it might be necessary to raise an
alert so that corrective action can be taken.
Correlating data
The data that instrumentation captures can provide a snapshot of the system state, but the purpose of analysis is
to make this data actionable. For example:
What has caused an intense I/O loading at the system level at a specific time?
Is it the result of a large number of database operations?
Is this reflected in the database response times, the number of transactions per second, and application
response times at the same juncture?
If so, one remedial action that might reduce the load might be to shard the data over more servers. In addition,
exceptions can arise as a result of a fault in any level of the system. An exception in one level often triggers
another fault in the level above.
For these reasons, you need to be able to correlate the different types of monitoring data at each level to produce
an overall view of the state of the system and the applications that are running on it. You can then use this
information to make decisions about whether the system is functioning acceptably or not, and determine what
can be done to improve the quality of the system.
As described in the section Information for correlating data, you must ensure that the raw instrumentation data
includes sufficient context and activity ID information to support the required aggregations for correlating events.
Additionally, this data might be held in different formats, and it might be necessary to parse this information to
convert it into a standardized format for analysis.
Troubleshooting and diagnosing issues
Diagnosis requires the ability to determine the cause of faults or unexpected behavior, including performing root
cause analysis. The information that's required typically includes:
Detailed information from event logs and traces, either for the entire system or for a specified subsystem
during a specified time window.
Complete stack traces resulting from exceptions and faults of any specified level that occur within the system
or a specified subsystem during a specified period.
Crash dumps for any failed processes either anywhere in the system or for a specified subsystem during a
specified time window.
Activity logs recording the operations that are performed either by all users or for selected users during a
specified period.
Analyzing data for troubleshooting purposes often requires a deep technical understanding of the system
architecture and the various components that compose the solution. As a result, a large degree of manual
intervention is often required to interpret the data, establish the cause of problems, and recommend an
appropriate strategy to correct them. It might be appropriate simply to store a copy of this information in its
original format and make it available for cold analysis by an expert.

Visualizing data and raising alerts


An important aspect of any monitoring system is the ability to present the data in such a way that an operator can
quickly spot any trends or problems. Also important is the ability to quickly inform an operator if a significant
event has occurred that might require attention.
Data presentation can take several forms, including visualization by using dashboards, alerting, and reporting.
Visualization by using dashboards
The most common way to visualize data is to use dashboards that can display information as a series of charts,
graphs, or some other illustration. These items can be parameterized, and an analyst should be able to select the
important parameters (such as the time period) for any specific situation.
Dashboards can be organized hierarchically. Top-level dashboards can give an overall view of each aspect of the
system but enable an operator to drill down to the details. For example, a dashboard that depicts the overall disk
I/O for the system should allow an analyst to view the I/O rates for each individual disk to ascertain whether one
or more specific devices account for a disproportionate volume of traffic. Ideally, the dashboard should also
display related information, such as the source of each request (the user or activity) that's generating this I/O. This
information can then be used to determine whether (and how) to spread the load more evenly across devices, and
whether the system would perform better if more devices were added.
A dashboard might also use color-coding or some other visual cues to indicate values that appear anomalous or
that are outside an expected range. Using the previous example:
A disk with an I/O rate that's approaching its maximum capacity over an extended period (a hot disk) can be
highlighted in red.
A disk with an I/O rate that periodically runs at its maximum limit over short periods (a warm disk) can be
highlighted in yellow.
A disk that's exhibiting normal usage can be displayed in green.
Note that for a dashboard system to work effectively, it must have the raw data to work with. If you are building
your own dashboard system, or using a dashboard developed by another organization, you must understand
which instrumentation data you need to collect, at what levels of granularity, and how it should be formatted for
the dashboard to consume.
A good dashboard does more than just display information; it also enables an analyst to pose ad hoc questions about
that information. Some systems provide management tools that an operator can use to perform these tasks and
explore the underlying data. Alternatively, depending on the repository that's used to hold this information, it
might be possible to query this data directly, or import it into tools such as Microsoft Excel for further analysis
and reporting.

NOTE
You should restrict access to dashboards to authorized personnel, because this information might be commercially sensitive.
You should also protect the underlying data for dashboards to prevent users from changing it.

Raising alerts
Alerting is the process of analyzing the monitoring and instrumentation data and generating a notification if a
significant event is detected.
Alerting helps ensure that the system remains healthy, responsive, and secure. It's an important part of any system that makes performance, availability, and privacy guarantees to its users, because the data that supports those guarantees might need to be acted on immediately. An operator might need to be notified of the event that triggered the alert. Alerting can also
be used to invoke system functions such as autoscaling.
Alerting usually depends on the following instrumentation data:
Security events. If the event logs indicate that repeated authentication and/or authorization failures are occurring, the system might be under attack and an operator should be informed.
Performance metrics. The system must quickly respond if a particular performance metric exceeds a specified threshold.
Availability information. If a fault is detected, it might be necessary to quickly restart one or more subsystems, or fail over to a backup resource. Repeated faults in a subsystem might indicate more serious concerns.
Operators might receive alert information by using many delivery channels such as email, a pager device, or an
SMS text message. An alert might also include an indication of how critical a situation is. Many alerting systems
support subscriber groups, and all operators who are members of the same group can receive the same set of
alerts.
An alerting system should be customizable, and the appropriate values from the underlying instrumentation data
can be provided as parameters. This approach enables an operator to filter data and focus on those thresholds or
combinations of values that are of interest. Note that in some cases, the raw instrumentation data can be provided
to the alerting system. In other situations, it might be more appropriate to supply aggregated data. (For example,
an alert can be triggered if the CPU utilization for a node has exceeded 90 percent over the last 10 minutes). The
details provided to the alerting system should also include any appropriate summary and context information.
This data can help reduce the possibility that false-positive events will trip an alert.
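The following sketch shows how such an aggregated condition might be evaluated over a sliding window. The threshold, window, and sample shape are illustrative, and in practice a service such as Azure Monitor evaluates rules like this for you:

using System;
using System.Collections.Generic;
using System.Linq;

public static class AlertEvaluator
{
    // True if the average CPU over the window exceeds the threshold (for example, 90% over 10 minutes).
    public static bool ShouldAlert(
        IReadOnlyList<(DateTimeOffset Timestamp, double CpuPercent)> samples,
        double threshold = 90,
        TimeSpan? window = null)
    {
        var cutoff = DateTimeOffset.UtcNow - (window ?? TimeSpan.FromMinutes(10));
        var recent = samples.Where(s => s.Timestamp >= cutoff).ToList();

        // Require more than one sample so that a single spike does not trip the alert.
        return recent.Count > 1 && recent.Average(s => s.CpuPercent) > threshold;
    }
}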
Reporting
Reporting is used to generate an overall view of the system. It might incorporate historical data in addition to
current information. Reporting requirements themselves fall into two broad categories: operational reporting and
security reporting.
Operational reporting typically includes the following aspects:
Aggregating statistics that you can use to understand resource utilization of the overall system or specified
subsystems during a specified time window.
Identifying trends in resource usage for the overall system or specified subsystems during a specified period.
Monitoring the exceptions that have occurred throughout the system or in specified subsystems during a
specified period.
Determining the efficiency of the application in terms of the deployed resources, and understanding whether
the volume of resources (and their associated cost) can be reduced without affecting performance
unnecessarily.
Security reporting is concerned with tracking customers' use of the system. It can include:
Auditing user operations. This requires recording the individual requests that each user performs, together with dates and times. The data should be structured to enable an administrator to quickly reconstruct the sequence of operations that a user performs over a specified period.
Tracking resource use by user. This requires recording how each request for a user accesses the various resources that compose the system, and for how long. An administrator must be able to use this data to generate a utilization report by user over a specified period, possibly for billing purposes.
In many cases, batch processes can generate reports according to a defined schedule. (Latency is not normally an
issue.) But they should also be available for generation on an ad hoc basis if needed. As an example, if you are
storing data in a relational database such as Azure SQL Database, you can use a tool such as SQL Server
Reporting Services to extract and format data and present it as a set of reports.

Related patterns and guidance


Autoscaling guidance describes how to decrease management overhead by reducing the need for an operator
to continually monitor the performance of a system and make decisions about adding or removing resources.
Health Endpoint Monitoring pattern describes how to implement functional checks within an application that
external tools can access through exposed endpoints at regular intervals.
Priority Queue pattern shows how to prioritize queued messages so that urgent requests are received and can
be processed before less urgent messages.

Next steps
Azure Monitor overview
Monitor, diagnose, and troubleshoot Microsoft Azure Storage
Overview of alerts in Microsoft Azure
View service health notifications by using the Azure portal
What is Application Insights?
Performance diagnostics for Azure virtual machines
Download and install SQL Server Data Tools (SSDT) for Visual Studio
Retry guidance for Azure services

Most Azure services and client SDKs include a retry mechanism. However, these differ because each service has
different characteristics and requirements, and so each retry mechanism is tuned to a specific service. This guide
summarizes the retry mechanism features for the majority of Azure services, and includes information to help you
use, adapt, or extend the retry mechanism for that service.
For general guidance on handling transient faults, and retrying connections and operations against services and
resources, see Retry guidance.
The following table summarizes the retry features for the Azure services described in this guidance.

| Service | Retry capabilities | Policy configuration | Scope | Telemetry features |
| --- | --- | --- | --- | --- |
| Azure Active Directory | Native in ADAL library | Embedded into ADAL library | Internal | None |
| Cosmos DB | Native in service | Non-configurable | Global | TraceSource |
| Data Lake Store | Native in client | Non-configurable | Individual operations | None |
| Event Hubs | Native in client | Programmatic | Client | None |
| IoT Hub | Native in client SDK | Programmatic | Client | None |
| Azure Cache for Redis | Native in client | Programmatic | Client | TextWriter |
| Search | Native in client | Programmatic | Client | ETW or Custom |
| Service Bus | Native in client | Programmatic | Namespace Manager, Messaging Factory, and Client | ETW |
| Service Fabric | Native in client | Programmatic | Client | None |
| SQL Database with ADO.NET | Polly | Declarative and programmatic | Single statements or blocks of code | Custom |
| SQL Database with Entity Framework | Native in client | Programmatic | Global per AppDomain | None |
| SQL Database with Entity Framework Core | Native in client | Programmatic | Global per AppDomain | None |
| Storage | Native in client | Programmatic | Client and individual operations | TraceSource |
NOTE
For most of the Azure built-in retry mechanisms, there is currently no way to apply a different retry policy for different types of
error or exception. You should configure a policy that provides the optimum average performance and availability. One way
to fine-tune the policy is to analyze log files to determine the type of transient faults that are occurring.

Azure Active Directory


Azure Active Directory (Azure AD) is a comprehensive identity and access management cloud solution that
combines core directory services, advanced identity governance, security, and application access management.
Azure AD also offers developers an identity management platform to deliver access control to their applications,
based on centralized policy and rules.

NOTE
For retry guidance on Managed Service Identity endpoints, see How to use an Azure VM Managed Service Identity (MSI)
for token acquisition.

Retry mechanism
There is a built-in retry mechanism for Azure Active Directory in the Active Directory Authentication Library
(ADAL). To avoid unexpected lockouts, we recommend that third-party libraries and application code do not retry
failed connections, but allow ADAL to handle retries.
Retry usage guidance
Consider the following guidelines when using Azure Active Directory:
When possible, use the ADAL library and the built-in support for retries.
If you are using the REST API for Azure Active Directory, retry the operation if the result code is 429 (Too Many
Requests) or an error in the 5xx range. Do not retry for any other errors.
For 429 errors, only retry after the time indicated in the Retry-After header.
For 5xx errors, use exponential back-off, with the first retry at least 5 seconds after the response.
Do not retry on errors other than 429 and 5xx. (A sketch of this retry logic appears after this list.)
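The following sketch applies these rules when calling the REST API directly with HttpClient; the retry count and base delay are assumptions for the example:

using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class AadRestRetry
{
    // Retries 429 responses after the Retry-After interval and 5xx responses with exponential
    // back-off (first retry at least 5 seconds after the response); all other results are returned as-is.
    public static async Task<HttpResponseMessage> SendWithRetryAsync(
        HttpClient client, Func<HttpRequestMessage> createRequest, int maxRetries = 3)
    {
        for (int attempt = 0; ; attempt++)
        {
            HttpResponseMessage response = await client.SendAsync(createRequest());

            bool throttled = (int)response.StatusCode == 429;
            bool serverError = (int)response.StatusCode >= 500;

            if (!(throttled || serverError) || attempt >= maxRetries)
                return response;

            TimeSpan delay = throttled && response.Headers.RetryAfter?.Delta != null
                ? response.Headers.RetryAfter.Delta.Value                 // honor Retry-After for 429
                : TimeSpan.FromSeconds(5 * Math.Pow(2, attempt));         // exponential back-off for 5xx

            await Task.Delay(delay);
        }
    }
}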
More information
Azure Active Directory Authentication Libraries

Cosmos DB
Cosmos DB is a fully managed multi-model database that supports schemaless JSON data. It offers configurable
and reliable performance, native JavaScript transactional processing, and is built for the cloud with elastic scale.
Retry mechanism
The DocumentClient class automatically retries failed attempts. To set the number of retries and the maximum wait
time, configure ConnectionPolicy.RetryOptions. Exceptions that the client raises are either beyond the retry policy
or are not transient errors.
If Cosmos DB throttles the client, it returns an HTTP 429 error. Check the status code in the
DocumentClientException.

Policy configuration
The following table shows the default settings for the RetryOptions class.
| Setting | Default value | Description |
| --- | --- | --- |
| MaxRetryAttemptsOnThrottledRequests | 9 | The maximum number of retries if the request fails because Cosmos DB applied rate limiting on the client. |
| MaxRetryWaitTimeInSeconds | 30 | The maximum retry time in seconds. |

Example

DocumentClient client = new DocumentClient(new Uri(endpoint), authKey);


var options = client.ConnectionPolicy.RetryOptions;
options.MaxRetryAttemptsOnThrottledRequests = 5;
options.MaxRetryWaitTimeInSeconds = 15;

Telemetry
Retry attempts are logged as unstructured trace messages through a .NET TraceSource. You must configure a
TraceListener to capture the events and write them to a suitable destination log.
For example, if you add the following to your App.config file, traces will be generated in a text file in the same
location as the executable:

<configuration>
  <system.diagnostics>
    <switches>
      <add name="SourceSwitch" value="Verbose"/>
    </switches>
    <sources>
      <source name="DocDBTrace" switchName="SourceSwitch" switchType="System.Diagnostics.SourceSwitch">
        <listeners>
          <add name="MyTextListener" type="System.Diagnostics.TextWriterTraceListener"
               traceOutputOptions="DateTime,ProcessId,ThreadId" initializeData="CosmosDBTrace.txt"></add>
        </listeners>
      </source>
    </sources>
  </system.diagnostics>
</configuration>

Event Hubs
Azure Event Hubs is a hyperscale telemetry ingestion service that collects, transforms, and stores millions of
events.
Retry mechanism
Retry behavior in the Azure Event Hubs Client Library is controlled by the RetryPolicy property on the
EventHubClient class. The default policy retries with exponential backoff when Azure Event Hubs returns a
transient EventHubsException or an OperationCanceledException. The default retry policy for Event Hubs is to retry up
to 9 times with an exponential back-off time of up to 30 seconds.
Example

EventHubClient client = EventHubClient.CreateFromConnectionString("[event_hub_connection_string]");


client.RetryPolicy = RetryPolicy.Default;

More information
.NET Standard client library for Azure Event Hubs
IoT Hub
Azure IoT Hub is a service for connecting, monitoring, and managing devices to develop Internet of Things (IoT)
applications.
Retry mechanism
The Azure IoT device SDK can detect errors in the network, protocol, or application. Based on the error type, the
SDK checks whether a retry needs to be performed. If the error is recoverable, the SDK begins to retry using the
configured retry policy.
The default retry policy is exponential back-off with random jitter, but it can be configured.
Policy configuration
Policy configuration differs by language. For more details, see IoT Hub retry policy configuration.
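For example, with the .NET device SDK, a sketch of configuring the policy might look like the following. This assumes the Microsoft.Azure.Devices.Client package; the connection string is a placeholder and the values shown are illustrative, not recommendations:

using System;
using Microsoft.Azure.Devices.Client;

// Create the device client and replace the default retry policy with an explicit
// exponential back-off: up to 10 attempts, delays between 100 ms and 10 seconds.
DeviceClient deviceClient = DeviceClient.CreateFromConnectionString("<device_connection_string>");

deviceClient.SetRetryPolicy(
    new ExponentialBackoff(
        retryCount: 10,
        minBackoff: TimeSpan.FromMilliseconds(100),
        maxBackoff: TimeSpan.FromSeconds(10),
        deltaBackoff: TimeSpan.FromMilliseconds(100)));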
More information
IoT Hub retry policy
Troubleshoot IoT Hub device disconnection

Azure Cache for Redis


Azure Cache for Redis is a fast data access and low latency cache service based on the popular open-source Redis
cache. It is secure, managed by Microsoft, and is accessible from any application in Azure.
The guidance in this section is based on using the StackExchange.Redis client to access the cache. A list of other
suitable clients can be found on the Redis website, and these may have different retry mechanisms.
Note that the StackExchange.Redis client uses multiplexing through a single connection. The recommended usage
is to create an instance of the client at application startup and use this instance for all operations against the
cache. For this reason, the connection to the cache is made only once, and so all of the guidance in this section is
related to the retry policy for this initial connection—and not for each operation that accesses the cache.
Retry mechanism
The StackExchange.Redis client uses a connection manager class that is configured through a set of options,
including:
ConnectRetry. The number of times a failed connection to the cache will be retried.
ReconnectRetryPolicy. The retry strategy to use.
ConnectTimeout. The maximum waiting time in milliseconds.
Policy configuration
Retry policies are configured programmatically by setting the options for the client before connecting to the
cache. This can be done by creating an instance of the ConfigurationOptions class, populating its properties,
and passing it to the Connect method.
The built-in classes support linear (constant) delay and exponential backoff with randomized retry intervals. You
can also create a custom retry policy by implementing the IReconnectRetryPolicy interface.
The following example configures a retry strategy using exponential backoff.
var deltaBackOffInMilliseconds = (int)TimeSpan.FromSeconds(5).TotalMilliseconds;
var maxDeltaBackOffInMilliseconds = (int)TimeSpan.FromSeconds(20).TotalMilliseconds;
var options = new ConfigurationOptions
{
EndPoints = {"localhost"},
ConnectRetry = 3,
ReconnectRetryPolicy = new ExponentialRetry(deltaBackOffInMilliseconds, maxDeltaBackOffInMilliseconds),
ConnectTimeout = 2000
};
ConnectionMultiplexer redis = ConnectionMultiplexer.Connect(options, writer);
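If neither built-in strategy fits, a custom policy can implement the IReconnectRetryPolicy interface mentioned above. The following minimal sketch retries quickly for the first two attempts and then no more often than every five seconds; the thresholds are illustrative assumptions, not recommended values:

using StackExchange.Redis;

// A simple custom reconnect policy for the StackExchange.Redis connection manager.
public class FirstFastThenFixedRetry : IReconnectRetryPolicy
{
    public bool ShouldRetry(long currentRetryCount, int timeElapsedMillisecondsSinceLastRetry)
    {
        // Retry immediately for the first two attempts, then only when at least
        // 5 seconds have elapsed since the previous attempt.
        return currentRetryCount <= 2
            || timeElapsedMillisecondsSinceLastRetry >= 5000;
    }
}

You can then assign an instance of this class to the ReconnectRetryPolicy property of the ConfigurationOptions instance before calling Connect.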

Alternatively, you can specify the options as a string, and pass this to the Connect method. The
ReconnectRetryPolicy property cannot be set this way, only through code.

var options = "localhost,connectRetry=3,connectTimeout=2000";


ConnectionMultiplexer redis = ConnectionMultiplexer.Connect(options, writer);

You can also specify options directly when you connect to the cache.

var conn = ConnectionMultiplexer.Connect("redis0:6380,redis1:6380,connectRetry=3");

For more information, see Stack Exchange Redis Configuration in the StackExchange.Redis documentation.
The following are the default settings for the built-in retry policy (ConfigurationOptions, StackExchange.Redis version 1.2.2):

ConnectRetry (default value: 3). The number of times to repeat connect attempts during the initial connection operation.

ConnectTimeout (default value: maximum of 5000 ms plus SyncTimeout). Timeout (ms) for connect operations. Not a delay between retry attempts.

SyncTimeout (default value: 1000). Time (ms) to allow for synchronous operations.

ReconnectRetryPolicy (default value: LinearRetry 5000 ms). Retry every 5000 ms.

NOTE
For synchronous operations, SyncTimeout can add to the end-to-end latency, but setting the value too low can cause
excessive timeouts. See How to troubleshoot Azure Cache for Redis. In general, avoid using synchronous operations, and
use asynchronous operations instead. For more information, see Pipelines and Multiplexers.

Retry usage guidance


Consider the following guidelines when using Azure Cache for Redis:
The StackExchange Redis client manages its own retries, but only when establishing a connection to the cache
when the application first starts. You can configure the connection timeout, the number of retry attempts, and
the time between retries to establish this connection, but the retry policy does not apply to operations against
the cache.
Instead of using a large number of retry attempts, consider falling back by accessing the original data source
instead.
Telemetry
You can collect information about connections (but not other operations) using a TextWriter .

var writer = new StringWriter();


ConnectionMultiplexer redis = ConnectionMultiplexer.Connect(options, writer);

An example of the output this generates is shown below.

localhost:6379,connectTimeout=2000,connectRetry=3
1 unique nodes specified
Requesting tie-break from localhost:6379 > __Booksleeve_TieBreak...
Allowing endpoints 00:00:02 to respond...
localhost:6379 faulted: SocketFailure on PING
localhost:6379 failed to nominate (Faulted)
> UnableToResolvePhysicalConnection on GET
No masters detected
localhost:6379: Standalone v2.0.0, master; keep-alive: 00:01:00; int: Connecting; sub: Connecting; not in
use: DidNotRespond
localhost:6379: int ops=0, qu=0, qs=0, qc=1, wr=0, sync=1, socks=2; sub ops=0, qu=0, qs=0, qc=0, wr=0,
socks=2
Circular op-count snapshot; int: 0 (0.00 ops/s; spans 10s); sub: 0 (0.00 ops/s; spans 10s)
Sync timeouts: 0; fire and forget: 0; last heartbeat: -1s ago
resetting failing connections to retry...
retrying; attempts left: 2...
...

Examples
The following code example configures a constant (linear) delay between retries when initializing the
StackExchange.Redis client. This example shows how to set the configuration using a ConfigurationOptions
instance.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using StackExchange.Redis;

namespace RetryCodeSamples
{
class CacheRedisCodeSamples
{
public async static Task Samples()
{
var writer = new StringWriter();
{
try
{
                    var retryTimeInMilliseconds = (int)TimeSpan.FromSeconds(4).TotalMilliseconds; // delay between retries

// Using object-based configuration.


var options = new ConfigurationOptions
{
EndPoints = { "localhost" },
ConnectRetry = 3,
ReconnectRetryPolicy = new LinearRetry(retryTimeInMilliseconds)
};
ConnectionMultiplexer redis = ConnectionMultiplexer.Connect(options, writer);

// Store a reference to the multiplexer for use in the application.


}
catch
{
Console.WriteLine(writer.ToString());
throw;
}
}
}
}
}

The next example sets the configuration by specifying the options as a string. The connection timeout is the
maximum period of time to wait for a connection to the cache, not the delay between retry attempts. Note that the
ReconnectRetryPolicy property can only be set by code.
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using StackExchange.Redis;

namespace RetryCodeSamples
{
class CacheRedisCodeSamples
{
public async static Task Samples()
{
var writer = new StringWriter();
{
try
{
// Using string-based configuration.
var options = "localhost,connectRetry=3,connectTimeout=2000";
ConnectionMultiplexer redis = ConnectionMultiplexer.Connect(options, writer);

// Store a reference to the multiplexer for use in the application.


}
catch
{
Console.WriteLine(writer.ToString());
throw;
}
}
}
}
}

For more examples, see Configuration on the project website.


More information
Redis website

Azure Search
Azure Search can be used to add powerful and sophisticated search capabilities to a website or application, quickly
and easily tune search results, and construct rich and fine-tuned ranking models.
Retry mechanism
Retry behavior in the Azure Search SDK is controlled by the SetRetryPolicy method on the SearchServiceClient
and SearchIndexClient classes. The default policy retries with exponential backoff when Azure Search returns a 5xx
or 408 (Request Timeout) response.
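For example, a sketch of overriding the default policy might look like the following. This assumes the Microsoft.Rest.TransientFaultHandling types used by the underlying client runtime; the service name, key, and back-off values are placeholders:

using System;
using Microsoft.Azure.Search;
using Microsoft.Rest.TransientFaultHandling;

// Create the client (hypothetical service name and admin key).
var client = new SearchServiceClient("searchservice", new SearchCredentials("<admin-key>"));

// Exponential backoff: up to 5 attempts, delays between 1 and 20 seconds,
// based on a 2-second back-off delta.
client.SetRetryPolicy(
    new RetryPolicy<HttpStatusCodeErrorDetectionStrategy>(
        new ExponentialBackoffRetryStrategy(
            retryCount: 5,
            minBackoff: TimeSpan.FromSeconds(1),
            maxBackoff: TimeSpan.FromSeconds(20),
            deltaBackoff: TimeSpan.FromSeconds(2))));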
Telemetry
Trace with ETW or by registering a custom trace provider. For more information, see the AutoRest documentation.

Service Bus
Service Bus is a cloud messaging platform that provides loosely coupled message exchange with improved scale
and resiliency for components of an application, whether hosted in the cloud or on-premises.
Retry mechanism
Service Bus implements retries using implementations of the abstract RetryPolicy class. The namespace and
some of the configuration details depend on which Service Bus client SDK package is used:
Microsoft.Azure.ServiceBus: the Azure Service Bus client library for .NET Standard. Namespace: Microsoft.Azure.ServiceBus.

WindowsAzure.ServiceBus: the older Service Bus client library. It requires .NET Framework 4.5.2. Namespace: Microsoft.ServiceBus.

Both versions of the client library provide the following built-in implementations of RetryPolicy :
RetryExponential. Implements exponential backoff.
NoRetry. Does not perform retries. Use this class when you don't need retries at the Service Bus API level,
for example when another process manages retries as part of a batch or multistep operation.
The RetryPolicy.Default property returns a default policy of type RetryExponential. This policy object has the
following settings:

MinimalBackoff (default value: 0). Minimum back-off interval. Added to the retry interval computed from deltaBackoff.

MaximumBackoff (default value: 30 seconds). Maximum back-off interval.

DeltaBackoff (default value: 3 seconds). Back-off interval between retries. Multiples of this timespan are used for subsequent retry attempts.

MaxRetryCount (default value: 5; 10 in the WindowsAzure.ServiceBus package). The maximum number of retries.

In addition, the following property is defined in the older WindowsAzure.ServiceBus package:

TerminationTimeBuffer (default value: 5 seconds). Retry attempts will be abandoned if the remaining time is less than this value.

Service Bus actions can return a range of exceptions, listed in Service Bus messaging exceptions. Exceptions
returned from Service Bus expose the IsTransient property that indicates whether the client should retry the
operation. The built-in RetryExponential policy checks this property before retrying.
If the last exception encountered was ServerBusyException, the RetryExponential policy adds 10 seconds to
the computed retry interval. This value cannot be changed.
Custom implementations could use a combination of the exception type and the IsTransient property to provide
more fine-grained control over retry actions. For example, you could detect a QuotaExceededException and
take action to drain the queue before retrying sending a message to it.
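A sketch of such custom handling with the Microsoft.Azure.ServiceBus client might look like the following. Here queueClient and message are assumed to have been created elsewhere, DrainQueueAsync is a hypothetical helper, and the 2-second delay is illustrative only:

try
{
    await queueClient.SendAsync(message);
}
catch (QuotaExceededException)
{
    // The queue has reached its quota: drain or dead-letter before retrying the send.
    await DrainQueueAsync(); // hypothetical helper
    await queueClient.SendAsync(message);
}
catch (ServiceBusException ex) when (ex.IsTransient)
{
    // Transient fault: a single manual retry after a short, illustrative delay.
    await Task.Delay(TimeSpan.FromSeconds(2));
    await queueClient.SendAsync(message);
}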
The following code sets the retry policy on a Service Bus client using the Microsoft.Azure.ServiceBus library:
const string QueueName = "queue1";
const string ServiceBusConnectionString = "<your_connection_string>";

var policy = new RetryExponential(


minimumBackoff: TimeSpan.FromSeconds(10),
maximumBackoff: TimeSpan.FromSeconds(30),
maximumRetryCount: 3);
var queueClient = new QueueClient(ServiceBusConnectionString, QueueName, ReceiveMode.PeekLock, policy);

The retry policy cannot be set at the individual operation level. It applies to all operations for the client.
Retry usage guidance
Consider the following guidelines when using Service Bus:
When using the built-in RetryExponential implementation, do not implement a fallback operation, as the
policy reacts to Server Busy exceptions and automatically switches to an appropriate retry mode.
Service Bus supports a feature called Paired Namespaces that implements automatic failover to a backup
queue in a separate namespace if the queue in the primary namespace fails. Messages from the secondary
queue can be sent back to the primary queue when it recovers. This feature helps to address transient failures.
For more information, see Asynchronous Messaging Patterns and High Availability.
Consider starting with the following settings for retrying operations. These settings are general purpose, and you
should monitor the operations and fine-tune the values to suit your own scenario.

Interactive, UI, or foreground (example maximum latency: 2 seconds*). Retry policy: Exponential. Settings: MinimumBackoff = 0, MaximumBackoff = 30 sec, DeltaBackoff = 300 msec, TimeBuffer = 300 msec, MaxRetryCount = 2. How it works: attempt 1 - delay 0 sec; attempt 2 - delay ~300 msec; attempt 3 - delay ~900 msec.

Background or batch (example maximum latency: 30 seconds). Retry policy: Exponential. Settings: MinimumBackoff = 1, MaximumBackoff = 30 sec, DeltaBackoff = 1.75 sec, TimeBuffer = 5 sec, MaxRetryCount = 3. How it works: attempt 1 - delay ~1 sec; attempt 2 - delay ~3 sec; attempt 3 - delay ~6 sec; attempt 4 - delay ~13 sec.
* Not including additional delay that is added if a Server Busy response is received.
Telemetry
Service Bus logs retries as ETW events using an EventSource. You must attach an EventListener to the event
source to capture the events and view them in Performance Viewer, or write them to a suitable destination log. The
retry events are of the following form:
Microsoft-ServiceBus-Client/RetryPolicyIteration
ThreadID="14,500"
FormattedMessage="[TrackingId:] RetryExponential: Operation Get:https://retry-tests.servicebus.windows.net/TestQueue/?api-version=2014-05 at iteration 0 is retrying after 00:00:00.1000000
sleep because of Microsoft.ServiceBus.Messaging.MessagingCommunicationException: The remote name could not be
resolved: 'retry-tests.servicebus.windows.net'.TrackingId:6a26f99c-dc6d-422e-8565-f89fdd0d4fe3,
TimeStamp:9/5/2014 10:00:13 PM."
trackingId=""
policyType="RetryExponential"
operation="Get:https://retry-tests.servicebus.windows.net/TestQueue/?api-version=2014-05"
iteration="0"
iterationSleep="00:00:00.1000000"
lastExceptionType="Microsoft.ServiceBus.Messaging.MessagingCommunicationException"
exceptionMessage="The remote name could not be resolved: 'retry-
tests.servicebus.windows.net'.TrackingId:6a26f99c-dc6d-422e-8565-f89fdd0d4fe3,TimeStamp:9/5/2014 10:00:13 PM"

Examples
The following code example shows how to set the retry policy for:
A namespace manager. The policy applies to all operations on that manager, and cannot be overridden for
individual operations.
A messaging factory. The policy applies to all clients created from that factory, and cannot be overridden when
creating individual clients.
An individual messaging client. After a client has been created, you can set the retry policy for that client. The
policy applies to all operations on that client.

using System;
using System.Threading.Tasks;
using Microsoft.ServiceBus;
using Microsoft.ServiceBus.Messaging;

namespace RetryCodeSamples
{
class ServiceBusCodeSamples
{
private const string connectionString =
@"Endpoint=sb://[my-namespace].servicebus.windows.net/;
SharedAccessKeyName=RootManageSharedAccessKey;
SharedAccessKey=C99..........Mk=";

public async static Task Samples()


{
const string QueueName = "TestQueue";

ServiceBusEnvironment.SystemConnectivity.Mode = ConnectivityMode.Http;

var namespaceManager = NamespaceManager.CreateFromConnectionString(connectionString);

// The namespace manager will have a default exponential policy with 10 retry attempts
// and a 3 second delay delta.
// Retry delays will be approximately 0 sec, 3 sec, 9 sec, 25 sec and the fixed 30 sec,
            // with an extra 10 sec added when receiving a ServerBusyException.

{
                // Set different values for the retry policy, used for all operations on the namespace manager.
namespaceManager.Settings.RetryPolicy =
new RetryExponential(
minBackoff: TimeSpan.FromSeconds(0),
maxBackoff: TimeSpan.FromSeconds(30),
maxRetryCount: 3);

// Policies cannot be specified on a per-operation basis.


                if (!await namespaceManager.QueueExistsAsync(QueueName))
{
await namespaceManager.CreateQueueAsync(QueueName);
}
}

var messagingFactory = MessagingFactory.Create(


namespaceManager.Address, namespaceManager.Settings.TokenProvider);
// The messaging factory will have a default exponential policy with 10 retry attempts
// and a 3 second delay delta.
// Retry delays will be approximately 0 sec, 3 sec, 9 sec, 25 sec and the fixed 30 sec,
            // with an extra 10 sec added when receiving a ServerBusyException.

{
// Set different values for the retry policy, used for clients created from it.
messagingFactory.RetryPolicy =
new RetryExponential(
minBackoff: TimeSpan.FromSeconds(1),
maxBackoff: TimeSpan.FromSeconds(30),
maxRetryCount: 3);

// Policies cannot be specified on a per-operation basis.


var session = await messagingFactory.AcceptMessageSessionAsync();
}

{
var client = messagingFactory.CreateQueueClient(QueueName);
// The client inherits the policy from the factory that created it.

// Set different values for the retry policy on the client.


client.RetryPolicy =
new RetryExponential(
minBackoff: TimeSpan.FromSeconds(0.1),
maxBackoff: TimeSpan.FromSeconds(30),
maxRetryCount: 3);

// Policies cannot be specified on a per-operation basis.


var session = await client.AcceptMessageSessionAsync();
}
}
}
}

More information
Asynchronous Messaging Patterns and High Availability

Service Fabric
Distributing reliable services in a Service Fabric cluster guards against most of the potential transient faults
discussed in this article. Some transient faults are still possible, however. For example, the naming service might
be in the middle of a routing change when it gets a request, causing it to throw an exception. If the same request
comes 100 milliseconds later, it will probably succeed.
Internally, Service Fabric manages this kind of transient fault. You can configure some settings by using the
OperationRetrySettings class while setting up your services. The following code shows an example. In most cases,
this should not be necessary, and the default settings will be fine.
FabricTransportRemotingSettings transportSettings = new FabricTransportRemotingSettings
{
OperationTimeout = TimeSpan.FromSeconds(30)
};

var retrySettings = new OperationRetrySettings(TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(1), 5);

var clientFactory = new FabricTransportServiceRemotingClientFactory(transportSettings);

var serviceProxyFactory = new ServiceProxyFactory((c) => clientFactory, retrySettings);

var client = serviceProxyFactory.CreateServiceProxy<ISomeService>(


new Uri("fabric:/SomeApp/SomeStatefulReliableService"),
new ServicePartitionKey(0));

More information
Remote exception handling

SQL Database using ADO.NET


SQL Database is a hosted SQL database available in a range of sizes and as both a standard (shared) and premium
(non-shared) service.
Retry mechanism
SQL Database has no built-in support for retries when accessed using ADO.NET. However, the return codes from
requests can be used to determine why a request failed. For more information about SQL Database throttling, see
Azure SQL Database resource limits. For a list of relevant error codes, see SQL error codes for SQL Database client
applications.
You can use the Polly library to implement retries for SQL Database. See Transient fault handling with Polly.
Retry usage guidance
Consider the following guidelines when accessing SQL Database using ADO.NET:
Choose the appropriate service option (shared or premium). A shared instance may suffer longer than usual
connection delays and throttling due to the usage by other tenants of the shared server. If more predictable
performance and reliable low latency operations are required, consider choosing the premium option.
Ensure that you perform retries at the appropriate level or scope to avoid non-idempotent operations causing
inconsistency in the data. Ideally, all operations should be idempotent so that they can be repeated without
causing inconsistency. Where this is not the case, the retry should be performed at a level or scope that allows
all related changes to be undone if one operation fails; for example, from within a transactional scope. For
more information, see Cloud Service Fundamentals Data Access Layer – Transient Fault Handling.
A fixed interval strategy is not recommended for use with Azure SQL Database except for interactive scenarios
where there are only a few retries at very short intervals. Instead, consider using an exponential back-off
strategy for the majority of scenarios.
Choose a suitable value for the connection and command timeouts when defining connections. Too short a
timeout may result in premature failures of connections when the database is busy. Too long a timeout may
prevent the retry logic working correctly by waiting too long before detecting a failed connection. The value of
the timeout is a component of the end-to-end latency; it is effectively added to the retry delay specified in the
retry policy for every retry attempt.
Close the connection after a certain number of retries, even when using an exponential back off retry logic, and
retry the operation on a new connection. Retrying the same operation multiple times on the same connection
can be a factor that contributes to connection problems. For an example of this technique, see Cloud Service
Fundamentals Data Access Layer – Transient Fault Handling.
When connection pooling is in use (the default) there is a chance that the same connection will be chosen from
the pool, even after closing and reopening a connection. If this is the case, a technique to resolve it is to call the
ClearPool method of the SqlConnection class to mark the connection as not reusable. However, you should
do this only after several connection attempts have failed, and only when encountering the specific class of
transient failures such as SQL timeouts (error code -2) related to faulty connections.
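A minimal sketch of this technique follows; the threshold and helper shape are illustrative assumptions, not a prescribed pattern:

using System.Data.SqlClient;

// After several failed attempts on the same connection with a faulty-connection symptom
// (for example, a SQL timeout, error code -2), clear the pool so the next open uses a
// new physical connection. The threshold of 3 attempts is illustrative only.
static void ResetConnectionIfFaulty(SqlConnection connection, SqlException lastError, int failedAttempts)
{
    if (failedAttempts >= 3 && lastError.Number == -2)
    {
        connection.Close();
        SqlConnection.ClearPool(connection); // mark pooled connections as not reusable
    }
}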
If the data access code uses transactions initiated as TransactionScope instances, the retry logic should
reopen the connection and initiate a new transaction scope. For this reason, the retryable code block should
encompass the entire scope of the transaction.
Consider starting with the following settings for retrying operations. These settings are general purpose, and you
should monitor the operations and fine-tune the values to suit your own scenario.

Interactive, UI, or foreground (sample target E2E max latency: 2 sec). Retry strategy: FixedInterval. Settings: retry count 3, retry interval 500 ms, first fast retry true. How it works: attempt 1 - delay 0 sec; attempt 2 - delay 500 ms; attempt 3 - delay 500 ms.

Background or batch (sample target E2E max latency: 30 sec). Retry strategy: ExponentialBackoff. Settings: retry count 5, min back-off 0 sec, max back-off 60 sec, delta back-off 2 sec, first fast retry false. How it works: attempt 1 - delay 0 sec; attempt 2 - delay ~2 sec; attempt 3 - delay ~6 sec; attempt 4 - delay ~14 sec; attempt 5 - delay ~30 sec.

NOTE
The end-to-end latency targets assume the default timeout for connections to the service. If you specify longer connection
timeouts, the end-to-end latency will be extended by this additional time for every retry attempt.

Examples
This section shows how you can use Polly to access Azure SQL Database using a set of retry policies configured in
the Policy class.
The following code shows an extension method on the SqlCommand class that calls ExecuteAsync with exponential
backoff.
public async static Task<SqlDataReader> ExecuteReaderWithRetryAsync(this SqlCommand command)
{
    // GuardConnectionIsNotNull, logger, and cancellationToken are assumed to be defined elsewhere in the class.
    GuardConnectionIsNotNull(command);

    var policy = Policy.Handle<Exception>().WaitAndRetryAsync(
        retryCount: 3, // Retry 3 times
        // Exponential backoff based on an initial 200 ms delay.
        sleepDurationProvider: attempt => TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt - 1)),
        onRetry: (exception, attempt) =>
        {
            // Capture some information for logging/telemetry.
            logger.LogWarn($"ExecuteReaderWithRetryAsync: Retry {attempt} due to {exception}.");
        });

    // Retry the following call according to the policy.
    return await policy.ExecuteAsync<SqlDataReader>(async token =>
    {
        // This code is executed within the Policy.
        var conn = command.Connection;
        if (conn.State != System.Data.ConnectionState.Open) await conn.OpenAsync(token);

        return await command.ExecuteReaderAsync(System.Data.CommandBehavior.Default, token);
    }, cancellationToken);
}

This asynchronous extension method can be used as follows.

var sqlCommand = sqlConnection.CreateCommand();


sqlCommand.CommandText = "[some query]";

using (var reader = await sqlCommand.ExecuteReaderWithRetryAsync())


{
// Do something with the values
}

More information
Cloud Service Fundamentals Data Access Layer – Transient Fault Handling
For general guidance on getting the most from SQL Database, see Azure SQL Database performance and elasticity
guide.

SQL Database using Entity Framework 6


SQL Database is a hosted SQL database available in a range of sizes and as both a standard (shared) and premium
(non-shared) service. Entity Framework is an object-relational mapper that enables .NET developers to work with
relational data using domain-specific objects. It eliminates the need for most of the data-access code that
developers usually need to write.
Retry mechanism
Retry support is provided when accessing SQL Database using Entity Framework 6.0 and higher through a
mechanism called Connection resiliency / retry logic. The main features of the retry mechanism are:
The primary abstraction is the IDbExecutionStrategy interface. This interface:
Defines synchronous and asynchronous Execute methods.
Defines classes that can be used directly or can be configured on a database context as a default
strategy, mapped to provider name, or mapped to a provider name and server name. When configured
on a context, retries occur at the level of individual database operations, of which there might be several
for a given context operation.
Defines when to retry a failed connection, and how.
It includes several built-in implementations of the IDbExecutionStrategy interface:
Default: no retrying.
Default for SQL Database (automatic): no retrying, but inspects exceptions and wraps them with
suggestion to use the SQL Database strategy.
Default for SQL Database: exponential (inherited from base class) plus SQL Database detection logic.
It implements an exponential back-off strategy that includes randomization.
The built-in retry classes are stateful and are not thread-safe. However, they can be reused after the current
operation is completed.
If the specified retry count is exceeded, the results are wrapped in a new exception. It does not bubble up the
current exception.
Policy configuration
Retry support is provided when accessing SQL Database using Entity Framework 6.0 and higher. Retry policies are
configured programmatically. The configuration cannot be changed on a per-operation basis.
When configuring a strategy on the context as the default, you specify a function that creates a new strategy on
demand. The following code shows how you can create a retry configuration class that extends the
DbConfiguration base class.

public class BloggingContextConfiguration : DbConfiguration
{
    public BloggingContextConfiguration()
    {
        // Set up the execution strategy for SQL Database (exponential) with 5 retries and 4 sec delay
        this.SetExecutionStrategy(
            "System.Data.SqlClient", () => new SqlAzureExecutionStrategy(5, TimeSpan.FromSeconds(4)));
    }
}

You can then specify this as the default retry strategy for all operations using the SetConfiguration method of
the DbConfiguration instance when the application starts. By default, EF will automatically discover and use the
configuration class.

DbConfiguration.SetConfiguration(new BloggingContextConfiguration());

You can specify the retry configuration class for a context by annotating the context class with a
DbConfigurationType attribute. However, if you have only one configuration class, EF will use it without the
need to annotate the context.

[DbConfigurationType(typeof(BloggingContextConfiguration))]
public class BloggingContext : DbContext

If you need to use different retry strategies for specific operations, or disable retries for specific operations, you
can create a configuration class that allows you to suspend or swap strategies by setting a flag in the
CallContext. The configuration class can use this flag to switch strategies, or disable the strategy you provide and
use a default strategy. For more information, see Suspend Execution Strategy (EF6 onwards).
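A sketch of such a configuration class, following the pattern described in the EF documentation, is shown below; the class name and flag name are illustrative:

using System.Data.Entity;
using System.Data.Entity.Infrastructure;
using System.Data.Entity.SqlServer;
using System.Runtime.Remoting.Messaging;

public class BlogConfigurationWithSuspend : DbConfiguration
{
    public BlogConfigurationWithSuspend()
    {
        // Choose the strategy each time one is requested, based on a flag in the CallContext.
        this.SetExecutionStrategy("System.Data.SqlClient", () => SuspendExecutionStrategy
            ? (IDbExecutionStrategy)new DefaultExecutionStrategy()
            : new SqlAzureExecutionStrategy());
    }

    // Flag stored in the logical call context so it flows across async calls.
    public static bool SuspendExecutionStrategy
    {
        get { return (bool?)CallContext.LogicalGetData("SuspendExecutionStrategy") ?? false; }
        set { CallContext.LogicalSetData("SuspendExecutionStrategy", value); }
    }
}

Calling code can then set the flag to true around an operation that must manage its own transactions or retries, and reset it afterwards.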
Another technique for using specific retry strategies for individual operations is to create an instance of the
required strategy class and supply the desired settings through parameters. You then invoke its ExecuteAsync
method.
var executionStrategy = new SqlAzureExecutionStrategy(5, TimeSpan.FromSeconds(4));
var blogs = await executionStrategy.ExecuteAsync(
async () =>
{
using (var db = new BloggingContext("Blogs"))
{
// Acquire some values asynchronously and return them
}
},
new CancellationToken()
);

The simplest way to use a DbConfiguration class is to locate it in the same assembly as the DbContext class.
However, this is not appropriate when the same context is required in different scenarios, such as different
interactive and background retry strategies. If the different contexts execute in separate AppDomains, you can use
the built-in support for specifying configuration classes in the configuration file or set it explicitly using code. If the
different contexts must execute in the same AppDomain, a custom solution will be required.
For more information, see Code-Based Configuration (EF6 onwards).
The following are the default settings for the built-in retry policy when using EF6:

Policy: Exponential (exponential back-off).

MaxRetryCount (default value: 5). The maximum number of retries.

MaxDelay (default value: 30 seconds). The maximum delay between retries. This value does not affect how the series of delays are computed. It only defines an upper bound.

DefaultCoefficient (default value: 1 second). The coefficient for the exponential back-off computation. This value cannot be changed.

DefaultRandomFactor (default value: 1.1). The multiplier used to add a random delay for each entry. This value cannot be changed.

DefaultExponentialBase (default value: 2). The multiplier used to calculate the next delay. This value cannot be changed.

Retry usage guidance


Consider the following guidelines when accessing SQL Database using EF6:
Choose the appropriate service option (shared or premium). A shared instance may suffer longer than
usual connection delays and throttling due to the usage by other tenants of the shared server. If predictable
performance and reliable low latency operations are required, consider choosing the premium option.
A fixed interval strategy is not recommended for use with Azure SQL Database. Instead, use an exponential
back-off strategy because the service may be overloaded, and longer delays allow more time for it to
recover.
Choose a suitable value for the connection and command timeouts when defining connections. Base the
timeout on both your business logic design and through testing. You may need to modify this value over
time as the volumes of data or the business processes change. Too short a timeout may result in premature
failures of connections when the database is busy. Too long a timeout may prevent the retry logic working
correctly by waiting too long before detecting a failed connection. The value of the timeout is a component
of the end-to-end latency, although you cannot easily determine how many commands will execute when
saving the context. You can change the default timeout by setting the CommandTimeout property of the
DbContext instance.
Entity Framework supports retry configurations defined in configuration files. However, for maximum
flexibility on Azure you should consider creating the configuration programmatically within the application.
The specific parameters for the retry policies, such as the number of retries and the retry intervals, can be
stored in the service configuration file and used at runtime to create the appropriate policies. This allows
the settings to be changed without requiring the application to be restarted.
Consider starting with the following settings for retrying operations. You cannot specify the delay between retry
attempts (it is fixed and generated as an exponential sequence). You can specify only the maximum values, as
shown here, unless you create a custom retry strategy. These settings are general purpose, and you should
monitor the operations and fine-tune the values to suit your own scenario.

Interactive, UI, or foreground (sample target E2E max latency: 2 seconds). Retry policy: Exponential. Settings: MaxRetryCount 3, MaxDelay 750 ms. How it works: attempt 1 - delay 0 sec; attempt 2 - delay 750 ms; attempt 3 - delay 750 ms.

Background or batch (sample target E2E max latency: 30 seconds). Retry policy: Exponential. Settings: MaxRetryCount 5, MaxDelay 12 seconds. How it works: attempt 1 - delay 0 sec; attempt 2 - delay ~1 sec; attempt 3 - delay ~3 sec; attempt 4 - delay ~7 sec; attempt 5 - delay 12 sec.

NOTE
The end-to-end latency targets assume the default timeout for connections to the service. If you specify longer connection
timeouts, the end-to-end latency will be extended by this additional time for every retry attempt.

Examples
The following code example defines a simple data access solution that uses Entity Framework. It sets a specific
retry strategy by defining an instance of a class named BlogConfiguration that extends DbConfiguration .
using System;
using System.Collections.Generic;
using System.Data.Entity;
using System.Data.Entity.SqlServer;
using System.Threading.Tasks;

namespace RetryCodeSamples
{
public class BlogConfiguration : DbConfiguration
{
public BlogConfiguration()
{
// Set up the execution strategy for SQL Database (exponential) with 5 retries and 12 sec delay.
// These values could be loaded from configuration rather than being hard-coded.
this.SetExecutionStrategy(
"System.Data.SqlClient", () => new SqlAzureExecutionStrategy(5,
TimeSpan.FromSeconds(12)));
}
}

// Specify the configuration type if more than one has been defined.
// [DbConfigurationType(typeof(BlogConfiguration))]
public class BloggingContext : DbContext
{
// Definition of content goes here.
}

class EF6CodeSamples
{
public async static Task Samples()
{
// Execution strategy configured by DbConfiguration subclass, discovered automatically or
            // explicitly indicated through configuration or with an attribute. Default is no retries.
using (var db = new BloggingContext("Blogs"))
{
// Add, edit, delete blog items here, then:
await db.SaveChangesAsync();
}
}
}
}

More examples of using the Entity Framework retry mechanism can be found in Connection resiliency / retry
logic.
More information
Azure SQL Database performance and elasticity guide

SQL Database using Entity Framework Core


Entity Framework Core is an object-relational mapper that enables .NET Core developers to work with data using
domain-specific objects. It eliminates the need for most of the data-access code that developers usually need to
write. This version of Entity Framework was written from the ground up, and doesn't automatically inherit all the
features from EF6.x.
Retry mechanism
Retry support is provided when accessing SQL Database using Entity Framework Core through a mechanism
called connection resiliency. Connection resiliency was introduced in EF Core 1.1.0.
The primary abstraction is the IExecutionStrategy interface. The execution strategy for SQL Server, including SQL
Azure, is aware of the exception types that can be retried and has sensible defaults for maximum retries, delay
between retries, and so on.
Examples
The following code enables automatic retries when configuring the DbContext object, which represents a session
with the database.

protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)


{
    optionsBuilder
        .UseSqlServer(
            @"Server=(localdb)\mssqllocaldb;Database=EFMiscellaneous.ConnectionResiliency;Trusted_Connection=True;",
            options => options.EnableRetryOnFailure());
}
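EnableRetryOnFailure also has overloads that accept explicit settings. The following sketch uses illustrative values; connectionString stands in for your own connection string:

optionsBuilder.UseSqlServer(
    connectionString,
    options => options.EnableRetryOnFailure(
        maxRetryCount: 5,
        maxRetryDelay: TimeSpan.FromSeconds(30),
        errorNumbersToAdd: null)); // extra SQL error numbers to treat as transient, if any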

The following code shows how to execute a transaction with automatic retries, by using an execution strategy. The
transaction is defined in a delegate. If a transient failure occurs, the execution strategy will invoke the delegate
again.

using (var db = new BloggingContext())


{
var strategy = db.Database.CreateExecutionStrategy();

strategy.Execute(() =>
{
using (var transaction = db.Database.BeginTransaction())
{
            db.Blogs.Add(new Blog { Url = "https://blogs.msdn.com/dotnet" });
db.SaveChanges();

            db.Blogs.Add(new Blog { Url = "https://blogs.msdn.com/visualstudio" });


db.SaveChanges();

transaction.Commit();
}
});
}

Azure Storage
Azure Storage services include blob storage, files, and storage queues.
Blobs, Queues and Files
The ClientOptions class is the base type for all client option types and exposes common client options such as
Diagnostics, Retry, and Transport. To provide the client configuration options for connecting to Azure Queue, Blob, and
File storage, use the corresponding derived type. The next example uses the QueueClientOptions class
(derived from ClientOptions) to configure a client to connect to the Azure Queue service. The Retry property is
the set of options that can be specified to influence how retry attempts are made, and how a failure is determined
to be eligible for a retry.
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Core;
using Azure.Identity;
using Azure.Storage;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

namespace RetryCodeSamples
{
class AzureStorageCodeSamples {

public async static Task Samples() {

// Provide the client configuration options for connecting to Azure Queue Storage
            QueueClientOptions queueClientOptions = new QueueClientOptions()
            {
                Retry = {
                    Delay = TimeSpan.FromSeconds(2),      // The delay between retry attempts for a fixed approach, or the delay on which to base calculations for a backoff-based approach
                    MaxRetries = 5,                       // The maximum number of retry attempts before giving up
                    Mode = RetryMode.Exponential,         // The approach to use for calculating retry delays
                    MaxDelay = TimeSpan.FromSeconds(10)   // The maximum permissible delay between retry attempts
                },

                // If the GeoRedundantSecondaryUri property is set, the secondary Uri will be used for GET or HEAD requests during retries.
                // If the status of the response from the secondary Uri is a 404, then subsequent retries for the request will not use the
                // secondary Uri again, as this indicates that the resource may not have propagated there yet.
                // Otherwise, subsequent retries will alternate back and forth between primary and secondary Uri.
                GeoRedundantSecondaryUri = new Uri("https://...")
            };

            Uri queueServiceUri = new Uri("https://storageaccount.queue.core.windows.net/");


string accountName = "Storage account name";
string accountKey = "storage account key";

// Create a client object for the Queue service, including QueueClientOptions.


QueueServiceClient serviceClient = new QueueServiceClient(queueServiceUri, new
DefaultAzureCredential(), queueClientOptions);

CancellationTokenSource source = new CancellationTokenSource();


CancellationToken cancellationToken = source.Token;

            // Return an async collection of queues in the storage account.
            var queues = serviceClient.GetQueuesAsync(QueueTraits.None, null, cancellationToken);
        }
    }
}

Table Support

NOTE
The WindowsAzure.Storage NuGet package has been deprecated. For Azure Table support, see the Microsoft.Azure.Cosmos.Table
NuGet package.

Retry mechanism
Retries occur at the individual REST operation level and are an integral part of the client API implementation. The
client storage SDK uses classes that implement the IExtendedRetryPolicy Interface.
The built-in classes provide support for linear (constant delay) and exponential with randomization retry intervals.
There is also a no retry policy for use when another process is handling retries at a higher level. However, you can
implement your own retry classes if you have specific requirements not provided by the built-in classes.
Alternate retries switch between primary and secondary storage service location if you are using read access geo-
redundant storage (RA-GRS) and the result of the request is a retryable error. See Azure Storage Redundancy
Options for more information.
Policy configuration
Retry policies are configured programmatically. A typical procedure is to create and populate a
TableRequestOptions , BlobRequestOptions , FileRequestOptions , or QueueRequestOptions instance.

TableRequestOptions interactiveRequestOption = new TableRequestOptions()


{
RetryPolicy = new LinearRetry(TimeSpan.FromMilliseconds(500), 3),
// For Read-access geo-redundant storage, use PrimaryThenSecondary.
// Otherwise set this to PrimaryOnly.
LocationMode = LocationMode.PrimaryThenSecondary,
// Maximum execution time based on the business use case.
MaximumExecutionTime = TimeSpan.FromSeconds(2)
};

The request options instance can then be set on the client, and all operations with the client will use the specified
request options.

client.DefaultRequestOptions = interactiveRequestOption;
var stats = await client.GetServiceStatsAsync();

You can override the client request options by passing a populated instance of the request options class as a
parameter to operation methods.

var stats = await client.GetServiceStatsAsync(interactiveRequestOption, operationContext: null);

You use an OperationContext instance to specify the code to execute when a retry occurs and when an
operation has completed. This code can collect information about the operation for use in logs and telemetry.

// Set up notifications for an operation


var context = new OperationContext();
context.ClientRequestID = "some request id";
context.Retrying += (sender, args) =>
{
// Collect retry information
};
context.RequestCompleted += (sender, args) =>
{
// Collect operation completion information
};
var stats = await client.GetServiceStatsAsync(null, context);

In addition to indicating whether a failure is suitable for retry, the extended retry policies return a RetryContext
object that indicates the number of retries, the results of the last request, and whether the next retry will happen in the
primary or secondary location (see the default settings below for details). The properties of the RetryContext object can be
used to decide if and when to attempt a retry. For more information, see IExtendedRetryPolicy.Evaluate Method.
The following are the default settings for the built-in retry policies.
Request options:

MaximumExecutionTime (default value: none). Maximum execution time for the request, including all potential retry attempts. If it is not specified, then the amount of time that a request is permitted to take is unlimited. In other words, the request might stop responding.

ServerTimeout (default value: none). Server timeout interval for the request (value is rounded to seconds). If not specified, it will use the default value for all requests to the server. Usually, the best option is to omit this setting so that the server default is used.

LocationMode (default value: none). If the storage account is created with the Read access geo-redundant storage (RA-GRS) replication option, you can use the location mode to indicate which location should receive the request. For example, if PrimaryThenSecondary is specified, requests are always sent to the primary location first. If a request fails, it is sent to the secondary location.

RetryPolicy (default value: ExponentialPolicy). See below for details of each option.

Exponential policy:

maxAttempt (default value: 3). Number of retry attempts.

deltaBackoff (default value: 4 seconds). Back-off interval between retries. Multiples of this timespan, including a random element, will be used for subsequent retry attempts.

MinBackoff (default value: 3 seconds). Added to all retry intervals computed from deltaBackoff. This value cannot be changed.

MaxBackoff (default value: 120 seconds). MaxBackoff is used if the computed retry interval is greater than MaxBackoff. This value cannot be changed.

Linear policy:

maxAttempt (default value: 3). Number of retry attempts.

deltaBackoff (default value: 30 seconds). Back-off interval between retries.

Retry usage guidance
Consider the following guidelines when accessing Azure storage services using the storage client API:
Use the built-in retry policies from the Microsoft.Azure.Storage.RetryPolicies namespace where they are
appropriate for your requirements. In most cases, these policies will be sufficient.
Use the ExponentialRetry policy in batch operations, background tasks, or non-interactive scenarios. In
these scenarios, you can typically allow more time for the service to recover, with a consequently
increased chance of the operation eventually succeeding.
Consider specifying the MaximumExecutionTime property of the RequestOptions parameter to limit
the total execution time, but take into account the type and size of the operation when choosing a timeout
value.
If you need to implement a custom retry, avoid creating wrappers around the storage client classes. Instead,
use the capabilities to extend the existing policies through the IExtendedRetryPolicy interface.
If you are using read access geo-redundant storage (RA-GRS) you can use the LocationMode to specify
that retry attempts will access the secondary read-only copy of the store should the primary access fail.
However, when using this option you must ensure that your application can work successfully with data
that may be stale if the replication from the primary store has not yet completed.
Consider starting with the following settings for retrying operations. These settings are general purpose, and you
should monitor the operations and fine-tune the values to suit your own scenario.

Interactive, UI, or foreground (sample target E2E max latency: 2 seconds). Retry policy: Linear. Settings: maxAttempt 3, deltaBackoff 500 ms. How it works: attempt 1 - delay 500 ms; attempt 2 - delay 500 ms; attempt 3 - delay 500 ms.

Background or batch (sample target E2E max latency: 30 seconds). Retry policy: Exponential. Settings: maxAttempt 5, deltaBackoff 4 seconds. How it works: attempt 1 - delay ~3 sec; attempt 2 - delay ~7 sec; attempt 3 - delay ~15 sec.
Telemetry
Retry attempts are logged to a TraceSource . You must configure a TraceListener to capture the events and write
them to a suitable destination log. You can use the TextWriterTraceListener or XmlWriterTraceListener to
write the data to a log file, the EventLogTraceListener to write to the Windows Event Log, or the
EventProviderTraceListener to write trace data to the ETW subsystem. You can also configure autoflushing of
the buffer, and the verbosity of events that will be logged (for example, Error, Warning, Informational, and
Verbose). For more information, see Client-side Logging with the .NET Storage Client Library.
Operations can receive an OperationContext instance, which exposes a Retrying event that can be used to
attach custom telemetry logic. For more information, see OperationContext.Retrying Event.
Examples
The following code example shows how to create two TableRequestOptions instances with different retry
settings; one for interactive requests and one for background requests. The example then sets these two retry
policies on the client so that they apply for all requests, and also sets the interactive strategy on a specific request
so that it overrides the default settings applied to the client.

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos.Table;

namespace RetryCodeSamples
{
class AzureStorageCodeSamples
{
private const string connectionString = "UseDevelopmentStorage=true";

public async static Task Samples()


{
var storageAccount = CloudStorageAccount.Parse(connectionString);

TableRequestOptions interactiveRequestOption = new TableRequestOptions()


{
RetryPolicy = new LinearRetry(TimeSpan.FromMilliseconds(500), 3),
// For Read-access geo-redundant storage, use PrimaryThenSecondary.
// Otherwise set this to PrimaryOnly.
LocationMode = LocationMode.PrimaryThenSecondary,
// Maximum execution time based on the business use case.
MaximumExecutionTime = TimeSpan.FromSeconds(2)
};

TableRequestOptions backgroundRequestOption = new TableRequestOptions()


{
// Client has a default exponential retry policy with 4 sec delay and 3 retry attempts
// Retry delays will be approximately 3 sec, 7 sec, and 15 sec
MaximumExecutionTime = TimeSpan.FromSeconds(30),
                    // PrimaryThenSecondary in case of Read-access geo-redundant storage, else set this to PrimaryOnly
                    LocationMode = LocationMode.PrimaryThenSecondary
};

var client = storageAccount.CreateCloudTableClient();


// Client has a default exponential retry policy with 4 sec delay and 3 retry attempts
// Retry delays will be approximately 3 sec, 7 sec, and 15 sec
// ServerTimeout and MaximumExecutionTime are not set

{
// Set properties for the client (used on all requests unless overridden)
// Different exponential policy parameters for background scenarios
client.DefaultRequestOptions = backgroundRequestOption;
// Linear policy for interactive scenarios
client.DefaultRequestOptions = interactiveRequestOption;
}

{
// set properties for a specific request
var stats = await client.GetServiceStatsAsync(interactiveRequestOption, operationContext:
null);
}

{
// Set up notifications for an operation
var context = new OperationContext();
context.ClientRequestID = "some request id";
context.Retrying += (sender, args) =>
{
// Collect retry information
};
context.RequestCompleted += (sender, args) =>
{
// Collect operation completion information
};
                var stats = await client.GetServiceStatsAsync(null, context);
}
}
}
}

More information
Azure Storage client Library retry policy recommendations
Storage Client Library 2.0 – Implementing retry policies

General REST and retry guidelines


Consider the following when accessing Azure or third-party services:
Use a systematic approach to managing retries, perhaps as reusable code, so that you can apply a
consistent methodology across all clients and all solutions.
Consider using a retry framework such as Polly to manage retries if the target service or client has no built-
in retry mechanism. This will help you implement a consistent retry behavior, and it may provide a suitable
default retry strategy for the target service. However, you may need to create custom retry code for
services that have nonstandard behavior, that do not rely on exceptions to indicate transient failures, or if
you want to use a Retry-Response reply to manage retry behavior.
The transient detection logic will depend on the actual client API you use to invoke the REST calls. Some
clients, such as the newer HttpClient class, will not throw exceptions for completed requests with a non-
success HTTP status code.
The HTTP status code returned from the service can help to indicate whether the failure is transient. You
may need to examine the exceptions generated by a client or the retry framework to access the status code
or to determine the equivalent exception type. The following HTTP codes typically indicate that a retry is
appropriate:
408 Request Timeout
429 Too Many Requests
500 Internal Server Error
502 Bad Gateway
503 Service Unavailable
504 Gateway Timeout
If you base your retry logic on exceptions, the following typically indicate a transient failure where no
connection could be established:
WebExceptionStatus.ConnectionClosed
WebExceptionStatus.ConnectFailure
WebExceptionStatus.Timeout
WebExceptionStatus.RequestCanceled
In the case of a service unavailable status, the service might indicate the appropriate delay before retrying
in the Retry-After response header or a different custom header. Services might also send additional
information as custom headers, or embedded in the content of the response.
Do not retry for status codes representing client errors (errors in the 4xx range) except for a 408 Request
Timeout and 429 Too Many Requests.
Thoroughly test your retry strategies and mechanisms under a range of conditions, such as different
network states and varying system loadings.
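As an illustration of these status-code rules, the following sketch shows a simple HttpClient helper that retries 408, 429, and 5xx responses and honors the Retry-After header when present. The helper name, attempt limit, and back-off values are illustrative; this is not part of any specific SDK:

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// The request factory is needed because an HttpRequestMessage cannot be sent twice.
public static async Task<HttpResponseMessage> SendWithRetriesAsync(
    HttpClient client, Func<HttpRequestMessage> requestFactory, int maxAttempts = 3)
{
    for (int attempt = 1; ; attempt++)
    {
        HttpResponseMessage response = await client.SendAsync(requestFactory());

        bool retryable =
            response.StatusCode == HttpStatusCode.RequestTimeout ||    // 408
            (int)response.StatusCode == 429 ||                         // Too Many Requests
            (int)response.StatusCode >= 500;                           // 5xx

        if (!retryable || attempt >= maxAttempts)
        {
            return response;
        }

        // Use the server-supplied delay if available; otherwise fall back to exponential back-off.
        TimeSpan delay = response.Headers.RetryAfter?.Delta
            ?? TimeSpan.FromSeconds(Math.Pow(2, attempt));

        response.Dispose();
        await Task.Delay(delay);
    }
}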
Retry strategies
The following are the typical types of retry strategy intervals:
Exponential. A retry policy that performs a specified number of retries, using a randomized exponential
back-off approach to determine the interval between retries. For example:

var random = new Random();

var delta = (int)((Math.Pow(2.0, currentRetryCount) - 1.0) *


random.Next((int)(this.deltaBackoff.TotalMilliseconds * 0.8),
(int)(this.deltaBackoff.TotalMilliseconds * 1.2)));
var interval = (int)Math.Min(checked(this.minBackoff.TotalMilliseconds + delta),
this.maxBackoff.TotalMilliseconds);
retryInterval = TimeSpan.FromMilliseconds(interval);

Incremental. A retry strategy with a specified number of retry attempts and an incremental time interval
between retries. For example:

retryInterval = TimeSpan.FromMilliseconds(this.initialInterval.TotalMilliseconds +
(this.increment.TotalMilliseconds * currentRetryCount));

LinearRetry. A retry policy that performs a specified number of retries, using a specified fixed time
interval between retries. For example:

retryInterval = this.deltaBackoff;

Transient fault handling with Polly


Polly is a library to programmatically handle retries and circuit breaker strategies. The Polly project is a member of
the .NET Foundation. For services where the client does not natively support retries, Polly is a valid alternative and
avoids the need to write custom retry code, which can be hard to implement correctly. Polly also provides a way to
trace errors when they occur, so that you can log retries.
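As a minimal illustration, a Polly retry policy can be combined with a circuit breaker as shown below. This sketch assumes Polly's WaitAndRetryAsync, CircuitBreakerAsync, and WrapAsync APIs; the values (3 retries, break after 5 failures for 30 seconds) are arbitrary:

using System;
using System.Net.Http;
using Polly;

// Retry transient HTTP failures with exponential back-off.
var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

// Open the circuit after 5 consecutive handled failures, for 30 seconds.
var circuitBreakerPolicy = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(handledEventsAllowedBeforeBreaking: 5, durationOfBreak: TimeSpan.FromSeconds(30));

// Wrap the two policies so each call is retried, and the circuit opens after repeated failures.
var resilientPolicy = Policy.WrapAsync(retryPolicy, circuitBreakerPolicy);

// Usage: await resilientPolicy.ExecuteAsync(() => httpClient.GetAsync(requestUri));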
More information
connection resiliency
Data Points - EF Core 1.1
Transient fault handling

All applications that communicate with remote services and resources must be sensitive to transient faults. This is
especially the case for applications that run in the cloud, where the nature of the environment and connectivity
over the Internet means these types of faults are likely to be encountered more often. Transient faults include the
momentary loss of network connectivity to components and services, the temporary unavailability of a service, or
timeouts that arise when a service is busy. These faults are often self-correcting, and if the action is repeated after a
suitable delay it is likely to succeed.
This document covers general guidance for transient fault handling. For information about handling transient
faults when using Microsoft Azure services, see Azure service-specific retry guidelines.

Why do transient faults occur in the cloud?


Transient faults can occur in any environment, on any platform or operating system, and in any kind of application.
In solutions that run on local on-premises infrastructure, the performance and availability of the application and its
components is typically maintained through expensive and often underused hardware redundancy, and
components and resources are located close to each other. While this approach makes a failure less likely, it can
still result in transient faults - and even an outage through unforeseen events such as external power supply or
network issues, or other disaster scenarios.
Cloud hosting, including private cloud systems, can offer higher overall availability by using shared resources,
redundancy, automatic failover, and dynamic resource allocation across many commodity compute nodes.
However, the nature of these environments can mean that transient faults are more likely to occur. There are
several reasons for this:
Many resources in a cloud environment are shared, and access to these resources is subject to throttling in
order to protect the resource. Some services will refuse connections when the load rises to a specific level,
or a maximum throughput rate is reached, in order to allow processing of existing requests and to maintain
performance of the service for all users. Throttling helps to maintain the quality of service for neighbors
and other tenants using the shared resource.
Cloud environments are built using vast numbers of commodity hardware units. They deliver performance
by dynamically distributing the load across multiple computing units and infrastructure components, and
deliver reliability by automatically recycling or replacing failed units. This dynamic nature means that
transient faults and temporary connection failures may occasionally occur.
There are often more hardware components, including network infrastructure such as routers and load
balancers, between the application and the resources and services it uses. This additional infrastructure can
occasionally introduce additional connection latency and transient connection faults.
Network conditions between the client and the server may be variable, especially when communication
crosses the Internet. Even in on-premises locations, heavy traffic loads may slow communication and cause
intermittent connection failures.

Challenges
Transient faults can have a huge effect on the perceived availability of an application, even if it has been thoroughly
tested under all foreseeable circumstances. To ensure that cloud-hosted applications operate reliably, they must be
able to respond to the following challenges:
The application must be able to detect faults when they occur, and determine if these faults are likely to be
transient, more long-lasting, or are terminal failures. Different resources are likely to return different
responses when a fault occurs, and these responses may also vary depending on the context of the
operation; for example, the response for an error when reading from storage may be different from
response for an error when writing to storage. Many resources and services have well-documented
transient failure contracts. However, where such information is not available, it may be difficult to discover
the nature of the fault and whether it is likely to be transient.
The application must be able to retry the operation if it determines that the fault is likely to be transient and
keep track of the number of times the operation was retried.
The application must use an appropriate strategy for the retries. This strategy specifies the number of times
it should retry, the delay between each attempt, and the actions to take after a failed attempt. The
appropriate number of attempts and the delay between each one are often difficult to determine, and vary
based on the type of resource as well as the current operating conditions of the resource and the
application itself.

General guidelines
The following guidelines will help you to design a suitable transient fault handling mechanism for your
applications:
Determine if there is a built-in retry mechanism:
Many services provide an SDK or client library that contains a transient fault handling mechanism.
The retry policy it uses is typically tailored to the nature and requirements of the target service.
Alternatively, REST interfaces for services may return information that is useful in determining
whether a retry is appropriate, and how long to wait before the next retry attempt.
Use the built-in retry mechanism where available, unless you have specific and well-understood
requirements that make a different retry behavior more appropriate.
Determine if the operation is suitable for retrying:
You should only retry operations where the faults are transient (typically indicated by the nature of
the error), and if there is at least some likelihood that the operation will succeed when reattempted.
There is no point in reattempting operations that indicate an invalid operation such as a database
update to an item that does not exist, or requests to a service or resource that has suffered a fatal
error.
In general, you should implement retries only where the full impact of this can be determined, and
the conditions are well understood and can be validated. If not, leave it to the calling code to
implement retries. Remember that the errors returned from resources and services outside your
control may evolve over time, and you may need to revisit your transient fault detection logic.
When you create services or components, consider implementing error codes and messages that will
help clients determine whether they should retry failed operations. In particular, indicate if the client
should retry the operation (perhaps by returning an isTransient value) and suggest a suitable delay
before the next retry attempt. If you build a web service, consider returning custom errors defined
within your service contracts. Even though generic clients may not be able to read these, they will be
useful when building custom clients.
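As an illustrative sketch, a custom service might return an error payload along these lines; the type and property names are hypothetical, not an established contract:

using System;

// Hypothetical error contract for a custom service, indicating whether a retry is worthwhile.
public class ServiceError
{
    public string Code { get; set; }            // Machine-readable error code.
    public string Message { get; set; }         // Human-readable description.
    public bool IsTransient { get; set; }       // True if the client may retry the operation.
    public TimeSpan? RetryAfter { get; set; }   // Suggested delay before the next attempt.
}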
Determine an appropriate retry count and interval:
It is vital to optimize the retry count and the interval to the type of use case. If you do not retry a
sufficient number of times, the application will be unable to complete the operation and is likely to
experience a failure. If you retry too many times, or with too short an interval between tries, the
application can potentially hold resources such as threads, connections, and memory for long
periods, which will adversely affect the health of the application.
The appropriate values for the time interval and the number of retry attempts depend on the type of
operation being attempted. For example, if the operation is part of a user interaction, the interval
should be short and only a few retries attempted to avoid making users wait for a response (which
holds open connections and can reduce availability for other users). If the operation is part of a long
running or critical workflow, where canceling and restarting the process is expensive or time-
consuming, it is appropriate to wait longer between attempts and retry more times.
Determining the appropriate intervals between retries is the most difficult part of designing a
successful strategy. Typical strategies use the following types of retry interval:
Exponential back-off. The application waits a short time before the first retry, and then
exponentially increasing times between each subsequent retry. For example, it may retry the
operation after 3 seconds, 12 seconds, 30 seconds, and so on.
Incremental intervals. The application waits a short time before the first retry, and then
incrementally increasing times between each subsequent retry. For example, it may retry the
operation after 3 seconds, 7 seconds, 13 seconds, and so on.
Regular intervals. The application waits for the same period of time between each attempt.
For example, it may retry the operation every 3 seconds.
Immediate retry. Sometimes a transient fault is brief, perhaps due to an event such as a
network packet collision or a spike in a hardware component. In this case, retrying the
operation immediately is appropriate because it may succeed if the fault has cleared in the
time it takes the application to assemble and send the next request. However, there should
never be more than one immediate retry attempt, and you should switch to alternative
strategies, such as exponential back-off or fallback actions, if the immediate retry fails.
Randomization. Any of the retry strategies listed above may include a randomization to
prevent multiple instances of the client sending subsequent retry attempts at the same time.
For example, one instance may retry the operation after 3 seconds, 11 seconds, 28 seconds,
and so on, while another instance may retry the operation after 4 seconds, 12 seconds, 26
seconds, and so on. Randomization is a useful technique that may be combined with other
strategies.
As a general guideline, use an exponential back-off strategy for background operations, and
immediate or regular interval retry strategies for interactive operations. In both cases, you should
choose the delay and the retry count so that the maximum latency for all retry attempts is within the
required end-to-end latency requirement.
Take into account the combination of all the factors that contribute to the overall maximum timeout
for a retried operation. These factors include the time taken for a failed connection to produce a
response (typically set by a timeout value in the client) as well as the delay between retry attempts
and the maximum number of retries. The total of all these times can result in long overall operation
times, especially when using an exponential delay strategy where the interval between retries grows
rapidly after each failure. If a process must meet a specific service level agreement (SLA), the overall
operation time, including all timeouts and delays, must be within the limits defined in the SLA.
Overly aggressive retry strategies, which have intervals that are too short or retries that are too
frequent, can have an adverse effect on the target resource or service. This may prevent the resource
or service from recovering from its overloaded state, and it will continue to block or refuse requests.
This results in a vicious circle where more and more requests are sent to the resource or service, and
consequently its ability to recover is further reduced.
Take into account the timeout of the operations when choosing the retry intervals to avoid launching
a subsequent attempt immediately (for example, if the timeout period is similar to the retry interval).
Also consider if you need to keep the total possible period (the timeout plus the retry intervals) to
below a specific total time. Operations that have unusually short or very long timeouts may
influence how long to wait, and how often to retry the operation.
Use the type of the exception and any data it contains, or the error codes and messages returned
from the service, to optimize the interval and the number of retries. For example, some exceptions or
error codes (such as the HTTP code 503 Service Unavailable with a Retry-After header in the
response) may indicate how long the error might last, or that the service has failed and will not
respond to any subsequent attempt.
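As a minimal sketch (assuming an HttpClient-based caller), honoring a Retry-After hint might look like the following; the class and method names are illustrative:

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class RetryAfterExample
{
    // Returns the delay suggested by the service, or null if the caller should use its own strategy.
    public static async Task<TimeSpan?> GetSuggestedDelayAsync(HttpClient client, string url)
    {
        HttpResponseMessage response = await client.GetAsync(url);

        if (response.StatusCode == HttpStatusCode.ServiceUnavailable &&
            response.Headers.RetryAfter?.Delta != null)
        {
            // Honor the service's hint rather than retrying on a fixed schedule.
            return response.Headers.RetryAfter.Delta;
        }

        return null;
    }
}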
Avoid anti-patterns:
In the vast majority of cases, you should avoid implementations that include duplicated layers of
retry code. Avoid designs that include cascading retry mechanisms, or that implement retry at every
stage of an operation that involves a hierarchy of requests, unless you have specific requirements
that demand this. In these exceptional circumstances, use policies that prevent excessive numbers of
retries and delay periods, and make sure you understand the consequences. For example, if one
component makes a request to another, which then accesses the target service, and you implement
retry with a count of three on both calls there will be nine retry attempts in total against the service.
Many services and resources implement a built-in retry mechanism and you should investigate how
you can disable or modify this if you need to implement retries at a higher level.
Never implement an endless retry mechanism. This is likely to prevent the resource or service
recovering from overload situations, and cause throttling and refused connections to continue for a
longer period. Use a finite number of retries, or implement a pattern such as Circuit Breaker to allow
the service to recover.
Never perform an immediate retry more than once.
Avoid using a regular retry interval, especially when you have a large number of retry attempts,
when accessing services and resources in Azure. The optimum approach in this scenario is an
exponential back-off strategy with a circuit-breaking capability.
Prevent multiple instances of the same client, or multiple instances of different clients, from sending
retries at the same times. If this is likely to occur, introduce randomization into the retry intervals.
Test your retry strategy and implementation:
Ensure you fully test your retry strategy implementation under as wide a set of circumstances as
possible, especially when both the application and the target resources or services it uses are under
extreme load. To check behavior during testing, you can:
Inject transient and nontransient faults into the service. For example, send invalid requests or
add code that detects test requests and responds with different types of errors. For an
example using TestApi, see Fault Injection Testing with TestApi and Introduction to TestApi –
Part 5: Managed Code Fault Injection APIs.
Create a mock of the resource or service that returns a range of errors that the real service
may return. Ensure you cover all the types of error that your retry strategy is designed to
detect.
Force transient errors to occur by temporarily disabling or overloading the service if it is a
custom service that you created and deployed (of course, you should not attempt to overload
any shared resources or shared services within Azure).
For HTTP-based APIs, consider using the FiddlerCore library in your automated tests to
change the outcome of HTTP requests, either by adding extra roundtrip times or by changing
the response (such as the HTTP status code, headers, body, or other factors). This enables
deterministic testing of a subset of the failure conditions, whether transient faults or other
types of failure. For more information, see FiddlerCore. For examples of how to use the library,
particularly the HttpMangler class, examine the source code for the Azure Storage SDK.
Perform high load factor and concurrent tests to ensure that the retry mechanism and
strategy works correctly under these conditions, and does not have an adverse effect on the
operation of the client or cause cross-contamination between requests.
Manage retry policy configurations:
A retry policy is a combination of all of the elements of your retry strategy. It defines the detection
mechanism that determines whether a fault is likely to be transient, the type of interval to use (such
as regular, exponential back-off, and randomization), the actual interval value(s), and the number of
times to retry.
Retries must be implemented in many places within even the simplest application, and in every layer
of more complex applications. Rather than hard-coding the elements of each policy at multiple
locations, consider using a central point for storing all the policies. For example, store the values such
as the interval and retry count in application configuration files, read them at runtime, and
programmatically build the retry policies. This makes it easier to manage the settings, and to modify
and fine-tune the values in order to respond to changing requirements and scenarios. However,
design the system to store the values rather than rereading a configuration file every time, and
ensure suitable defaults are used if the values cannot be obtained from configuration.
In an Azure Cloud Services application, consider storing the values that are used to build the retry
policies at runtime in the service configuration file so that they can be changed without needing to
restart the application.
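A minimal sketch of building a policy from centrally stored values, assuming Polly and Microsoft.Extensions.Configuration are available; the configuration keys and defaults are hypothetical:

using System;
using Microsoft.Extensions.Configuration;
using Polly;
using Polly.Retry;

public static class RetryPolicyFactory
{
    // Build the policy once at startup and cache it, rather than rereading configuration per call.
    public static AsyncRetryPolicy Create(IConfiguration configuration)
    {
        // Fall back to sensible defaults if the values are missing from configuration.
        int maxAttempts = configuration.GetValue("Retry:MaxAttempts", 3);
        int baseDelaySeconds = configuration.GetValue("Retry:BaseDelaySeconds", 2);

        return Policy
            .Handle<TimeoutException>()
            .WaitAndRetryAsync(
                maxAttempts,
                attempt => TimeSpan.FromSeconds(baseDelaySeconds * Math.Pow(2, attempt - 1)));
    }
}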
Take advantage of built-in or default retry strategies available in the client APIs you use, but only
where they are appropriate for your scenario. These strategies are typically general purpose. In some
scenarios they may be all that is required, but in other scenarios they may not offer the full range of
options to suit your specific requirements. You must understand how the settings will affect your
application through testing to determine the most appropriate values.
Log and track transient and nontransient faults:
As part of your retry strategy, include exception handling and other instrumentation that logs when
retry attempts are made. While an occasional transient failure and retry are to be expected, and do
not indicate a problem, regular and increasing numbers of retries are often an indicator of an issue
that may cause a failure, or is currently degrading application performance and availability.
Log transient faults as Warning entries rather than Error entries so that monitoring systems do not
detect them as application errors that may trigger false alerts.
Consider storing a value in your log entries that indicates if the retries were caused by throttling in
the service, or by other types of faults such as connection failures, so that you can differentiate them
during analysis of the data. An increase in the number of throttling errors is often an indicator of a
design flaw in the application or the need to switch to a premium service that offers dedicated
hardware.
Consider measuring and logging the overall time taken for operations that include a retry
mechanism. This is a good indicator of the overall effect of transient faults on user response times,
process latency, and the efficiency of the application use cases. Also log the number of retries that
occurred, in order to understand the factors that contributed to the response time.
Consider implementing a telemetry and monitoring system that can raise alerts when the number
and rate of failures, the average number of retries, or the overall times taken for operations to
succeed, is increasing.
Manage operations that continually fail:
There will be circumstances where the operation continues to fail at every attempt, and it is vital to
consider how you will handle this situation:
Although a retry strategy will define the maximum number of times that an operation should
be retried, it does not prevent the application repeating the operation again, with the same
number of retries. For example, if an order processing service fails with a fatal error that puts
it out of action permanently, the retry strategy may detect a connection timeout and consider
it to be a transient fault. The code will retry the operation a specified number of times and
then give up. However, when another customer places an order, the operation will be
attempted again - even though it is sure to fail every time.
To prevent continual retries for operations that continually fail, consider implementing the
Circuit Breaker pattern. In this pattern, if the number of failures within a specified time
window exceeds the threshold, requests are returned to the caller immediately as errors,
without attempting to access the failed resource or service.
The application can periodically test the service, on an intermittent basis and with long
intervals between requests, to detect when it becomes available. An appropriate interval will
depend on the scenario, such as the criticality of the operation and the nature of the service,
and might be anything between a few minutes and several hours. At the point where the test
succeeds, the application can resume normal operations and pass requests to the newly
recovered service.
In the meantime, it may be possible to fall back to another instance of the service (perhaps in
a different datacenter or application), use a similar service that offers compatible (perhaps
simpler) functionality, or perform some alternative operations in the hope that the service will
become available soon. For example, it may be appropriate to store requests for the service in
a queue or data store and replay them later. Otherwise you might be able to redirect the user
to an alternative instance of the application, degrade the performance of the application but
still offer acceptable functionality, or just return a message to the user indicating that the
application is not available at present.
Other considerations
When deciding on the values for the number of retries and the retry intervals for a policy, consider if
the operation on the service or resource is part of a long-running or multistep operation. It may be
difficult or expensive to compensate all the other operational steps that have already succeeded
when one fails. In this case, a very long interval and a large number of retries may be acceptable as
long as it does not block other operations by holding or locking scarce resources.
Consider if retrying the same operation may cause inconsistencies in data. If some parts of a
multistep process are repeated, and the operations are not idempotent, it may result in an
inconsistency. For example, an operation that increments a value, if repeated, will produce an invalid
result. Repeating an operation that sends a message to a queue may cause an inconsistency in the
message consumer if it cannot detect duplicate messages. To prevent this, ensure that you design
each step as an idempotent operation. For more information about idempotency, see Idempotency
patterns.
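One common way to make a message-handling step idempotent is to deduplicate by message identifier. The following is a minimal sketch; the store interface and class names are hypothetical:

using System;
using System.Threading.Tasks;

// Hypothetical store that records which message IDs have already been processed.
public interface IProcessedMessageStore
{
    Task<bool> TryMarkProcessedAsync(string messageId); // Returns false if already processed.
}

public class IdempotentHandler
{
    private readonly IProcessedMessageStore store;

    public IdempotentHandler(IProcessedMessageStore store) => this.store = store;

    public async Task HandleAsync(string messageId, Func<Task> processAsync)
    {
        // A redelivered or retried message is skipped, so repeating the step has no extra effect.
        if (await this.store.TryMarkProcessedAsync(messageId))
        {
            await processAsync();
        }
    }
}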
Consider the scope of the operations that will be retried. For example, it may be easier to implement
retry code at a level that encompasses several operations, and retry them all if one fails. However,
doing this may result in idempotency issues or unnecessary rollback operations.
If you choose a retry scope that encompasses several operations, take into account the total latency
of all of them when determining the retry intervals, when monitoring the time taken, and before
raising alerts for failures.
Consider how your retry strategy may affect neighbors and other tenants in a shared application, or
when using shared resources and services. Aggressive retry policies can cause an increasing number
of transient faults to occur for these other users and for applications that share the resources and
services. Likewise, your application may be affected by the retry policies implemented by other users
of the resources and services. For mission-critical applications, you may decide to use premium
services that are not shared. This provides you with much more control over the load and
consequent throttling of these resources and services, which can help to justify the additional cost.

More information
Azure service-specific retry guidelines
Circuit Breaker pattern
Compensating Transaction pattern
Idempotency patterns
Performance tuning a distributed application

In this series, we walk through several cloud application scenarios, showing how a development team used load
tests and metrics to diagnose performance issues. These articles are based on actual load testing that we
performed when developing example applications. The code for each scenario is available on GitHub.
Scenarios:
Distributed business transaction
Calling multiple backend services
Event stream processing

What is performance?
Performance is frequently measured in terms of throughput, response time, and availability. Performance targets
should be based on business operations. Customer-facing tasks may have more stringent requirements than
operational tasks such as generating reports.
Define a service level objective (SLO) that sets performance targets for each workload. You typically achieve this
by breaking a performance target into a set of Key Performance Indicators (KPIs), such as:
Latency or response time of specific requests
The number of requests performed per second
The rate at which the system generates exceptions.
Performance targets should explicitly include a target load. Also, not all users will receive exactly the same level of
performance, even when accessing the system simultaneously and performing the same work. So an SLO should
be framed in terms of percentiles.
An example SLO might be: "Client requests will have a response within 500 ms @ P90, at loads up to 25 K
requests/second."

Challenges of performance tuning a distributed system


It can be especially challenging to diagnose performance issues in a distributed application. Some of the
challenges are:
A single business transaction or operation typically involves multiple components of the system. It can be
hard to get a holistic end-to-end view of a single operation.
Resource consumption is distributed across multiple nodes. To get a consistent view, you need to aggregate
logs and metrics in one place.
The cloud offers elastic scale. Autoscaling is an important technique for handling spikes in load, but it can
also mask underlying issues. Also, it can be hard to know which components need to scale and when.
Cascading failures can cause failures upstream of the root problem. As a result, the first signal of the
problem may appear in a different component than the root cause.

General best practices


Performance tuning is both an art and a science, but it can be made closer to science by taking a systematic
approach. Here are some best practices:
Enable telemetry to collect metrics. Instrument your code. Follow best practices for monitoring. Use
correlated tracing so that you can view all the steps in a transaction.
Monitor the 90/95/99 percentiles, not just average. The average can mask outliers. The sampling rate for
metrics also matters. If the sampling rate is too low, it can hide spikes or outliers that might indicate
problems.
Attack one bottleneck at a time. Form a hypothesis and test it by changing one variable at a time. Removing
one bottleneck will often uncover another bottleneck further upstream or downstream.
Errors and retries can have a large impact on performance. If you see that you are being throttled by
backend services, scale out or try to optimize usage (for example by tuning database queries).
Look for common performance anti-patterns.
Look for opportunities to parallelize. Two common sources of bottlenecks are message queues and
databases. In both cases, sharding can help. For more information, see Horizontal, vertical, and functional
data partitioning. Look for hot partitions that might indicate imbalanced read or write loads.

Next steps
Read the performance tuning scenarios
Distributed business transaction
Calling multiple backend services
Event stream processing
Performance tuning scenario: Distributed business
transactions

This article describes how a development team used metrics to find bottlenecks and improve the performance of a
distributed system. The article is based on actual load testing that we did for a sample application. The application
code is available on GitHub.
This article is part of a series. Read the first part here.
Scenario: A client application initiates a business transaction that involves multiple steps.
This scenario involves a drone delivery application that runs on Azure Kubernetes Service (AKS). Customers use a
web app to schedule deliveries by drone. Each transaction requires multiple steps that are performed by separate
microservices on the back end:
The Delivery service manages deliveries.
The Drone Scheduler service schedules drones for pickup.
The Package service manages packages.
There are two other services: An Ingestion service that accepts client requests and puts them on a queue for
processing, and a Workflow service that coordinates the steps in the workflow.

For more information about this scenario, see Designing a microservices architecture.

Test 1: Baseline
For the first load test, the team created a 6-node AKS cluster and deployed three replicas of each microservice. The
load test was a step-load test, starting at two simulated users and ramping up to 40 simulated users.

SETTING         VALUE
Cluster nodes   6
Pods            3 per service

The following graph shows the results of the load test, as shown in Visual Studio. The purple line plots user load,
and the orange line plots total requests.
The first thing to realize about this scenario is that client requests per second is not a useful metric of performance.
That's because the application processes requests asynchronously, so the client gets a response right away. The
response code is always HTTP 202 (Accepted), meaning the request was accepted but processing is not complete.
What we really want to know is whether the backend is keeping up with the request rate. The Service Bus queue
can absorb spikes, but if the backend cannot handle a sustained load, processing will fall further and further
behind.
Here's a more informative graph. It plots the number of incoming and outgoing messages on the Service Bus queue.
Incoming messages are shown in light blue, and outgoing messages are shown in dark blue:

This chart shows that the rate of incoming messages increases, reaching a peak and then dropping back to
zero at the end of the load test. But the number of outgoing messages peaks early in the test and then actually
drops. That means the Workflow service, which handles the requests, isn't keeping up. Even after the load test ends
(around 9:22 on the graph), messages are still being processed as the Workflow service continues to drain the
queue.
What's slowing down the processing? The first thing to look for is errors or exceptions that might indicate a
systematic issue. The Application Map in Azure Monitor shows the graph of calls between components, and is a
quick way to spot issues and then click through to get more details.
Sure enough, the Application Map shows that the Workflow service is getting errors from the Delivery service:
To see more details, you can select a node in the graph and click into an end-to-end transaction view. In this case, it
shows that the Delivery service is returning HTTP 500 errors. The error messages indicate that an exception is
being thrown due to memory limits in Azure Cache for Redis.

You may notice that these calls to Redis don't appear in the Application Map. That's because the .NET library for
Application Insights doesn't have built-in support for tracking Redis as a dependency. (For a list of what's
supported out of the box, see Dependency auto-collection.) As a fallback, you can use the TrackDependency API to
track any dependency. Load testing often reveals these kinds of gaps in the telemetry, which can be remediated.

Test 2: Increased cache size


For the second load test, the development team increased the cache size in Azure Cache for Redis. (See How to
Scale Azure Cache for Redis.) This change resolved the out-of-memory exceptions, and now the Application Map
shows zero errors:

However, there is still a dramatic lag in processing messages. At the peak of the load test, the incoming message
rate is more than 5× the outgoing rate:

The following graph measures throughput in terms of message completion — that is, the rate at which the
Workflow service marks the Service Bus messages as completed. Each point on the graph represents 5 seconds of
data, showing ~16/sec maximum throughput.
This graph was generated by running a query in the Log Analytics workspace, using the Kusto query language:

let start=datetime("2019-07-31T22:30:00.000Z");
let end=datetime("2019-07-31T22:45:00.000Z");
dependencies
| where cloud_RoleName == 'fabrikam-workflow'
| where timestamp > start and timestamp < end
| where type == 'Azure Service Bus'
| where target has 'https://dev-i-iuosnlbwkzkau.servicebus.windows.net'
| where client_Type == "PC"
| where name == "Complete"
| summarize succeeded=sumif(itemCount, success == true), failed=sumif(itemCount, success == false) by
bin(timestamp, 5s)
| render timechart

Test 3: Scale out the backend services


It appears the back end is the bottleneck. An easy next step is to scale out the business services (Package, Delivery,
and Drone Scheduler), and see if throughput improves. For the next load test, the team scaled these services out
from three replicas to six replicas.

SETTING                                       VALUE
Cluster nodes                                 6
Ingestion service                             3 replicas
Workflow service                              3 replicas
Package, Delivery, Drone Scheduler services   6 replicas each

Unfortunately this load test shows only modest improvement. Outgoing messages are still not keeping up with
incoming messages:
Throughput is more consistent, but the maximum achieved is about the same as the previous test:

Moreover, looking at Azure Monitor for containers, it appears the problem is not caused by resource exhaustion
within the cluster. First, the node-level metrics show that CPU utilization remains under 40% even at the 95th
percentile, and memory utilization is about 20%.

In a Kubernetes environment, it's possible for individual pods to be resource-constrained even when the nodes
aren't. But the pod-level view shows that all pods are healthy.
From this test, it seems that just adding more pods to the back end won't help. The next step is to look more
closely at the Workflow service to understand what's happening when it processes messages. Application Insights
shows that the average duration of the Workflow service's Process operation is 246 ms.

We can also run a query to get metrics on the individual operations within each transaction:

TARGET                                                                PERCENTILE_DURATION_50   PERCENTILE_DURATION_95
https://dev-i-iuosnlbwkzkau.servicebus.windows.net/ | dev-i-iuosnlbwkzkau   86.66950203              283.4255578
delivery                                                              37                       57
package                                                               12                       17
dronescheduler                                                        21                       41

The first row in this table represents the Service Bus queue. The other rows are the calls to the backend services.
For reference, here's the Log Analytics query for this table:

let start=datetime("2019-07-31T22:30:00.000Z");
let end=datetime("2019-07-31T22:45:00.000Z");
let dataset=dependencies
| where timestamp > start and timestamp < end
| where (cloud_RoleName == 'fabrikam-workflow')
| where name == 'Complete' or target in ('package', 'delivery', 'dronescheduler');
dataset
| summarize percentiles(duration, 50, 95) by target
These latencies look reasonable. But here is the key insight: If the total operation time is ~250 ms, that puts a strict
upper bound on how fast messages can be processed in serial. The key to improving throughput, therefore, is
greater parallelism.
That should be possible in this scenario, for two reasons:
These are network calls, so most of the time is spent waiting for I/O completion
The messages are independent, and don't need to be processed in order.

Test 4: Increase parallelism


For this test, the team focused on increasing parallelism. To do so, they adjusted two settings on the Service Bus
client used by the Workflow service:

SETTING              DESCRIPTION                                                   DEFAULT   NEW VALUE
MaxConcurrentCalls   The maximum number of messages to process concurrently.      1         20
PrefetchCount        How many messages the client will fetch ahead of time into   0         3000
                     its local cache.
For more information about these settings, see Best Practices for performance improvements using Service Bus
Messaging. Running the test with these settings produced the following graph:
Recall that incoming messages are shown in light blue, and outgoing messages are shown in dark blue.
At first glance, this is a very weird graph. For a while, the outgoing message rate exactly tracks the incoming rate.
But then, at about the 2:03 mark, the rate of incoming messages levels off, while the number of outgoing
messages continues to rise, actually exceeding the total number of incoming messages. That seems impossible.
The clue to this mystery can be found in the Dependencies view in Application Insights. This chart summarizes all
of the calls that the Workflow service made to Service Bus:

Notice the entry for DeadLetter. That call indicates messages are going into the Service Bus dead-letter queue.
To understand what's happening, you need to understand Peek-Lock semantics in Service Bus. When a client uses
Peek-Lock, Service Bus atomically retrieves and locks a message. While the lock is held, the message is guaranteed
not to be delivered to other receivers. If the lock expires, the message becomes available to other receivers. After a
maximum number of delivery attempts (which is configurable), Service Bus will put the messages onto a dead-
letter queue, where it can be examined later.
Remember that the Workflow service is prefetching large batches of messages (3000 messages at a time). That
means the total time to process each message is longer, which results in messages timing out, going back onto the
queue, and eventually going into the dead-letter queue.
You can also see this behavior in the exceptions, where numerous MessageLockLostException exceptions are
recorded:

Test 5: Increase lock duration


For this load test, the message lock duration was set to 5 minutes, to prevent lock timeouts. The graph of incoming
and outgoing messages now shows that the system is keeping up with the rate of incoming messages:

Over the total duration of the 8-minute load test, the application completed 25 K operations, with a peak
throughput of 72 operations/sec, representing a 400% increase in maximum throughput.

However, running the same test with a longer duration showed that the application could not sustain this rate:
The container metrics show that maximum CPU utilization was close to 100%. At this point, the application
appears to be CPU-bound. Scaling the cluster might improve performance now, unlike the previous attempt at
scaling out.
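As a hedged sketch, the lock duration can be raised on the queue itself, for example with the Service Bus ManagementClient; the connection string and queue name parameters are assumptions:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.ServiceBus.Management;

public static class QueueSetup
{
    public static async Task SetLockDurationAsync(string connectionString, string queueName)
    {
        var managementClient = new ManagementClient(connectionString);

        // Raise the peek-lock duration so slow batches are not redelivered and dead-lettered.
        QueueDescription queue = await managementClient.GetQueueAsync(queueName);
        queue.LockDuration = TimeSpan.FromMinutes(5);
        await managementClient.UpdateQueueAsync(queue);
    }
}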

Test 6: Scale out the backend services (again)


For the final load test in the series, the team scaled out the Kubernetes cluster and pods as follows:

SETTING                                       VALUE
Cluster nodes                                 12
Ingestion service                             3 replicas
Workflow service                              6 replicas
Package, Delivery, Drone Scheduler services   9 replicas each

This test resulted in a higher sustained throughput, with no significant lags in processing messages. Moreover,
node CPU utilization stayed below 80%.
Summary
For this scenario, the following bottlenecks were identified:
Out-of-memory exceptions in Azure Cache for Redis.
Lack of parallelism in message processing.
Insufficient message lock duration, leading to lock timeouts and messages being placed in the dead letter
queue.
CPU exhaustion.
To diagnose these issues, the development team relied on the following metrics:
The rate of incoming and outgoing Service Bus messages.
Application Map in Application Insights.
Errors and exceptions.
Custom Log Analytics queries.
CPU and memory utilization in Azure Monitor for containers.

Next steps
For more information about the design of this scenario, see Designing a microservices architecture.
Performance tuning scenario: Multiple backend
services

This article describes how a development team used metrics to find bottlenecks and improve the performance of a
distributed system. The article is based on actual load testing that we did for a sample application. The application
code is available on GitHub, along with the Visual Studio load test project used to generate the results.
This article is part of a series. Read the first part here.
Scenario: Call multiple backend services to retrieve information and then aggregate the results.
This scenario involves a drone delivery application. Clients can query a REST API to get their latest invoice
information. The invoice includes a summary of the customer's deliveries, packages, and total drone utilization.
This application uses a microservices architecture running on Azure Kubernetes Service (AKS), and the information
needed for the invoice is spread across several microservices.
Rather than the client calling each service directly, the application implements the Gateway Aggregation pattern.
Using this pattern, the client makes a single request to a gateway service. The gateway in turn calls the backend
services in parallel, and then aggregates the results into a single response payload.

Test 1: Baseline performance


To establish a baseline, the development team started with a step-load test, ramping the load from one simulated
user up to 40 users, over a total duration of 8 minutes. The following chart, taken from Visual Studio, shows the
results. The purple line shows the user load, and the orange line shows throughput (average requests per second).
The red line along the bottom of the chart shows that no errors were returned to the client, which is encouraging.
However, the average throughput peaks about half way through the test, and then drops off for the remainder,
even while the load continues to increase. That indicates the back end is not able to keep up. The pattern seen here
is common when a system starts to reach resource limits — after reaching a maximum, throughput actually falls
significantly. Resource contention, transient errors, or an increase in the rate of exceptions can all contribute to this
pattern.
Let's dig into the monitoring data to learn what's happening inside the system. The next chart is taken from
Application Insights. It shows the average durations of the HTTP calls from the gateway to the backend services.

This chart shows that one operation in particular, GetDroneUtilization, takes much longer on average — by an
order of magnitude. The gateway makes these calls in parallel, so the slowest operation determines how long it
takes for the entire request to complete.
Clearly the next step is to dig into the GetDroneUtilization operation and look for any bottlenecks. One possibility is
resource exhaustion. Perhaps this particular backend service is running out of CPU or memory. For an AKS cluster,
this information is available in the Azure portal through the Azure Monitor for containers feature. The following
graphs show resource utilization at the cluster level:
In this screenshot, both the average and maximum values are shown. It's important to look at more than just the
average, because the average can hide spikes in the data. Here, the average CPU utilization stays below 50%, but
there are a couple of spikes to 80%. That's close to capacity but still within tolerances. Something else is causing
the bottleneck.
The next chart reveals the true culprit. This chart shows HTTP response codes from the Delivery service's backend
database, which in this case is Cosmos DB. The blue line represents success codes (HTTP 2xx), while the green line
represents HTTP 429 errors. An HTTP 429 return code means that Cosmos DB is temporarily throttling requests,
because the caller is consuming more resource units (RU) than provisioned.

To get further insight, the development team used Application Insights to view the end-to-end telemetry for a
representative sample of requests. Here is one instance:
This view shows the calls related to a single client request, along with timing information and response codes. The
top-level calls are from the gateway to the backend services. The call to GetDroneUtilization is expanded to show
calls to external dependencies — in this case, to Cosmos DB. The call in red returned an HTTP 429 error.
Note the large gap between the HTTP 429 error and the next call. When the Cosmos DB client library receives an
HTTP 429 error, it automatically backs off and waits to retry the operation. What this view shows is that during the
672 ms this operation took, most of that time was spent waiting to retry Cosmos DB.
Here's another interesting graph for this analysis. It shows RU consumption per physical partition versus
provisioned RUs per physical partition:

To make sense of this graph, you need to understand how Cosmos DB manages partitions. Collections in Cosmos
DB can have a partition key. Each possible key value defines a logical partition of the data within the collection.
Cosmos DB distributes these logical partitions across one or more physical partitions. The management of physical
partitions is handled automatically by Cosmos DB. As you store more data, Cosmos DB might move logical
partitions into new physical partitions, in order to spread load across the physical partitions.
For this load test, the Cosmos DB collection was provisioned with 900 RUs. The chart shows 100 RU per physical
partition, which implies a total of nine physical partitions. Although Cosmos DB automatically handles the sharding
of physical partitions, knowing the partition count can give insight into performance. The development team will
use this information later, as they continue to optimize. Where the blue line crosses the purple horizontal line, RU
consumption has exceeded the provisioned RUs. That's the point where Cosmos DB will begin to throttle calls.

Test 2: Increase resource units


For the second load test, the team scaled out the Cosmos DB collection from 900 RU to 2500 RU. Throughput
increased from 19 requests/second to 23 requests/second, and average latency dropped from 669 ms to 569 ms.

METRIC                 TEST 1   TEST 2
Throughput (req/sec)   19       23
Average latency (ms)   669      569
Successful requests    9.8 K    11 K

These aren't huge gains, but looking at the graph over time shows a more complete picture:

Whereas the previous test showed an initial spike followed by a sharp drop, this test shows more consistent
throughput. However, the maximum throughput is not significantly higher.
All requests to Cosmos DB returned a 2xx status, and the HTTP 429 errors went away:
The graph of RU consumption versus provisioned RUs shows there is plenty of headroom. There are about 275
RUs per physical partition, and the load test peaked at about 100 RUs consumed per second.

Another interesting metric is the number of calls to Cosmos DB per successful operation:

METRIC                TEST 1   TEST 2
Calls per operation   11       9

Assuming no errors, the number of calls should match the actual query plan. In this case, the operation involves a
cross-partition query that hits all nine physical partitions. The higher value in the first load test reflects the number
of calls that returned a 429 error.
This metric was calculated by running a custom Log Analytics query:

let start=datetime("2019-06-18T20:59:00.000Z");
let end=datetime("2019-07-24T21:10:00.000Z");
let operationNameToEval="GET DroneDeliveries/GetDroneUtilization";
let dependencyType="Azure DocumentDB";
let dataset=requests
| where timestamp > start and timestamp < end
| where success == true
| where name == operationNameToEval;
dataset
| project reqOk=itemCount
| summarize
SuccessRequests=sum(reqOk),
TotalNumberOfDepCalls=(toscalar(dependencies
| where timestamp > start and timestamp < end
| where type == dependencyType
| summarize sum(itemCount)))
| project
OperationName=operationNameToEval,
DependencyName=dependencyType,
SuccessRequests,
AverageNumberOfDepCallsPerOperation=(TotalNumberOfDepCalls/SuccessRequests)

To summarize, the second load test shows improvement. However, the GetDroneUtilization operation still takes
about an order of magnitude longer than the next-slowest operation. Looking at the end-to-end transactions helps
to explain why:

As mentioned earlier, the GetDroneUtilization operation involves a cross-partition query to Cosmos DB. This
means the Cosmos DB client has to fan out the query to each physical partition and collect the results. As the end-
to-end transaction view shows, these queries are being performed in serial. The operation takes as long as the sum
of all the queries — and this problem will only get worse as the size of the data grows and more physical partitions
are added.
Test 3: Parallel queries
Based on the previous results, an obvious way to reduce latency is to issue the queries in parallel. The Cosmos DB
client SDK has a setting that controls the maximum degree of parallelism.

VALUE   DESCRIPTION
0       No parallelism (default)
>0      Maximum number of parallel calls
-1      The client SDK selects an optimal degree of parallelism
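A minimal sketch of applying this setting with the Cosmos DB .NET SDK v2 FeedOptions; the client, collection URI, and query text are illustrative assumptions:

using System;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.Documents.Linq;

public static class ParallelQuery
{
    public static IDocumentQuery<dynamic> Create(DocumentClient client, Uri collectionUri)
    {
        var options = new FeedOptions
        {
            EnableCrossPartitionQuery = true,
            MaxDegreeOfParallelism = -1   // Let the SDK pick an optimal degree of parallelism.
        };

        return client.CreateDocumentQuery<dynamic>(collectionUri, "SELECT * FROM c", options)
                     .AsDocumentQuery();
    }
}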

For the third load test, this setting was changed from 0 to -1. The following table summarizes the results:

METRIC                 TEST 1   TEST 2   TEST 3
Throughput (req/sec)   19       23       42
Average latency (ms)   669      569      215
Successful requests    9.8 K    11 K     20 K
Throttled requests     2.72 K   0        0

From the load test graph, not only is the overall throughput much higher (the orange line), but throughput also
keeps pace with the load (the purple line).

We can verify that the Cosmos DB client is making queries in parallel by looking at the end-to-end transaction
view:
Interestingly, a side effect of increasing the throughput is that the number of RUs consumed per second also
increases. Although Cosmos DB did not throttle any requests during this test, the consumption was close to the
provisioned RU limit:

This graph might be a signal to further scale out the database. However, it turns out that we can optimize the query
instead.

Test 4: Optimize the query


The previous load test showed better performance in terms of latency and throughput. Average request latency
was reduced by 68% and throughput increased 220%. However, the cross-partition query is a concern.
The problem with cross-partition queries is that you pay for RU across every partition. If the query is only run
occasionally — say, once an hour — it might not matter. But whenever you see a read-heavy workload that
involves a cross-partition query, you should see whether the query can be optimized by including a partition key.
(You might need to redesign the collection to use a different partition key.)
Here's the query for this particular scenario:

SELECT * FROM c
WHERE c.ownerId = <ownerIdValue> and
c.year = <yearValue> and
c.month = <monthValue>

This query selects records that match a particular owner ID and month/year. In the original design, none of these
properties is the partition key. That requires the client to fan out the query to each physical partition and gather
the results. To improve query performance, the development team changed the design so that owner ID is the
partition key for the collection. That way, the query can target a specific physical partition. (Cosmos DB handles this
automatically; you don't have to manage the mapping between partition key values and physical partitions.)
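A hedged sketch of issuing the rewritten query against a single logical partition with the SDK v2; the method signature and parameter values are assumptions:

using System;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.Documents.Linq;

public static class InvoiceQuery
{
    public static IDocumentQuery<dynamic> Create(
        DocumentClient client, Uri collectionUri, string ownerId, int year, int month)
    {
        // With ownerId as the partition key, the query targets a single partition.
        var options = new FeedOptions { PartitionKey = new PartitionKey(ownerId) };

        var query = new SqlQuerySpec(
            "SELECT * FROM c WHERE c.ownerId = @ownerId AND c.year = @year AND c.month = @month",
            new SqlParameterCollection
            {
                new SqlParameter("@ownerId", ownerId),
                new SqlParameter("@year", year),
                new SqlParameter("@month", month)
            });

        return client.CreateDocumentQuery<dynamic>(collectionUri, query, options)
                     .AsDocumentQuery();
    }
}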
After switching the collection to the new partition key, there was a dramatic improvement in RU consumption,
which translates directly into lower costs.

METRIC                TEST 1   TEST 2   TEST 3   TEST 4
RUs per operation     29       29       29       3.4
Calls per operation   11       9        10       1

The end-to-end transaction view shows that as predicted, the query reads only one physical partition:

The load test shows improved throughput and latency:

METRIC                 TEST 1   TEST 2   TEST 3   TEST 4
Throughput (req/sec)   19       23       42       59
Average latency (ms)   669      569      215      176
Successful requests    9.8 K    11 K     20 K     29 K
Throttled requests     2.72 K   0        0        0

A consequence of the improved performance is that node CPU utilization becomes very high:
Toward the end of the load test, average CPU reached about 90%, and maximum CPU reached 100%. This metric
indicates that CPU is the next bottleneck in the system. If higher throughput is needed, the next step might be
scaling out the Delivery service to more instances.

Summary
For this scenario, the following bottlenecks were identified:
Cosmos DB throttling requests due to insufficient RUs provisioned.
High latency caused by querying multiple database partitions in serial.
Inefficient cross-partition query, because the query did not include the partition key.
In addition, CPU utilization was identified as a potential bottleneck at higher scale. To diagnose these issues, the
development team looked at:
Latency and throughput from the load test.
Cosmos DB errors and RU consumption.
The end-to-end transaction view in Application Insights.
CPU and memory utilization in Azure Monitor for containers.

Next steps
Review performance antipatterns
Performance tuning scenario: Event streaming with
Azure Functions

This article describes how a development team used metrics to find bottlenecks and improve the performance of a
distributed system. The article is based on actual load testing that we did for a sample application. The application
code is available on GitHub.
This article is part of a series. Read the first part here.
Scenario: Process a stream of events using Azure Functions.

In this scenario, a fleet of drones sends position data in real time to Azure IoT Hub. A Functions app receives the
events, transforms the data into GeoJSON format, and writes the transformed data to Cosmos DB. Cosmos DB has
native support for geospatial data, and Cosmos DB collections can be indexed for efficient spatial queries. For
example, a client application could query for all drones within 1 km of a given location, or find all drones within a
certain area.
These processing requirements are simple enough that they don't require a full-fledged stream processing engine.
In particular, the processing doesn't join streams, aggregate data, or process across time windows. Based on these
requirements, Azure Functions is a good fit for processing the messages. Cosmos DB can also scale to support
very high write throughput.

Monitoring throughput
This scenario presents an interesting performance challenge. The data rate per device is known, but the number of
devices may fluctuate. For this business scenario, the latency requirements are not particularly stringent. The
reported position of a drone only needs to be accurate within a minute. That said, the function app must keep up
with the average ingestion rate over time.
IoT Hub stores messages in a log stream. Incoming messages are appended to the tail of the stream. A reader of
the stream — in this case, the function app — controls its own rate of traversing the stream. This decoupling of the
read and write paths makes IoT Hub very efficient, but also means that a slow reader can fall behind. To detect this
condition, the development team added a custom metric to measure message lateness. This metric records the
delta between when a message arrives at IoT Hub, and when the function receives the message for processing.

var ticksUTCNow = DateTimeOffset.UtcNow;

// Track whether messages are arriving at the function late.
DateTime? firstMsgEnqueuedTicksUtc = messages[0]?.EnqueuedTimeUtc;
if (firstMsgEnqueuedTicksUtc.HasValue)
{
    CustomTelemetry.TrackMetric(
        context,
        "IoTHubMessagesReceivedFreshnessMsec",
        (ticksUTCNow - firstMsgEnqueuedTicksUtc.Value).TotalMilliseconds);
}
The TrackMetric method writes a custom metric to Application Insights. For information about using TrackMetric
inside an Azure Function, see Custom telemetry in C# function.
If the function keeps up with the volume of messages, this metric should stay at a low steady state. Some latency is
unavoidable, so the value will never be zero. But if the function falls behind, the delta between enqueued time and
processing time will start to go up.

Test 1: Baseline
The first load test showed an immediate problem: The Function app consistently received HTTP 429 errors from
Cosmos DB, indicating that Cosmos DB was throttling the write requests.

In response, the team scaled Cosmos DB by increasing the number of RUs allocated for the collection, but the errors
continued. This seemed strange, because their back-of-envelope calculation showed that Cosmos DB should have
no problem keeping up with the volume of write requests.
Later that day, one of the developers sent the following email to the team:

I looked at Cosmos DB for the warm path. There's one thing I don't understand. The partition key is deliveryId,
however we don't send deliveryId to Cosmos DB. Am I missing something?

That was the clue. Looking at the partition heat map, it turned out that all of the documents were landing on the
same partition.
What you want to see in the heat map is an even distribution across all of the partitions. In this case, because every
document was getting written to the same partition, adding RUs didn't help. The problem turned out to be a bug in
the code. Although the Cosmos DB collection had a partition key, the Azure Function didn't actually include the
partition key in the document. For more information about the partition heat map, see Determine the throughput
distribution across partitions.

Test 2: Fix partitioning issue


When the team deployed a code fix and re-ran the test, Cosmos DB stopped throttling. For a while, everything
looked good. But at a certain load, telemetry showed that the function was writing fewer documents than it should.
The following graph shows messages received from IoT Hub versus documents written to Cosmos DB. The yellow
line is the number of messages received per batch, and the green line is the number of documents written per batch. These
should be proportional. Instead, the number of database write operations per batch drops significantly at about
07:30.
The next graph shows the latency between when a message arrives at IoT Hub from a device, and when the
function app processes that message. You can see that at the same point in time, the lateness spikes dramatically,
levels off, and then declines.

The reason the value peaks at 5 minutes and then drops to zero is that the function app discards messages
that are more than 5 minutes late:

foreach (var message in messages)
{
    // Drop stale messages.
    if (message.EnqueuedTimeUtc < cutoffTime)
    {
        log.Info($"Dropping late message batch. Enqueued time = {message.EnqueuedTimeUtc}, Cutoff = {cutoffTime}");
        droppedMessages++;
        continue;
    }

You can see this in the graph when the lateness metric drops back to zero. In the meantime, data has been lost,
because the function was throwing away messages.
What was happening? For this particular load test, the Cosmos DB collection had RUs to spare, so the bottleneck
was not at the database. Rather, the problem was in the message processing loop. Simply put, the function was not
writing documents quickly enough to keep up with the incoming volume of messages. Over time, it fell further and
further behind.
Test 3: Parallel writes
If the time to process a message is the bottleneck, one solution is to process more messages in parallel. In this
scenario:
Increase the number of IoT Hub partitions. Each IoT Hub partition gets assigned one function instance at a time,
so we would expect throughput to scale linearly with the number of partitions.
Parallelize the document writes within the function.
To explore the second option, the team modified the function to support parallel writes. The original version of the
function used the Cosmos DB output binding. The optimized version calls the Cosmos DB client directly and
performs the writes in parallel using Task.WhenAll:

private async Task<(long documentsUpserted,
                    long droppedMessages,
                    long cosmosDbTotalMilliseconds)>
    ProcessMessagesFromEventHub(
        int taskCount,
        int numberOfDocumentsToUpsertPerTask,
        EventData[] messages,
        TraceWriter log)
{
    DateTimeOffset cutoffTime = DateTimeOffset.UtcNow.AddMinutes(-5);

    var tasks = new List<Task>();

    for (var i = 0; i < taskCount; i++)
    {
        var docsToUpsert = messages
            .Skip(i * numberOfDocumentsToUpsertPerTask)
            .Take(numberOfDocumentsToUpsertPerTask);
        // The client will attempt to create connections to the data nodes
        // in the Cosmos DB cluster over a range of port numbers.
        tasks.Add(UpsertDocuments(i, docsToUpsert, cutoffTime, log));
    }

    await Task.WhenAll(tasks);

    return (this.UpsertedDocuments,
            this.DroppedMessages,
            this.CosmosDbTotalMilliseconds);
}

Note that race conditions are possible with this approach. Suppose that two messages from the same drone happen to
arrive in the same batch of messages. By writing them in parallel, the earlier message could overwrite the later
message. For this particular scenario, the application can tolerate losing an occasional message. Drones send new
position data every 5 seconds, so the data in Cosmos DB is updated continually. In other scenarios, however, it may
be important to process messages strictly in order.
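For scenarios where ordering within a batch matters, one mitigation is to collapse each batch to the latest message per device before performing the parallel upsert. The following is a minimal sketch, not taken from the reference implementation; the DronePosition type and its property names are assumptions made for illustration only.

using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative message shape - not the type used in the reference implementation.
public record DronePosition(string DeviceId, DateTime EnqueuedTimeUtc, double Latitude, double Longitude);

public static class BatchDeduplicator
{
    // Keep only the newest position update per drone, so parallel upserts within a
    // batch cannot overwrite a newer message with an older one.
    public static IEnumerable<DronePosition> LatestPerDevice(IEnumerable<DronePosition> batch) =>
        batch
            .GroupBy(p => p.DeviceId)
            .Select(g => g.OrderByDescending(p => p.EnqueuedTimeUtc).First());
}

With the batch reduced to one document per drone, the parallel writes can no longer race against each other for the same partition key.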
After deploying this code change, the application was able to ingest more than 2500 requests/sec, using an IoT
Hub with 32 partitions.

Client-side considerations
Overall client experience may be diminished by aggressive parallelization on the server side. Consider using the
Azure Cosmos DB bulk executor library (not shown in this implementation), which significantly reduces the client-
side compute resources needed to saturate the throughput allocated to a Cosmos DB container. A single-threaded
application that writes data using the bulk import API achieves nearly ten times greater write throughput than a
multi-threaded application that writes data in parallel while saturating the client machine's CPU.
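The bulk executor library itself is not shown here. As a rough illustration of the same idea, the newer Azure Cosmos DB .NET SDK (Microsoft.Azure.Cosmos) exposes comparable bulk support through the AllowBulkExecution option. The sketch below is not the code used in this scenario; it assumes a container partitioned on deliveryId and an illustrative DeviceState item type.

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Illustrative item type; the id and DeliveryId property names are assumptions.
public record DeviceState(string id, string DeliveryId, double Latitude, double Longitude);

public class BulkWriter
{
    private readonly Container container;

    public BulkWriter(string connectionString, string databaseId, string containerId)
    {
        // Bulk mode lets the SDK group and dispatch operations efficiently, reducing
        // the client-side CPU needed to saturate the container's provisioned throughput.
        var client = new CosmosClient(connectionString, new CosmosClientOptions
        {
            AllowBulkExecution = true
        });
        container = client.GetContainer(databaseId, containerId);
    }

    public Task UpsertAllAsync(IEnumerable<DeviceState> items)
    {
        var tasks = new List<Task>();
        foreach (var item in items)
        {
            // Each upsert carries its partition key; the SDK batches them behind the scenes.
            tasks.Add(container.UpsertItemAsync(item, new PartitionKey(item.DeliveryId)));
        }
        return Task.WhenAll(tasks);
    }
}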
Summary
For this scenario, the following bottlenecks were identified:
Hot write partition, due to a missing partition key value in the documents being written.
Writing documents in serial per IoT Hub partition.
To diagnose these issues, the development team relied on the following metrics:
Throttled requests in Cosmos DB.
Partition heat map — Maximum consumed RUs per partition.
Messages received versus documents created.
Message lateness.

Next steps
Review performance antipatterns
Performance antipatterns for cloud applications

A performance antipattern is a common practice that is likely to cause scalability problems when an application is
under pressure.
Here is a common scenario: An application behaves well during performance testing. It's released to production,
and begins to handle real workloads. At that point, it starts to perform poorly—rejecting user requests, stalling, or
throwing exceptions. The development team is then faced with two questions:
Why didn't this behavior show up during testing?
How do we fix it?
The answer to the first question is straightforward. It's difficult to simulate real users in a test environment, along
with their behavior patterns and the volumes of work they might perform. The only completely sure way to
understand how a system behaves under load is to observe it in production. To be clear, we aren't suggesting that
you should skip performance testing. Performance tests are crucial for getting baseline performance metrics. But
you must be prepared to observe and correct performance issues when they arise in the live system.
The answer to the second question, how to fix the problem, is less straightforward. Any number of factors might
contribute, and sometimes the problem only manifests under certain circumstances. Instrumentation and logging
are key to finding the root cause, but you also have to know what to look for.
Based on our engagements with Microsoft Azure customers, we've identified some of the most common
performance issues that customers see in production. For each antipattern, we describe why the antipattern
typically occurs, symptoms of the antipattern, and techniques for resolving the problem. We also provide sample
code that illustrates both the antipattern and a suggested solution.
Some of these antipatterns may seem obvious when you read the descriptions, but they occur more often than
you might think. Sometimes an application inherits a design that worked on-premises, but doesn't scale in the
cloud. Or an application might start with a very clean design, but as new features are added, one or more of these
antipatterns creeps in. Regardless, this guide will help you to identify and fix these antipatterns.

Catalog of antipatterns
Here is the list of the antipatterns that we've identified:

ANTIPATTERN              DESCRIPTION

Busy Database            Offloading too much processing to a data store.

Busy Front End           Moving resource-intensive tasks onto background threads.

Chatty I/O               Continually sending many small network requests.

Extraneous Fetching      Retrieving more data than is needed, resulting in unnecessary I/O.

Improper Instantiation   Repeatedly creating and destroying objects that are designed to be shared and reused.

Monolithic Persistence   Using the same data store for data with very different usage patterns.

No Caching               Failing to cache data.

Synchronous I/O          Blocking the calling thread while I/O completes.

Next steps
For more about performance tuning, see Performance tuning a distributed application
Busy Database antipattern

Offloading processing to a database server can cause it to spend a significant proportion of time running code,
rather than responding to requests to store and retrieve data.

Problem description
Many database systems can run code. Examples include stored procedures and triggers. Often, it's more efficient to
perform this processing close to the data, rather than transmitting the data to a client application for processing.
However, overusing these features can hurt performance, for several reasons:
The database server may spend too much time processing, rather than accepting new client requests and
fetching data.
A database is usually a shared resource, so it can become a bottleneck during periods of high use.
Runtime costs may be excessive if the data store is metered. That's particularly true of managed database
services. For example, Azure SQL Database charges for Database Transaction Units (DTUs).
Databases have finite capacity to scale up, and it's not trivial to scale a database horizontally. Therefore, it may
be better to move processing into a compute resource, such as a VM or App Service app, that can easily scale
out.
This antipattern typically occurs because:
The database is viewed as a service rather than a repository. An application might use the database server to
format data (for example, converting to XML), manipulate string data, or perform complex calculations.
Developers try to write queries whose results can be displayed directly to users. For example, a query might
combine fields or format dates, times, and currency according to locale.
Developers are trying to correct the Extraneous Fetching antipattern by pushing computations to the database.
Stored procedures are used to encapsulate business logic, perhaps because they are considered easier to
maintain and update.
The following example retrieves the 20 most valuable orders for a specified sales territory and formats the results
as XML. It uses Transact-SQL functions to parse the data and convert the results to XML. You can find the complete
sample here.
SELECT TOP 20
soh.[SalesOrderNumber] AS '@OrderNumber',
soh.[Status] AS '@Status',
soh.[ShipDate] AS '@ShipDate',
YEAR(soh.[OrderDate]) AS '@OrderDateYear',
MONTH(soh.[OrderDate]) AS '@OrderDateMonth',
soh.[DueDate] AS '@DueDate',
FORMAT(ROUND(soh.[SubTotal],2),'C')
AS '@SubTotal',
FORMAT(ROUND(soh.[TaxAmt],2),'C')
AS '@TaxAmt',
FORMAT(ROUND(soh.[TotalDue],2),'C')
AS '@TotalDue',
CASE WHEN soh.[TotalDue] > 5000 THEN 'Y' ELSE 'N' END
AS '@ReviewRequired',
(
SELECT
c.[AccountNumber] AS '@AccountNumber',
UPPER(LTRIM(RTRIM(REPLACE(
CONCAT( p.[Title], ' ', p.[FirstName], ' ', p.[MiddleName], ' ', p.[LastName], ' ', p.[Suffix]),
'  ', ' ')))) AS '@FullName'
FROM [Sales].[Customer] c
INNER JOIN [Person].[Person] p
ON c.[PersonID] = p.[BusinessEntityID]
WHERE c.[CustomerID] = soh.[CustomerID]
FOR XML PATH ('Customer'), TYPE
),

(
SELECT
sod.[OrderQty] AS '@Quantity',
FORMAT(sod.[UnitPrice],'C')
AS '@UnitPrice',
FORMAT(ROUND(sod.[LineTotal],2),'C')
AS '@LineTotal',
sod.[ProductID] AS '@ProductId',
CASE WHEN (sod.[ProductID] >= 710) AND (sod.[ProductID] <= 720) AND (sod.[OrderQty] >= 5) THEN 'Y' ELSE
'N' END
AS '@InventoryCheckRequired'

FROM [Sales].[SalesOrderDetail] sod


WHERE sod.[SalesOrderID] = soh.[SalesOrderID]
ORDER BY sod.[SalesOrderDetailID]
FOR XML PATH ('LineItem'), TYPE, ROOT('OrderLineItems')
)

FROM [Sales].[SalesOrderHeader] soh


WHERE soh.[TerritoryId] = @TerritoryId
ORDER BY soh.[TotalDue] DESC
FOR XML PATH ('Order'), ROOT('Orders')

Clearly, this is a complex query. As we'll see later, it turns out to use significant processing resources on the database
server.

How to fix the problem


Move processing from the database server into other application tiers. Ideally, you should limit the database to
performing data access operations, using only the capabilities that the database is optimized for, such as
aggregation in an RDBMS.
For example, the previous Transact-SQL code can be replaced with a statement that simply retrieves the data to be
processed.
SELECT
soh.[SalesOrderNumber] AS [OrderNumber],
soh.[Status] AS [Status],
soh.[OrderDate] AS [OrderDate],
soh.[DueDate] AS [DueDate],
soh.[ShipDate] AS [ShipDate],
soh.[SubTotal] AS [SubTotal],
soh.[TaxAmt] AS [TaxAmt],
soh.[TotalDue] AS [TotalDue],
c.[AccountNumber] AS [AccountNumber],
p.[Title] AS [CustomerTitle],
p.[FirstName] AS [CustomerFirstName],
p.[MiddleName] AS [CustomerMiddleName],
p.[LastName] AS [CustomerLastName],
p.[Suffix] AS [CustomerSuffix],
sod.[OrderQty] AS [Quantity],
sod.[UnitPrice] AS [UnitPrice],
sod.[LineTotal] AS [LineTotal],
sod.[ProductID] AS [ProductId]
FROM [Sales].[SalesOrderHeader] soh
INNER JOIN [Sales].[Customer] c ON soh.[CustomerID] = c.[CustomerID]
INNER JOIN [Person].[Person] p ON c.[PersonID] = p.[BusinessEntityID]
INNER JOIN [Sales].[SalesOrderDetail] sod ON soh.[SalesOrderID] = sod.[SalesOrderID]
WHERE soh.[TerritoryId] = @TerritoryId
AND soh.[SalesOrderId] IN (
SELECT TOP 20 SalesOrderId
FROM [Sales].[SalesOrderHeader] soh
WHERE soh.[TerritoryId] = @TerritoryId
ORDER BY soh.[TotalDue] DESC)
ORDER BY soh.[TotalDue] DESC, sod.[SalesOrderDetailID]

The application then uses the .NET Framework System.Xml.Linq APIs to format the results as XML.

// Create a new SqlCommand to run the Transact-SQL query


using (var command = new SqlCommand(...))
{
command.Parameters.AddWithValue("@TerritoryId", id);

// Run the query and create the initial XML document


using (var reader = await command.ExecuteReaderAsync())
{
var lastOrderNumber = string.Empty;
var doc = new XDocument();
var orders = new XElement("Orders");
doc.Add(orders);

XElement lineItems = null;


// Fetch each row in turn, format the results as XML, and add them to the XML document
while (await reader.ReadAsync())
{
var orderNumber = reader["OrderNumber"].ToString();
if (orderNumber != lastOrderNumber)
{
lastOrderNumber = orderNumber;

var order = new XElement("Order");


orders.Add(order);
var customer = new XElement("Customer");
lineItems = new XElement("OrderLineItems");
order.Add(customer, lineItems);

var orderDate = (DateTime)reader["OrderDate"];


var totalDue = (Decimal)reader["TotalDue"];
var reviewRequired = totalDue > 5000 ? 'Y' : 'N';

order.Add(
new XAttribute("OrderNumber", orderNumber),
new XAttribute("Status", reader["Status"]),
new XAttribute("ShipDate", reader["ShipDate"]),
... // More attributes, not shown.

var fullName = string.Join(" ",


reader["CustomerTitle"],
reader["CustomerFirstName"],
reader["CustomerMiddleName"],
reader["CustomerLastName"],
reader["CustomerSuffix"]
)
.Replace("  ", " ") // Remove double spaces
.Trim()
.ToUpper();

customer.Add(
new XAttribute("AccountNumber", reader["AccountNumber"]),
new XAttribute("FullName", fullName));
}

var productId = (int)reader["ProductID"];


var quantity = (short)reader["Quantity"];
var inventoryCheckRequired = (productId >= 710 && productId <= 720 && quantity >= 5) ? 'Y' : 'N';

lineItems.Add(
new XElement("LineItem",
new XAttribute("Quantity", quantity),
new XAttribute("UnitPrice", ((Decimal)reader["UnitPrice"]).ToString("C")),
new XAttribute("LineTotal", RoundAndFormat(reader["LineTotal"])),
new XAttribute("ProductId", productId),
new XAttribute("InventoryCheckRequired", inventoryCheckRequired)
));
}
// Match the exact formatting of the XML returned from SQL
var xml = doc
.ToString(SaveOptions.DisableFormatting)
.Replace(" />", "/>");
}
}

NOTE
This code is somewhat complex. For a new application, you might prefer to use a serialization library. However, the
assumption here is that the development team is refactoring an existing application, so the method needs to return the
exact same format as the original code.

Considerations
Many database systems are highly optimized to perform certain types of data processing, such as
calculating aggregate values over large datasets. Don't move those types of processing out of the database.
Do not relocate processing if doing so causes the database to transfer far more data over the network. See
the Extraneous Fetching antipattern.
If you move processing to an application tier, that tier may need to scale out to handle the additional work.

How to detect the problem


Symptoms of a busy database include a disproportionate decline in throughput and response times in operations
that access the database.
You can perform the following steps to help identify this problem:
1. Use performance monitoring to identify how much time the production system spends performing
database activity.
2. Examine the work performed by the database during these periods.
3. If you suspect that particular operations might cause too much database activity, perform load testing in a
controlled environment. Each test should run a mixture of the suspect operations with a variable user load.
Examine the telemetry from the load tests to observe how the database is used.
4. If the database activity reveals significant processing but little data traffic, review the source code to
determine whether the processing can better be performed elsewhere.
If the volume of database activity is low or response times are relatively fast, then a busy database is unlikely to be
a performance problem.

Example diagnosis
The following sections apply these steps to the sample application described earlier.
Monitor the volume of database activity
The following graph shows the results of running a load test against the sample application, using a step load of up
to 50 concurrent users. The volume of requests quickly reaches a limit and stays at that level, while the average
response time steadily increases. A logarithmic scale is used for those two metrics.

This line graph shows user load, requests per second, and average response time. The graph shows that response
time increases as load increases.
The next graph shows CPU utilization and DTUs as a percentage of service quota. DTUs provide a measure of how
much processing the database performs. The graph shows that CPU and DTU utilization both quickly reached
100%.
This line graph shows CPU percentage and DTU percentage over time. The graph shows that both quickly reach
100%.
Examine the work performed by the database
It could be that the tasks performed by the database are genuine data access operations, rather than processing, so
it is important to understand the SQL statements being run while the database is busy. Monitor the system to
capture the SQL traffic and correlate the SQL operations with application requests.
If the database operations are purely data access operations, without a lot of processing, then the problem might
be Extraneous Fetching.
Implement the solution and verify the result
The following graph shows a load test using the updated code. Throughput is significantly higher, over 400
requests per second versus 12 earlier. The average response time is also much lower, just above 0.1 seconds
compared to over 4 seconds.
This line graph shows user load, requests per second, and average response time. The graph shows that response
time remains roughly constant throughout the load test.
CPU and DTU utilization shows that the system took longer to reach saturation, despite the increased throughput.

This line graph shows CPU percentage and DTU percentage over time. The graph shows that CPU and DTU take
longer to reach 100% than previously.

Related resources
Extraneous Fetching antipattern
Busy Front End antipattern

Performing asynchronous work on a large number of background threads can starve other concurrent foreground
tasks of resources, decreasing response times to unacceptable levels.

Problem description
Resource-intensive tasks can increase the response times for user requests and cause high latency. One way to
improve response times is to offload a resource-intensive task to a separate thread. This approach lets the
application stay responsive while processing happens in the background. However, tasks that run on a background
thread still consume resources. If there are too many of them, they can starve the threads that are handling
requests.

NOTE
The term resource can encompass many things, such as CPU utilization, memory occupancy, and network or disk I/O.

This problem typically occurs when an application is developed as a monolithic piece of code, with all of the business
logic combined into a single tier shared with the presentation layer.
Here's an example using ASP.NET that demonstrates the problem. You can find the complete sample here.

public class WorkInFrontEndController : ApiController
{
    [HttpPost]
    [Route("api/workinfrontend")]
    public HttpResponseMessage Post()
    {
        new Thread(() =>
        {
            //Simulate processing
            Thread.SpinWait(Int32.MaxValue / 100);
        }).Start();

        return Request.CreateResponse(HttpStatusCode.Accepted);
    }
}

public class UserProfileController : ApiController
{
    [HttpGet]
    [Route("api/userprofile/{id}")]
    public UserProfile Get(int id)
    {
        //Simulate processing
        return new UserProfile() { FirstName = "Alton", LastName = "Hudgens" };
    }
}

The Post method in the WorkInFrontEnd controller implements an HTTP POST operation. This operation
simulates a long-running, CPU-intensive task. The work is performed on a separate thread, in an attempt to
enable the POST operation to complete quickly.
The Get method in the UserProfile controller implements an HTTP GET operation. This method is much
less CPU intensive.
The primary concern is the resource requirements of the Post method. Although it puts the work onto a
background thread, the work can still consume considerable CPU resources. These resources are shared with other
operations being performed by other concurrent users. If a moderate number of users send this request at the
same time, overall performance is likely to suffer, slowing down all operations. Users might experience significant
latency in the Get method, for example.

How to fix the problem


Move processes that consume significant resources to a separate back end.
With this approach, the front end puts resource-intensive tasks onto a message queue. The back end picks up the
tasks for asynchronous processing. The queue also acts as a load leveler, buffering requests for the back end. If the
queue length becomes too long, you can configure autoscaling to scale out the back end.
Here is a revised version of the previous code. In this version, the Post method puts a message on a Service Bus
queue.

public class WorkInBackgroundController : ApiController
{
    private static readonly QueueClient QueueClient;
    private static readonly string QueueName;
    private static readonly ServiceBusQueueHandler ServiceBusQueueHandler;

    public WorkInBackgroundController()
    {
        var serviceBusConnectionString = ...;
        QueueName = ...;
        ServiceBusQueueHandler = new ServiceBusQueueHandler(serviceBusConnectionString);
        QueueClient = ServiceBusQueueHandler.GetQueueClientAsync(QueueName).Result;
    }

    [HttpPost]
    [Route("api/workinbackground")]
    public async Task<long> Post()
    {
        return await ServiceBusQueueHandler.AddWorkLoadToQueueAsync(QueueClient, QueueName, 0);
    }
}

The back end pulls messages from the Service Bus queue and does the processing.
public async Task RunAsync(CancellationToken cancellationToken)
{
    this._queueClient.OnMessageAsync(
        // This lambda is invoked for each message received.
        async (receivedMessage) =>
        {
            try
            {
                // Simulate processing of message
                Thread.SpinWait(Int32.MaxValue / 1000);

                await receivedMessage.CompleteAsync();
            }
            catch
            {
                receivedMessage.Abandon();
            }
        });
}
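The sample above uses the older Service Bus SDK. As a hedged illustration of the queue-length signal mentioned earlier, the following sketch reads the active message count with the Azure.Messaging.ServiceBus administration client; it is not part of the sample, and in practice autoscaling is usually driven by the queue metrics surfaced in Azure Monitor rather than by custom code.

using System.Threading.Tasks;
using Azure.Messaging.ServiceBus.Administration;

public static class QueueDepthProbe
{
    // Returns the number of messages waiting in the queue. A backlog that keeps
    // growing is a signal to scale out the back-end workers.
    public static async Task<long> GetBacklogAsync(string connectionString, string queueName)
    {
        var adminClient = new ServiceBusAdministrationClient(connectionString);
        QueueRuntimeProperties properties = await adminClient.GetQueueRuntimePropertiesAsync(queueName);
        return properties.ActiveMessageCount;
    }
}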

Considerations
This approach adds some additional complexity to the application. You must handle queuing and dequeuing
safely to avoid losing requests in the event of a failure.
The application takes a dependency on an additional service for the message queue.
The processing environment must be sufficiently scalable to handle the expected workload and meet the
required throughput targets.
While this approach should improve overall responsiveness, the tasks that are moved to the back end may take
longer to complete.

How to detect the problem


Symptoms of a busy front end include high latency when resource-intensive tasks are being performed. End users
are likely to report extended response times or failures caused by services timing out. These failures could also
return HTTP 500 (Internal Server) errors or HTTP 503 (Service Unavailable) errors. Examine the event logs for the
web server, which are likely to contain more detailed information about the causes and circumstances of the errors.
You can perform the following steps to help identify this problem:
1. Perform process monitoring of the production system, to identify points when response times slow down.
2. Examine the telemetry data captured at these points to determine the mix of operations being performed and
the resources being used.
3. Find any correlations between poor response times and the volumes and combinations of operations that were
happening at those times.
4. Load test each suspected operation to identify which operations are consuming resources and starving other
operations.
5. Review the source code for those operations to determine why they might cause excessive resource
consumption.

Example diagnosis
The following sections apply these steps to the sample application described earlier.
Identify points of slowdown
Instrument each method to track the duration and resources consumed by each request. Then monitor the
application in production. This can provide an overall view of how requests compete with each other. During
periods of stress, slow-running resource-hungry requests will likely affect other operations, and this behavior can
be observed by monitoring the system and noting the drop off in performance.
The following image shows a monitoring dashboard. (We used AppDynamics for our tests.) Initially, the system has
light load. Then users start requesting the UserProfile GET method. The performance is reasonably good until
other users start issuing requests to the WorkInFrontEnd POST method. At that point, response times increase
dramatically (first arrow). Response times only improve after the volume of requests to the WorkInFrontEnd
controller diminishes (second arrow).

Examine telemetry data and find correlations


The next image shows some of the metrics gathered to monitor resource utilization during the same interval. At
first, few users are accessing the system. As more users connect, CPU utilization becomes very high (100%). Also
notice that the network I/O rate initially goes up as CPU usage rises. But once CPU usage peaks, network I/O
actually goes down. That's because the system can only handle a relatively small number of requests once the CPU
is at capacity. As users disconnect, the CPU load tails off.

At this point, it appears the Post method in the WorkInFrontEnd controller is a prime candidate for closer
examination. Further work in a controlled environment is needed to confirm the hypothesis.
Perform load testing
The next step is to perform tests in a controlled environment. For example, run a series of load tests that include
and then omit each request in turn to see the effects.
The graph below shows the results of a load test performed against an identical deployment of the cloud service
used in the previous tests. The test used a constant load of 500 users performing the Get operation in the
UserProfile controller, along with a step load of users performing the Post operation in the WorkInFrontEnd
controller.

Initially, the step load is 0, so the only active users are performing the UserProfile requests. The system is able to
respond to approximately 500 requests per second. After 60 seconds, a load of 100 additional users starts sending
POST requests to the WorkInFrontEnd controller. Almost immediately, the workload sent to the UserProfile
controller drops to about 150 requests per second. This is due to the way the load-test runner functions. It waits for
a response before sending the next request, so the longer it takes to receive a response, the lower the request rate.
As more users send POST requests to the WorkInFrontEnd controller, the response rate of the UserProfile
controller continues to drop. But note that the volume of requests handled by the WorkInFrontEnd controller
remains relatively constant. The saturation of the system becomes apparent as the overall rate of both requests
tends toward a steady but low limit.
Review the source code
The final step is to look at the source code. The development team was aware that the Post method could take a
considerable amount of time, which is why the original implementation used a separate thread. That solved the
immediate problem, because the Post method did not block waiting for a long-running task to complete.
However, the work performed by this method still consumes CPU, memory, and other resources. Enabling this
process to run asynchronously might actually damage performance, as users can trigger a large number of these
operations simultaneously, in an uncontrolled manner. There is a limit to the number of threads that a server can
run. Past this limit, the application is likely to get an exception when it tries to start a new thread.
NOTE
This doesn't mean you should avoid asynchronous operations. Performing an asynchronous await on a network call is a
recommended practice. (See the Synchronous I/O antipattern.) The problem here is that CPU-intensive work was spawned on
another thread.

Implement the solution and verify the result


The following image shows performance monitoring after the solution was implemented. The load was similar to
that shown earlier, but the response times for the UserProfile controller are now much faster. The volume of
requests increased over the same duration, from 2,759 to 23,565.

Note that the WorkInBackground controller also handled a much larger volume of requests. However, you can't
make a direct comparison in this case, because the work being performed in this controller is very different from
the original code. The new version simply queues a request, rather than performing a time consuming calculation.
The main point is that this method no longer drags down the entire system under load.
CPU and network utilization also show the improved performance. The CPU utilization never reached 100%, and
the volume of handled network requests was far greater than earlier, and did not tail off until the workload
dropped.

The following graph shows the results of a load test. The overall volume of requests serviced is greatly improved
compared to the earlier tests.
Related guidance
Autoscaling best practices
Background jobs best practices
Queue-Based Load Leveling pattern
Web Queue Worker architecture style
Chatty I/O antipattern

The cumulative effect of a large number of I/O requests can have a significant impact on performance and
responsiveness.

Problem description
Network calls and other I/O operations are inherently slow compared to compute tasks. Each I/O request typically
has significant overhead, and the cumulative effect of numerous I/O operations can slow down the system. Here
are some common causes of chatty I/O.
Reading and writing individual records to a database as distinct requests
The following example reads from a database of products. There are three tables, Product , ProductSubcategory ,
and ProductPriceListHistory . The code retrieves all of the products in a subcategory, along with the pricing
information, by executing a series of queries:
1. Query the subcategory from the ProductSubcategory table.
2. Find all products in that subcategory by querying the Product table.
3. For each product, query the pricing data from the ProductPriceListHistory table.
The application uses Entity Framework to query the database. You can find the complete sample here.

public async Task<IHttpActionResult> GetProductsInSubCategoryAsync(int subcategoryId)
{
    using (var context = GetContext())
    {
        // Get product subcategory.
        var productSubcategory = await context.ProductSubcategories
            .Where(psc => psc.ProductSubcategoryId == subcategoryId)
            .FirstOrDefaultAsync();

        // Find products in that category.
        productSubcategory.Product = await context.Products
            .Where(p => subcategoryId == p.ProductSubcategoryId)
            .ToListAsync();

        // Find price history for each product.
        foreach (var prod in productSubcategory.Product)
        {
            int productId = prod.ProductId;
            var productListPriceHistory = await context.ProductListPriceHistory
                .Where(pl => pl.ProductId == productId)
                .ToListAsync();
            prod.ProductListPriceHistory = productListPriceHistory;
        }
        return Ok(productSubcategory);
    }
}

This example shows the problem explicitly, but sometimes an O/RM can mask the problem, if it implicitly fetches
child records one at a time. This is known as the "N+1 problem".
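For example, if lazy loading is enabled, touching a navigation property inside a loop quietly issues one query per entity. The following sketch is illustrative only and is not part of the sample code; it reuses the entity names from the example above.

public async Task<List<Product>> GetProductsWithHiddenQueriesAsync(int subcategoryId)
{
    using (var context = GetContext())
    {
        var products = await context.Products
            .Where(p => p.ProductSubcategoryId == subcategoryId)
            .ToListAsync();

        foreach (var prod in products)
        {
            // With lazy loading enabled, this navigation property access issues another
            // SQL query - one per product, on top of the initial query (the "N+1" pattern).
            var priceHistory = prod.ProductListPriceHistory.ToList();
        }

        return products;
    }
}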
Implementing a single logical operation as a series of HTTP requests
This often happens when developers try to follow an object-oriented paradigm, and treat remote objects as if they
were local objects in memory. This can result in too many network round trips. For example, the following web API
exposes the individual properties of User objects through individual HTTP GET methods.

public class UserController : ApiController


{
[HttpGet]
[Route("users/{id:int}/username")]
public HttpResponseMessage GetUserName(int id)
{
...
}

[HttpGet]
[Route("users/{id:int}/gender")]
public HttpResponseMessage GetGender(int id)
{
...
}

[HttpGet]
[Route("users/{id:int}/dateofbirth")]
public HttpResponseMessage GetDateOfBirth(int id)
{
...
}
}

While there's nothing technically wrong with this approach, most clients will probably need to get several
properties for each User , resulting in client code like the following.

HttpResponseMessage response = await client.GetAsync("users/1/username");
response.EnsureSuccessStatusCode();
var userName = await response.Content.ReadAsStringAsync();

response = await client.GetAsync("users/1/gender");
response.EnsureSuccessStatusCode();
var gender = await response.Content.ReadAsStringAsync();

response = await client.GetAsync("users/1/dateofbirth");
response.EnsureSuccessStatusCode();
var dob = await response.Content.ReadAsStringAsync();

Reading and writing to a file on disk


File I/O involves opening a file and moving to the appropriate point before reading or writing data. When the
operation is complete, the file might be closed to save operating system resources. An application that continually
reads and writes small amounts of information to a file will generate significant I/O overhead. Small write requests
can also lead to file fragmentation, slowing subsequent I/O operations still further.
The following example uses a FileStream to write a Customer object to a file. Creating the FileStream opens the
file, and disposing it closes the file. (The using statement automatically disposes the FileStream object.) If the
application calls this method repeatedly as new customers are added, the I/O overhead can accumulate quickly.
private async Task SaveCustomerToFileAsync(Customer customer)
{
using (Stream fileStream = new FileStream(CustomersFileName, FileMode.Append))
{
BinaryFormatter formatter = new BinaryFormatter();
byte [] data = null;
using (MemoryStream memStream = new MemoryStream())
{
formatter.Serialize(memStream, customer);
data = memStream.ToArray();
}
await fileStream.WriteAsync(data, 0, data.Length);
}
}

How to fix the problem


Reduce the number of I/O requests by packaging the data into larger, fewer requests.
Fetch data from a database as a single query, instead of several smaller queries. Here's a revised version of the
code that retrieves product information.

public async Task<IHttpActionResult> GetProductCategoryDetailsAsync(int subCategoryId)
{
    using (var context = GetContext())
    {
        var subCategory = await context.ProductSubcategories
            .Where(psc => psc.ProductSubcategoryId == subCategoryId)
            .Include("Product.ProductListPriceHistory")
            .FirstOrDefaultAsync();

        if (subCategory == null)
            return NotFound();

        return Ok(subCategory);
    }
}

Follow REST design principles for web APIs. Here's a revised version of the web API from the earlier example.
Instead of separate GET methods for each property, there is a single GET method that returns the User . This
results in a larger response body per request, but each client is likely to make fewer API calls.

public class UserController : ApiController
{
    [HttpGet]
    [Route("users/{id:int}")]
    public HttpResponseMessage GetUser(int id)
    {
        ...
    }
}

// Client code
HttpResponseMessage response = await client.GetAsync("users/1");
response.EnsureSuccessStatusCode();
var user = await response.Content.ReadAsStringAsync();

For file I/O, consider buffering data in memory and then writing the buffered data to a file as a single operation.
This approach reduces the overhead from repeatedly opening and closing the file, and helps to reduce
fragmentation of the file on disk.
// Save a list of customer objects to a file
private async Task SaveCustomerListToFileAsync(List<Customer> customers)
{
using (Stream fileStream = new FileStream(CustomersFileName, FileMode.Append))
{
BinaryFormatter formatter = new BinaryFormatter();
foreach (var customer in customers)
{
byte[] data = null;
using (MemoryStream memStream = new MemoryStream())
{
formatter.Serialize(memStream, customer);
data = memStream.ToArray();
}
await fileStream.WriteAsync(data, 0, data.Length);
}
}
}

// In-memory buffer for customers.
List<Customer> customers = new List<Customer>();

// Create a new customer and add it to the buffer
var customer = new Customer(...);
customers.Add(customer);

// Add more customers to the list as they are created
...

// Save the contents of the list, writing all customers in a single operation
await SaveCustomerListToFileAsync(customers);

Considerations
The first two examples make fewer I/O calls, but each one retrieves more information. You must consider
the tradeoff between these two factors. The right answer will depend on the actual usage patterns. For
example, in the web API example, it might turn out that clients often need just the user name. In that case, it
might make sense to expose it as a separate API call. For more information, see the Extraneous Fetching
antipattern.
When reading data, do not make your I/O requests too large. An application should only retrieve the
information that it is likely to use.
Sometimes it helps to partition the information for an object into two chunks, frequently accessed data that
accounts for most requests, and less frequently accessed data that is used rarely. Often the most frequently
accessed data is a relatively small portion of the total data for an object, so returning just that portion can
save significant I/O overhead.
When writing data, avoid locking resources for longer than necessary, to reduce the chances of contention
during a lengthy operation. If a write operation spans multiple data stores, files, or services, then adopt an
eventually consistent approach. See Data Consistency guidance.
If you buffer data in memory before writing it, the data is vulnerable if the process crashes. If the data rate
typically has bursts or is relatively sparse, it may be safer to buffer the data in an external durable queue
such as Event Hubs (see the sketch after this list).
Consider caching data that you retrieve from a service or a database. This can help to reduce the volume of
I/O by avoiding repeated requests for the same data. For more information, see Caching best practices.
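To illustrate the durable-buffer option above, the following sketch publishes buffered records to Event Hubs in a single batched call using the Azure.Messaging.EventHubs client. The names and payload shape are assumptions; this is not part of the chatty I/O sample.

using System.Collections.Generic;
using System.Text.Json;
using System.Threading.Tasks;
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Producer;

public static class DurableBuffer
{
    // Sends the buffered customer records to Event Hubs in one batched call, so the
    // data survives a process crash without issuing one I/O operation per record.
    public static async Task FlushAsync(string connectionString, string eventHubName, IEnumerable<Customer> customers)
    {
        await using var producer = new EventHubProducerClient(connectionString, eventHubName);
        using EventDataBatch batch = await producer.CreateBatchAsync();

        foreach (var customer in customers)
        {
            var payload = JsonSerializer.SerializeToUtf8Bytes(customer);
            if (!batch.TryAdd(new EventData(payload)))
            {
                // Batch is full; a production version would send it and start a new batch.
                break;
            }
        }

        await producer.SendAsync(batch);
    }
}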

How to detect the problem


Symptoms of chatty I/O include high latency and low throughput. End users are likely to report extended response
times or failures caused by services timing out, due to increased contention for I/O resources.
You can perform the following steps to help identify the causes of any problems:
1. Perform process monitoring of the production system to identify operations with poor response times.
2. Perform load testing of each operation identified in the previous step.
3. During the load tests, gather telemetry data about the data access requests made by each operation.
4. Gather detailed statistics for each request sent to a data store.
5. Profile the application in the test environment to establish where possible I/O bottlenecks might be occurring.
Look for any of these symptoms:
A large number of small I/O requests made to the same file.
A large number of small network requests made by an application instance to the same service.
A large number of small requests made by an application instance to the same data store.
Applications and services becoming I/O bound.

Example diagnosis
The following sections apply these steps to the example shown earlier that queries a database.
Load test the application
This graph shows the results of load testing. Median response time is measured in tens of seconds per request.
The graph shows very high latency. With a load of 1000 users, a user might have to wait for nearly a minute to see
the results of a query.

NOTE
The application was deployed as an Azure App Service web app, using Azure SQL Database. The load test used a simulated
step workload of up to 1000 concurrent users. The database was configured with a connection pool supporting up to 1000
concurrent connections, to reduce the chance that contention for connections would affect the results.

Monitor the application


You can use an application performance monitoring (APM) package to capture and analyze the key metrics that
might identify chatty I/O. Which metrics are important will depend on the I/O workload. For this example, the
interesting I/O requests were the database queries.
The following image shows results generated using New Relic APM. The average database response time peaked
at approximately 5.6 seconds per request during the maximum workload. The system was able to support an
average of 410 requests per minute throughout the test.

Gather detailed data access information


Digging deeper into the monitoring data shows the application executes three different SQL SELECT statements.
These correspond to the requests generated by Entity Framework to fetch data from the ProductListPriceHistory ,
Product , and ProductSubcategory tables. Furthermore, the query that retrieves data from the
ProductListPriceHistory table is by far the most frequently executed SELECT statement, by an order of magnitude.
It turns out that the GetProductsInSubCategoryAsync method, shown earlier, performs 45 SELECT queries. Each
query causes the application to open a new SQL connection.
NOTE
This image shows trace information for the slowest instance of the GetProductsInSubCategoryAsync operation in the load
test. In a production environment, it's useful to examine traces of the slowest instances, to see if there is a pattern that
suggests a problem. If you just look at the average values, you might overlook problems that will get dramatically worse
under load.

The next image shows the actual SQL statements that were issued. The query that fetches price information is run
for each individual product in the product subcategory. Using a join would considerably reduce the number of
database calls.

If you are using an O/RM, such as Entity Framework, tracing the SQL queries can provide insight into how the
O/RM translates programmatic calls into SQL statements, and indicate areas where data access might be
optimized.
Implement the solution and verify the result
Rewriting the call to Entity Framework produced the following results.
This load test was performed on the same deployment, using the same load profile. This time the graph shows
much lower latency. The average request time at 1000 users is between 5 and 6 seconds, down from nearly a
minute.
This time the system supported an average of 3,970 requests per minute, compared to 410 for the earlier test.

Tracing the SQL statement shows that all the data is fetched in a single SELECT statement. Although this query is
considerably more complex, it is performed only once per operation. And while complex joins can become
expensive, relational database systems are optimized for this type of query.
Related resources
API Design best practices
Caching best practices
Data Consistency Primer
Extraneous Fetching antipattern
No Caching antipattern
Extraneous Fetching antipattern

Retrieving more data than needed for a business operation can result in unnecessary I/O overhead and reduce
responsiveness.

Problem description
This antipattern can occur if the application tries to minimize I/O requests by retrieving all of the data that it
might need. This is often a result of overcompensating for the Chatty I/O antipattern. For example, an application
might fetch the details for every product in a database. But the user may need just a subset of the details (some
may not be relevant to customers), and probably doesn't need to see all of the products at once. Even if the user
is browsing the entire catalog, it would make sense to paginate the results—showing 20 at a time, for example.
Another source of this problem is following poor programming or design practices. For example, the following
code uses Entity Framework to fetch the complete details for every product. Then it filters the results to return
only a subset of the fields, discarding the rest. You can find the complete sample here.

public async Task<IHttpActionResult> GetAllFieldsAsync()
{
    using (var context = new AdventureWorksContext())
    {
        // Execute the query. This happens at the database.
        var products = await context.Products.ToListAsync();

        // Project fields from the query results. This happens in application memory.
        var result = products.Select(p => new ProductInfo { Id = p.ProductId, Name = p.Name });
        return Ok(result);
    }
}

In the next example, the application retrieves data to perform an aggregation that could be done by the database
instead. The application calculates total sales by getting every record for all orders sold, and then computing the
sum over those records. You can find the complete sample here.

public async Task<IHttpActionResult> AggregateOnClientAsync()
{
    using (var context = new AdventureWorksContext())
    {
        // Fetch all order totals from the database.
        var orderAmounts = await context.SalesOrderHeaders.Select(soh => soh.TotalDue).ToListAsync();

        // Sum the order totals in memory.
        var total = orderAmounts.Sum();
        return Ok(total);
    }
}

The next example shows a subtle problem caused by the way Entity Framework uses LINQ to Entities.
var query = from p in context.Products.AsEnumerable()
where p.SellStartDate < DateTime.Now.AddDays(-7) // AddDays cannot be mapped by LINQ to Entities
select ...;

List<Product> products = query.ToList();

The application is trying to find products with a SellStartDate more than a week old. In most cases, LINQ to
Entities would translate a where clause to a SQL statement that is executed by the database. In this case, however,
LINQ to Entities cannot map the AddDays method to SQL. Instead, every row from the Product table is returned,
and the results are filtered in memory.
The call to AsEnumerable is a hint that there is a problem. This method converts the results to an IEnumerable
interface. Although IEnumerable supports filtering, the filtering is done on the client side, not the database. By
default, LINQ to Entities uses IQueryable , which passes the responsibility for filtering to the data source.

How to fix the problem


Avoid fetching large volumes of data that may quickly become outdated or might be discarded, and only fetch
the data needed for the operation being performed.
Instead of getting every column from a table and then filtering them, select the columns that you need from the
database.

public async Task<IHttpActionResult> GetRequiredFieldsAsync()
{
    using (var context = new AdventureWorksContext())
    {
        // Project fields as part of the query itself
        var result = await context.Products
            .Select(p => new ProductInfo {Id = p.ProductId, Name = p.Name})
            .ToListAsync();
        return Ok(result);
    }
}

Similarly, perform aggregation in the database and not in application memory.

public async Task<IHttpActionResult> AggregateOnDatabaseAsync()
{
    using (var context = new AdventureWorksContext())
    {
        // Sum the order totals as part of the database query.
        var total = await context.SalesOrderHeaders.SumAsync(soh => soh.TotalDue);
        return Ok(total);
    }
}

When using Entity Framework, ensure that LINQ queries are resolved using the IQueryable interface and not
IEnumerable . You may need to adjust the query to use only functions that can be mapped to the data source. The
earlier example can be refactored to remove the AddDays method from the query, allowing filtering to be done
by the database.
DateTime dateSince = DateTime.Now.AddDays(-7); // AddDays has been factored out.
var query = from p in context.Products
            where p.SellStartDate < dateSince // This criterion can be passed to the database by LINQ to Entities
            select ...;

List<Product> products = query.ToList();

Considerations
In some cases, you can improve performance by partitioning data horizontally. If different operations
access different attributes of the data, horizontal partitioning may reduce contention. Often, most
operations are run against a small subset of the data, so spreading this load may improve performance.
See Data partitioning.
For operations that have to support unbounded queries, implement pagination and only fetch a limited
number of entities at a time. For example, if a customer is browsing a product catalog, you can show one
page of results at a time. (A sketch of this approach follows this list.)
When possible, take advantage of features built into the data store. For example, SQL databases typically
provide aggregate functions.
If you're using a data store that doesn't support a particular function, such as aggregation, you could store
the calculated result elsewhere, updating the value as records are added or updated, so the application
doesn't have to recalculate the value each time it's needed.
If you see that requests are retrieving a large number of fields, examine the source code to determine
whether all of these fields are necessary. Sometimes these requests are the result of a poorly designed
SELECT * query.

Similarly, requests that retrieve a large number of entities may be a sign that the application is not filtering
data correctly. Verify that all of these entities are needed. Use database-side filtering if possible, for
example, by using WHERE clauses in SQL.
Offloading processing to the database is not always the best option. Only use this strategy when the
database is designed or optimized to do so. Most database systems are highly optimized for certain
functions, but are not designed to act as general-purpose application engines. For more information, see
the Busy Database antipattern.
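As a minimal sketch of the pagination point in the list above, reusing the Entity Framework context from the earlier examples (the paging parameters and ordering column are illustrative):

public async Task<IHttpActionResult> GetProductPageAsync(int pageIndex, int pageSize = 20)
{
    using (var context = new AdventureWorksContext())
    {
        // Order, skip, and take inside the query so the database returns only one page.
        var page = await context.Products
            .OrderBy(p => p.ProductId)
            .Skip(pageIndex * pageSize)
            .Take(pageSize)
            .Select(p => new ProductInfo { Id = p.ProductId, Name = p.Name })
            .ToListAsync();

        return Ok(page);
    }
}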

How to detect the problem


Symptoms of extraneous fetching include high latency and low throughput. If the data is retrieved from a data
store, increased contention is also probable. End users are likely to report extended response times or failures
caused by services timing out. These failures could return HTTP 500 (Internal Server) errors or HTTP 503 (Service
Unavailable) errors. Examine the event logs for the web server, which likely contain more detailed information
about the causes and circumstances of the errors.
The symptoms of this antipattern and some of the telemetry obtained might be very similar to those of the
Monolithic Persistence antipattern.
You can perform the following steps to help identify the cause:
1. Identify slow workloads or transactions by performing load-testing, process monitoring, or other methods of
capturing instrumentation data.
2. Observe any behavioral patterns exhibited by the system. Are there particular limits in terms of transactions
per second or volume of users?
3. Correlate the instances of slow workloads with behavioral patterns.
4. Identify the data stores being used. For each data source, run lower-level telemetry to observe the behavior of
operations.
5. Identify any slow-running queries that reference these data sources.
6. Perform a resource-specific analysis of the slow-running queries and ascertain how the data is used and
consumed.
Look for any of these symptoms:
Frequent, large I/O requests made to the same resource or data store.
Contention in a shared resource or data store.
An operation that frequently receives large volumes of data over the network.
Applications and services spending significant time waiting for I/O to complete.

Example diagnosis
The following sections apply these steps to the previous examples.
Identify slow workloads
This graph shows performance results from a load test that simulated up to 400 concurrent users running the
GetAllFieldsAsync method shown earlier. Throughput diminishes slowly as the load increases. Average response
time goes up as the workload increases.

A load test for the AggregateOnClientAsync operation shows a similar pattern. The volume of requests is
reasonably stable. The average response time increases with the workload, although more slowly than the
previous graph.
Correlate slow workloads with behavioral patterns
Any correlation between regular periods of high usage and slowing performance can indicate areas of concern.
Closely examine the performance profile of functionality that is suspected to be slow running, to determine
whether it matches the load testing performed earlier.
Load test the same functionality using step-based user loads, to find the point where performance drops
significantly or fails completely. If that point falls within the bounds of your expected real-world usage, examine
how the functionality is implemented.
A slow operation is not necessarily a problem, if it is not being performed when the system is under stress, is not
time critical, and does not negatively affect the performance of other important operations. For example,
generating monthly operational statistics might be a long-running operation, but it can probably be performed
as a batch process and run as a low-priority job. On the other hand, customers querying the product catalog is a
critical business operation. Focus on the telemetry generated by these critical operations to see how the
performance varies during periods of high usage.
Identify data sources in slow workloads
If you suspect that a service is performing poorly because of the way it retrieves data, investigate how the
application interacts with the repositories it uses. Monitor the live system to see which sources are accessed
during periods of poor performance.
For each data source, instrument the system to capture the following:
The frequency that each data store is accessed.
The volume of data entering and exiting the data store.
The timing of these operations, especially the latency of requests.
The nature and rate of any errors that occur while accessing each data store under typical load.
Compare this information against the volume of data being returned by the application to the client. Track the
ratio of the volume of data returned by the data store against the volume of data returned to the client. If there is
any large disparity, investigate to determine whether the application is fetching data that it doesn't need.
You may be able to capture this data by observing the live system and tracing the lifecycle of each user request,
or you can model a series of synthetic workloads and run them against a test system.
The following graphs show telemetry captured using New Relic APM during a load test of the GetAllFieldsAsync
method. Note the difference between the volumes of data received from the database and the corresponding
HTTP responses.

For each request, the database returned 80,503 bytes, but the response to the client only contained 19,855 bytes,
about 25% of the size of the database response. The size of the data returned to the client can vary depending on
the format. For this load test, the client requested JSON data. Separate testing using XML (not shown) had a
response size of 35,655 bytes, or 44% of the size of the database response.
The load test for the AggregateOnClientAsync method shows more extreme results. In this case, each test
performed a query that retrieved over 280 Kb of data from the database, but the JSON response was a mere 14
bytes. The wide disparity is because the method calculates an aggregated result from a large volume of data.
Identify and analyze slow queries
Look for database queries that consume the most resources and take the most time to execute. You can add
instrumentation to find the start and completion times for many database operations. Many data stores also
provide in-depth information on how queries are performed and optimized. For example, the Query Performance
pane in the Azure SQL Database management portal lets you select a query and view detailed runtime
performance information. Here is the query generated by the GetAllFieldsAsync operation:
Implement the solution and verify the result
After changing the GetRequiredFieldsAsync method to use a SELECT statement on the database side, load testing
showed the following results.
This load test used the same deployment and the same simulated workload of 400 concurrent users as before.
The graph shows much lower latency. Response time rises with load to approximately 1.3 seconds, compared to 4
seconds in the previous case. The throughput is also higher at 350 requests per second compared to 100 earlier.
The volume of data retrieved from the database now closely matches the size of the HTTP response messages.

Load testing using the AggregateOnDatabaseAsync method generates the following results:
The average response time is now minimal. This is an order of magnitude improvement in performance, caused
primarily by the large reduction in I/O from the database.
Here is the corresponding telemetry for the AggregateOnDatabaseAsync method. The amount of data retrieved
from the database was vastly reduced, from over 280 KB per transaction to 53 bytes. As a result, the maximum
sustained number of requests per minute was raised from around 2,000 to over 25,000.
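The corresponding change pushes the aggregation into the database. The sketch below is illustrative, and the entity and column names are assumptions, but it shows the shape of a query in which only the aggregated value crosses the network.

// Illustrative sketch (entity and column names are assumed): let the database compute
// the aggregate so that only a single value is returned to the web tier.
public async Task<decimal> AggregateOnDatabaseAsync()
{
    using (var context = new AdventureWorksContext())
    {
        return await context.SalesOrderDetails
            .SumAsync(d => d.LineTotal)
            .ConfigureAwait(false);
    }
}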
Related resources
Busy Database antipattern
Chatty I/O antipattern
Data partitioning best practices
Improper Instantiation antipattern

It can hurt performance to continually create new instances of an object that is meant to be created once and then
shared.

Problem description
Many libraries provide abstractions of external resources. Internally, these classes typically manage their own
connections to the resource, acting as brokers that clients can use to access the resource. Here are some examples
of broker classes that are relevant to Azure applications:
System.Net.Http.HttpClient . Communicates with a web service using HTTP.
Microsoft.ServiceBus.Messaging.QueueClient . Posts and receives messages to a Service Bus queue.
Microsoft.Azure.Documents.Client.DocumentClient . Connects to a Cosmos DB instance.
StackExchange.Redis.ConnectionMultiplexer . Connects to Redis, including Azure Cache for Redis.

These classes are intended to be instantiated once and reused throughout the lifetime of an application. However,
it's a common misunderstanding that these classes should be acquired only as necessary and released quickly. (The
ones listed here happen to be .NET libraries, but the pattern is not unique to .NET.) The following ASP.NET example
creates an instance of HttpClient to communicate with a remote service. You can find the complete sample here.

public class NewHttpClientInstancePerRequestController : ApiController
{
    // This method creates a new instance of HttpClient and disposes it for every call to GetProductAsync.
    public async Task<Product> GetProductAsync(string id)
    {
        using (var httpClient = new HttpClient())
        {
            var hostName = HttpContext.Current.Request.Url.Host;
            var result = await httpClient.GetStringAsync(string.Format("http://{0}:8080/api/...", hostName));
            return new Product { Name = result };
        }
    }
}

In a web application, this technique is not scalable. A new HttpClient object is created for each user request. Under
heavy load, the web server may exhaust the number of available sockets, resulting in SocketException errors.
This problem is not restricted to the HttpClient class. Other classes that wrap resources or are expensive to create
might cause similar issues. The following example creates an instance of the ExpensiveToCreateService class. Here
the issue is not necessarily socket exhaustion, but simply how long it takes to create each instance. Continually
creating and destroying instances of this class might adversely affect the scalability of the system.
public class NewServiceInstancePerRequestController : ApiController
{
    public async Task<Product> GetProductAsync(string id)
    {
        var expensiveToCreateService = new ExpensiveToCreateService();
        return await expensiveToCreateService.GetProductByIdAsync(id);
    }
}

public class ExpensiveToCreateService
{
    public ExpensiveToCreateService()
    {
        // Simulate delay due to setup and configuration of ExpensiveToCreateService
        Thread.SpinWait(Int32.MaxValue / 100);
    }
    ...
}

How to fix the problem


If the class that wraps the external resource is shareable and thread-safe, create a shared singleton instance or a
pool of reusable instances of the class.
The following example uses a static HttpClient instance, thus sharing the connection across all requests.

public class SingleHttpClientInstanceController : ApiController
{
    private static readonly HttpClient httpClient;

    static SingleHttpClientInstanceController()
    {
        httpClient = new HttpClient();
    }

    // This method uses the shared instance of HttpClient for every call to GetProductAsync.
    public async Task<Product> GetProductAsync(string id)
    {
        var hostName = HttpContext.Current.Request.Url.Host;
        var result = await httpClient.GetStringAsync(string.Format("http://{0}:8080/api/...", hostName));
        return new Product { Name = result };
    }
}
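Where a type supports pooling rather than a single shared instance, a simple pool can serve the same purpose. The following is only a minimal sketch that reuses the ExpensiveToCreateService class from the earlier example; a production pool would normally bound its size and handle disposal.

using System.Collections.Concurrent;
using System.Threading.Tasks;

// Minimal sketch of a pool of reusable ExpensiveToCreateService instances.
public class ExpensiveToCreateServicePool
{
    private readonly ConcurrentBag<ExpensiveToCreateService> pool =
        new ConcurrentBag<ExpensiveToCreateService>();

    public async Task<Product> GetProductAsync(string id)
    {
        // Take an existing instance if one is available; otherwise create one.
        if (!pool.TryTake(out var service))
        {
            service = new ExpensiveToCreateService();
        }

        try
        {
            return await service.GetProductByIdAsync(id);
        }
        finally
        {
            // Return the instance so later requests can reuse it.
            pool.Add(service);
        }
    }
}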

Considerations
The key element of this antipattern is repeatedly creating and destroying instances of a shareable object. If a
class is not shareable (not thread-safe), then this antipattern does not apply.
The type of shared resource might dictate whether you should use a singleton or create a pool. The
HttpClient class is designed to be shared rather than pooled. Other objects might support pooling,
enabling the system to spread the workload across multiple instances.
Objects that you share across multiple requests must be thread-safe. The HttpClient class is designed to be
used in this manner, but other classes might not support concurrent requests, so check the available
documentation.
Be careful about setting properties on shared objects, as this can lead to race conditions. For example,
setting DefaultRequestHeaders on the HttpClient class before each request can create a race condition. Set
such properties once (for example, during startup), and create separate instances if you need to configure
different settings.
Some resource types are scarce and should not be held onto. Database connections are an example. Holding
an open database connection that is not required may prevent other concurrent users from gaining access
to the database.
In the .NET Framework, many objects that establish connections to external resources are created by using
static factory methods of other classes that manage these connections. These objects are intended to be
saved and reused, rather than disposed and re-created. For example, in Azure Service Bus, the QueueClient
object is created through a MessagingFactory object. Internally, the MessagingFactory manages connections.
For more information, see Best Practices for performance improvements using Service Bus Messaging.
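For example, a minimal sketch of that pattern with the classic Microsoft.ServiceBus.Messaging library could look like the following; the connection string and queue name are placeholders, and the factory and client are created once and then reused.

using Microsoft.ServiceBus.Messaging;

// Sketch only: create the MessagingFactory and QueueClient once and reuse them for
// the lifetime of the application. The connection string and queue name are placeholders.
public static class ServiceBusClients
{
    private static readonly MessagingFactory factory =
        MessagingFactory.CreateFromConnectionString("<service-bus-connection-string>");

    public static readonly QueueClient OrdersQueue = factory.CreateQueueClient("orders");
}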

How to detect the problem


Symptoms of this problem include a drop in throughput or an increased error rate, along with one or more of the
following:
An increase in exceptions that indicate exhaustion of resources such as sockets, database connections, file
handles, and so on.
Increased memory use and garbage collection.
An increase in network, disk, or database activity.
You can perform the following steps to help identify this problem:
1. Perform process monitoring of the production system, to identify points when response times slow down or
the system fails due to lack of resources.
2. Examine the telemetry data captured at these points to determine which operations might be creating and
destroying resource-consuming objects.
3. Load test each suspected operation, in a controlled test environment rather than the production system.
4. Review the source code and examine how broker objects are managed.
Look at stack traces for operations that are slow-running or that generate exceptions when the system is under
load. This information can help to identify how these operations are using resources. Exceptions can help to
determine whether errors are caused by shared resources being exhausted.

Example diagnosis
The following sections apply these steps to the sample application described earlier.
Identify points of slowdown or failure
The following image shows results generated using New Relic APM, showing operations that have a poor response
time. In this case, the GetProductAsync method in the NewHttpClientInstancePerRequest controller is worth
investigating further. Notice that the error rate also increases when these operations are running.
Examine telemetry data and find correlations
The next image shows data captured using thread profiling, over the same period as the previous
image. The system spends a significant time opening socket connections, and even more time closing them and
handling socket exceptions.

Perform load testing


Use load testing to simulate the typical operations that users might perform. This can help to identify which parts
of a system suffer from resource exhaustion under varying loads. Perform these tests in a controlled environment
rather than the production system. The following graph shows the throughput of requests handled by the
NewHttpClientInstancePerRequest controller as the user load increases to 100 concurrent users.
At first, the volume of requests handled per second increases as the workload increases. At about 30 users,
however, the volume of successful requests reaches a limit, and the system starts to generate exceptions. From
then on, the volume of exceptions gradually increases with the user load.
The load test reported these failures as HTTP 500 (Internal Server Error) responses. Reviewing the telemetry showed that
these errors were caused by the system running out of socket resources, as more and more HttpClient objects
were created.
The next graph shows a similar test for a controller that creates the custom ExpensiveToCreateService object.
This time, the controller does not generate any exceptions, but throughput still reaches a plateau, while the average
response time increases by a factor of 20. (The graph uses a logarithmic scale for response time and throughput.)
Telemetry showed that creating new instances of the ExpensiveToCreateService was the main cause of the problem.
Implement the solution and verify the result
After switching the GetProductAsync method to share a single HttpClient instance, a second load test showed
improved performance. No errors were reported, and the system was able to handle an increasing load of up to
500 requests per second. The average response time was cut in half, compared with the previous test.

For comparison, the following image shows the stack trace telemetry. This time, the system spends most of its time
performing real work, rather than opening and closing sockets.

The next graph shows a similar load test using a shared instance of the ExpensiveToCreateService object. Again, the
volume of handled requests increases in line with the user load, while the average response time remains low.
Monolithic Persistence antipattern

Putting all of an application's data into a single data store can hurt performance, either because it leads to resource
contention, or because the data store is not a good fit for some of the data.

Problem description
Historically, applications have often used a single data store, regardless of the different types of data that the
application might need to store. Usually this was done to simplify the application design, or else to match the
existing skill set of the development team.
Modern cloud-based systems often have additional functional and nonfunctional requirements, and need to store
many heterogeneous types of data, such as documents, images, cached data, queued messages, application logs,
and telemetry. Following the traditional approach and putting all of this information into the same data store can
hurt performance, for two main reasons:
Storing and retrieving large amounts of unrelated data in the same data store can cause contention, which in
turn leads to slow response times and connection failures.
Whichever data store is chosen, it might not be the best fit for all of the different types of data, or it might not
be optimized for the operations that the application performs.
The following example shows an ASP.NET Web API controller that adds a new record to a database and also
records the result to a log. The log is held in the same database as the business data. You can find the complete
sample here.

public class MonoController : ApiController
{
    private static readonly string ProductionDb = ...;

    public async Task<IHttpActionResult> PostAsync([FromBody]string value)
    {
        await DataAccess.InsertPurchaseOrderHeaderAsync(ProductionDb);
        await DataAccess.LogAsync(ProductionDb, LogTableName);
        return Ok();
    }
}

The rate at which log records are generated will probably affect the performance of the business operations. And if
another component, such as an application process monitor, regularly reads and processes the log data, that can
also affect the business operations.

How to fix the problem


Separate data according to its use. For each data set, select a data store that best matches how that data set will be
used. In the previous example, the application should be logging to a separate store from the database that holds
business data:
public class PolyController : ApiController
{
    private static readonly string ProductionDb = ...;
    private static readonly string LogDb = ...;

    public async Task<IHttpActionResult> PostAsync([FromBody]string value)
    {
        await DataAccess.InsertPurchaseOrderHeaderAsync(ProductionDb);
        // Log to a different data store.
        await DataAccess.LogAsync(LogDb, LogTableName);
        return Ok();
    }
}
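The log store itself does not have to be another relational database. Because log records are written sequentially, an append-only store is often a better match. The following sketch uses an Azure append blob through the classic storage client library; it is illustrative rather than part of the published sample, the container and blob names are placeholders, and creation of the container and blob is omitted for brevity.

using System;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

// Sketch only: append log records to an append blob, which suits the sequential
// write pattern of log data. A single writer is assumed; container and blob
// creation are omitted for brevity.
public class BlobLogWriter
{
    private readonly CloudAppendBlob logBlob;

    public BlobLogWriter(CloudStorageAccount account)
    {
        var container = account.CreateCloudBlobClient().GetContainerReference("logs");
        logBlob = container.GetAppendBlobReference("app-log.txt");
    }

    public Task LogAsync(string message)
    {
        return logBlob.AppendTextAsync(message + Environment.NewLine);
    }
}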

Considerations
Separate data by the way it is used and how it is accessed. For example, don't store log information and
business data in the same data store. These types of data have significantly different requirements and
patterns of access. Log records are inherently sequential, while business data is more likely to require
random access, and is often relational.
Consider the data access pattern for each type of data. For example, store formatted reports and documents
in a document database such as Cosmos DB, but use Azure Cache for Redis to cache temporary data.
If you follow this guidance but still reach the limits of the database, you may need to scale up the database.
Also consider scaling horizontally and partitioning the load across database servers. However, partitioning
may require redesigning the application. For more information, see Data partitioning.

How to detect the problem


The system will likely slow down dramatically and eventually fail, as it runs out of resources such as
database connections.
You can perform the following steps to help identify the cause.
1. Instrument the system to record the key performance statistics. Capture timing information for each operation,
as well as the points where the application reads and writes data.
2. If possible, monitor the system running for a few days in a production environment to get a real-world view of
how the system is used. If this is not possible, run scripted load tests with a realistic volume of virtual users
performing a typical series of operations.
3. Use the telemetry data to identify periods of poor performance.
4. Identify which data stores were accessed during those periods.
5. Identify data storage resources that might be experiencing contention.

Example diagnosis
The following sections apply these steps to the sample application described earlier.
Instrument and monitor the system
The following graph shows the results of load testing the sample application described earlier. The test used a step
load of up to 1000 concurrent users.
As the load increases to 700 users, so does the throughput. But at that point, throughput levels off, and the system
appears to be running at its maximum capacity. The average response time gradually increases with user load, showing
that the system can't keep up with demand.
Identify periods of poor performance
If you are monitoring the production system, you might notice patterns. For example, response times might drop
off significantly at the same time each day. This could be caused by a regular workload or scheduled batch job, or
just because the system has more users at certain times. You should focus on the telemetry data for these events.
Look for correlations between increased response times and increased database activity or I/O to shared resources.
If there are correlations, it means the database might be a bottleneck.
Identify which data stores are accessed during those periods
The next graph shows the utilization of database transaction units (DTUs) during the load test. (A DTU is a measure
of available capacity, and is a combination of CPU utilization, memory allocation, and I/O rate.) Utilization of DTUs
quickly reached 100%. This is roughly the point where throughput peaked in the previous graph. Database
utilization remained very high until the test finished. There is a slight drop toward the end, which could be caused
by throttling, competition for database connections, or other factors.
Examine the telemetry for the data stores
Instrument the data stores to capture the low-level details of the activity. In the sample application, the data access
statistics showed a high volume of insert operations performed against both the PurchaseOrderHeader table and
the MonoLog table.

Identify resource contention


At this point, you can review the source code, focusing on the points where contended resources are accessed by
the application. Look for situations such as:
Data that is logically separate being written to the same store. Data such as logs, reports, and queued messages
should not be held in the same database as business information.
A mismatch between the choice of data store and the type of data, such as large blobs or XML documents in a
relational database.
Data with significantly different usage patterns that share the same store, such as high-write/low-read data
being stored with low-write/high-read data.
Implement the solution and verify the result
The application was changed to write logs to a separate data store. Here are the load test results:

The pattern of throughput is similar to the earlier graph, but the point at which performance peaks is
approximately 500 requests per second higher. The average response time is marginally lower. However, these
statistics don't tell the full story. Telemetry for the business database shows that DTU utilization peaks at around
75%, rather than 100%.

Similarly, the maximum DTU utilization of the log database only reaches about 70%. The databases are no longer
the limiting factor in the performance of the system.
Related resources
Choose the right data store
Criteria for choosing a data store
Data Access for Highly Scalable Solutions: Using SQL, NoSQL, and Polyglot Persistence
Data partitioning
No Caching antipattern

In a cloud application that handles many concurrent requests, repeatedly fetching the same data can reduce
performance and scalability.

Problem description
When data is not cached, it can cause a number of undesirable behaviors, including:
Repeatedly fetching the same information from a resource that is expensive to access, in terms of I/O overhead
or latency.
Repeatedly constructing the same objects or data structures for multiple requests.
Making excessive calls to a remote service that has a service quota and throttles clients past a certain limit.
In turn, these problems can lead to poor response times, increased contention in the data store, and poor
scalability.
The following example uses Entity Framework to connect to a database. Every client request results in a call to the
database, even if multiple requests are fetching exactly the same data. The cost of repeated requests, in terms of
I/O overhead and data access charges, can accumulate quickly.

public class PersonRepository : IPersonRepository
{
    public async Task<Person> GetAsync(int id)
    {
        using (var context = new AdventureWorksContext())
        {
            return await context.People
                .Where(p => p.Id == id)
                .FirstOrDefaultAsync()
                .ConfigureAwait(false);
        }
    }
}

You can find the complete sample here.


This antipattern typically occurs because:
Not using a cache is simpler to implement, and it works fine under low loads. Caching makes the code more
complicated.
The benefits and drawbacks of using a cache are not clearly understood.
There is concern about the overhead of maintaining the accuracy and freshness of cached data.
An application was migrated from an on-premises system, where network latency was not an issue, and the
system ran on expensive high-performance hardware, so caching wasn't considered in the original design.
Developers aren't aware that caching is a possibility in a given scenario. For example, developers may not think
of using ETags when implementing a web API.

How to fix the problem


The most popular caching strategy is the on-demand or cache-aside strategy.
On read, the application tries to read the data from the cache. If the data isn't in the cache, the application
retrieves it from the data source and adds it to the cache.
On write, the application writes the change directly to the data source and removes the old value from the
cache. It will be retrieved and added to the cache the next time it is required.
This approach is suitable for data that changes frequently. Here is the previous example updated to use the Cache-
Aside pattern.

public class CachedPersonRepository : IPersonRepository
{
    private readonly PersonRepository _innerRepository;

    public CachedPersonRepository(PersonRepository innerRepository)
    {
        _innerRepository = innerRepository;
    }

    public async Task<Person> GetAsync(int id)
    {
        return await CacheService.GetAsync<Person>("p:" + id, () =>
            _innerRepository.GetAsync(id)).ConfigureAwait(false);
    }
}

public class CacheService
{
    private static ConnectionMultiplexer _connection;

    // Default expiration period (the value shown here is illustrative).
    private const double DefaultExpirationTimeInMinutes = 5;

    public static async Task<T> GetAsync<T>(string key, Func<Task<T>> loadCache,
        double expirationTimeInMinutes = DefaultExpirationTimeInMinutes)
    {
        IDatabase cache = Connection.GetDatabase();
        T value = await GetAsync<T>(cache, key).ConfigureAwait(false);
        if (value == null)
        {
            // Value was not found in the cache. Call the lambda to get the value from the database.
            value = await loadCache().ConfigureAwait(false);
            if (value != null)
            {
                // Add the value to the cache.
                await SetAsync(cache, key, value, expirationTimeInMinutes).ConfigureAwait(false);
            }
        }
        return value;
    }
}

Notice that the GetAsync method now calls the CacheService class, rather than calling the database directly. The
CacheService class first tries to get the item from Azure Cache for Redis. If the value isn't found in the cache, the
CacheService invokes a lambda function that was passed to it by the caller. The lambda function is responsible for
fetching the data from the database. This implementation decouples the repository from the particular caching
solution, and decouples the CacheService from the database.
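The sample shows only the read side of the Cache-Aside pattern. On write, the repository updates the data source first and then removes the stale entry from the cache. The following sketch assumes an UpdateAsync method on the inner repository, a DeleteAsync helper on CacheService, and an Id property on Person; none of these are part of the published sample.

// Sketch of the write path of the Cache-Aside pattern. UpdateAsync, DeleteAsync,
// and the Person.Id property are assumed helpers, not part of the published sample.
public async Task UpdateAsync(Person person)
{
    // Write the change to the system of record first.
    await _innerRepository.UpdateAsync(person).ConfigureAwait(false);

    // Then invalidate the cached copy so the next read repopulates it.
    await CacheService.DeleteAsync("p:" + person.Id).ConfigureAwait(false);
}

// Inside CacheService:
public static Task DeleteAsync(string key)
{
    return Connection.GetDatabase().KeyDeleteAsync(key);
}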

Considerations
If the cache is unavailable, perhaps because of a transient failure, don't return an error to the client. Instead,
fetch the data from the original data source. However, be aware that while the cache is being recovered, the
original data store could be swamped with requests, resulting in timeouts and failed connections. (After all,
this is one of the motivations for using a cache in the first place.) Use a technique such as the Circuit Breaker
pattern to avoid overwhelming the data source.
Applications that cache dynamic data should be designed to support eventual consistency.
For web APIs, you can support client-side caching by including a Cache-Control header in request and
response messages, and using ETags to identify versions of objects. For more information, see API
implementation.
You don't have to cache entire entities. If most of an entity is static but only a small piece changes frequently,
cache the static elements and retrieve the dynamic elements from the data source. This approach can help
to reduce the volume of I/O being performed against the data source.
In some cases, if volatile data is short-lived, it can be useful to cache it. For example, consider a device that
continually sends status updates. It might make sense to cache this information as it arrives, and not write it
to a persistent store at all.
To prevent data from becoming stale, many caching solutions support configurable expiration periods, so
that data is automatically removed from the cache after a specified interval. You may need to tune the
expiration time for your scenario. Data that is highly static can stay in the cache for longer periods than
volatile data that may become stale quickly.
If the caching solution doesn't provide built-in expiration, you may need to implement a background
process that occasionally sweeps the cache, to prevent it from growing without limits.
Besides caching data from an external data source, you can use caching to save the results of complex
computations. Before you do that, however, instrument the application to determine whether the application
is really CPU bound.
It might be useful to prime the cache when the application starts. Populate the cache with the data that is
most likely to be used.
Always include instrumentation that detects cache hits and cache misses. Use this information to tune
caching policies, such as what data to cache, and how long to hold data in the cache before it expires. (A
sketch of such instrumentation follows this list.)
If the lack of caching is a bottleneck, then adding caching may increase the volume of requests so much that
the web front end becomes overloaded. Clients may start to receive HTTP 503 (Service Unavailable) errors.
These are an indication that you should scale out the front end.
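As noted above, instrumentation for cache hits and misses can be as simple as a pair of counters. The following sketch adds them to the CacheService.GetAsync method shown earlier; in production, these counts would normally flow into your telemetry pipeline rather than static fields.

// Sketch: count cache hits and misses inside CacheService. In production, forward
// these values to your monitoring system instead of keeping static counters.
private static long cacheHits;
private static long cacheMisses;

public static async Task<T> GetAsync<T>(string key, Func<Task<T>> loadCache,
    double expirationTimeInMinutes = DefaultExpirationTimeInMinutes)
{
    IDatabase cache = Connection.GetDatabase();
    T value = await GetAsync<T>(cache, key).ConfigureAwait(false);
    if (value == null)
    {
        Interlocked.Increment(ref cacheMisses);
        value = await loadCache().ConfigureAwait(false);
        if (value != null)
        {
            await SetAsync(cache, key, value, expirationTimeInMinutes).ConfigureAwait(false);
        }
    }
    else
    {
        Interlocked.Increment(ref cacheHits);
    }
    return value;
}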

How to detect the problem


You can perform the following steps to help identify whether lack of caching is causing performance problems:
1. Review the application design. Take an inventory of all the data stores that the application uses. For each,
determine whether the application is using a cache. If possible, determine how frequently the data changes.
Good initial candidates for caching include data that changes slowly, and static reference data that is read
frequently.
2. Instrument the application and monitor the live system to find out how frequently the application retrieves
data or calculates information.
3. Profile the application in a test environment to capture low-level metrics about the overhead associated
with data access operations or other frequently performed calculations.
4. Perform load testing in a test environment to identify how the system responds under a normal workload
and under heavy load. Load testing should simulate the pattern of data access observed in the production
environment using realistic workloads.
5. Examine the data access statistics for the underlying data stores and review how often the same data
requests are repeated.
Example diagnosis
The following sections apply these steps to the sample application described earlier.
Instrument the application and monitor the live system
Instrument the application and monitor it to get information about the specific requests that users make while the
application is in production.
The following image shows monitoring data captured by New Relic during a load test. In this case, the only HTTP
GET operation performed is Person/GetAsync . But in a live production environment, knowing the relative frequency
that each request is performed can give you insight into which resources should be cached.

If you need a deeper analysis, you can use a profiler to capture low-level performance data in a test environment
(not the production system). Look at metrics such as I/O request rates, memory usage, and CPU utilization. These
metrics may show a large number of requests to a data store or service, or repeated processing that performs the
same calculation.
Load test the application
The following graph shows the results of load testing the sample application. The load test simulates a step load of
up to 800 users performing a typical series of operations.
The number of successful tests performed each second reaches a plateau, and additional requests are slowed as a
result. The average test time steadily increases with the workload. The response time levels off once the user load
peaks.
Examine data access statistics
Data access statistics and other information provided by a data store can give useful information, such as which
queries are repeated most frequently. For example, in Microsoft SQL Server, the sys.dm_exec_query_stats
management view has statistical information for recently executed queries. The query text and execution plan are
available from the sys.dm_exec_sql_text and sys.dm_exec_query_plan dynamic management functions. You can use
a tool such as SQL Server Management Studio to run the following
SQL query and determine how frequently queries are performed.

SELECT UseCounts, Text, Query_Plan
FROM sys.dm_exec_cached_plans
CROSS APPLY sys.dm_exec_sql_text(plan_handle)
CROSS APPLY sys.dm_exec_query_plan(plan_handle)

The UseCounts column in the results indicates how frequently each query is run. The following image shows that
the third query was run more than 250,000 times, significantly more than any other query.
Here is the SQL query that is causing so many database requests:

(@p__linq__0 int)SELECT TOP (2)
    [Extent1].[BusinessEntityId] AS [BusinessEntityId],
    [Extent1].[FirstName] AS [FirstName],
    [Extent1].[LastName] AS [LastName]
FROM [Person].[Person] AS [Extent1]
WHERE [Extent1].[BusinessEntityId] = @p__linq__0

This is the query that Entity Framework generates for the GetAsync method of the repository shown earlier.
Implement the solution and verify the result
After you incorporate a cache, repeat the load tests and compare the results to the earlier load tests without a
cache. Here are the load test results after adding a cache to the sample application.
The volume of successful tests still reaches a plateau, but at a higher user load. The request rate at this load is
significantly higher than earlier. Average test time still increases with load, but the maximum response time is 0.05
ms, compared with 1 ms earlier—a 20× improvement.

Related resources
API implementation best practices
Cache-Aside pattern
Caching best practices
Circuit Breaker pattern
Synchronous I/O antipattern

Blocking the calling thread while I/O completes can reduce performance and affect vertical scalability.

Problem description
A synchronous I/O operation blocks the calling thread while the I/O completes. The calling thread enters a wait
state and is unable to perform useful work during this interval, wasting processing resources.
Common examples of I/O include:
Retrieving or persisting data to a database or any type of persistent storage.
Sending a request to a web service.
Posting a message or retrieving a message from a queue.
Writing to or reading from a local file.
This antipattern typically occurs because:
It appears to be the most intuitive way to perform an operation.
The application requires a response from a request.
The application uses a library that only provides synchronous methods for I/O.
An external library performs synchronous I/O operations internally. A single synchronous I/O call can block an
entire call chain.
The following code uploads a file to Azure blob storage. There are two places where the code blocks waiting for
synchronous I/O, the CreateIfNotExists method and the UploadFromStream method.

var blobClient = storageAccount.CreateCloudBlobClient();
var container = blobClient.GetContainerReference("uploadedfiles");

container.CreateIfNotExists();
var blockBlob = container.GetBlockBlobReference("myblob");

// Create or overwrite the "myblob" blob with contents from a local file.
using (var fileStream = File.OpenRead(HostingEnvironment.MapPath("~/FileToUpload.txt")))
{
    blockBlob.UploadFromStream(fileStream);
}

Here's an example of waiting for a response from an external service. The GetUserProfile method calls a remote
service that returns a UserProfile .
public interface IUserProfileService
{
    UserProfile GetUserProfile();
}

public class SyncController : ApiController
{
    private readonly IUserProfileService _userProfileService;

    public SyncController()
    {
        _userProfileService = new FakeUserProfileService();
    }

    // This is a synchronous method that calls the synchronous GetUserProfile method.
    public UserProfile GetUserProfile()
    {
        return _userProfileService.GetUserProfile();
    }
}

You can find the complete code for both of these examples here.

How to fix the problem


Replace synchronous I/O operations with asynchronous operations. This frees the current thread to continue
performing meaningful work rather than blocking, and helps improve the utilization of compute resources.
Performing I/O asynchronously is particularly efficient for handling an unexpected surge in requests from client
applications.
Many libraries provide both synchronous and asynchronous versions of methods. Whenever possible, use the
asynchronous versions. Here is the asynchronous version of the previous example that uploads a file to Azure blob
storage.

var blobClient = storageAccount.CreateCloudBlobClient();
var container = blobClient.GetContainerReference("uploadedfiles");

await container.CreateIfNotExistsAsync();
var blockBlob = container.GetBlockBlobReference("myblob");

// Create or overwrite the "myblob" blob with contents from a local file.
using (var fileStream = File.OpenRead(HostingEnvironment.MapPath("~/FileToUpload.txt")))
{
    await blockBlob.UploadFromStreamAsync(fileStream);
}

The await operator returns control to the calling environment while the asynchronous operation is performed.
The code after this statement acts as a continuation that runs when the asynchronous operation has completed.
A well designed service should also provide asynchronous operations. Here is an asynchronous version of the web
service that returns user profiles. The GetUserProfileAsync method depends on having an asynchronous version
of the User Profile service.
public interface IUserProfileService
{
    Task<UserProfile> GetUserProfileAsync();
}

public class AsyncController : ApiController
{
    private readonly IUserProfileService _userProfileService;

    public AsyncController()
    {
        _userProfileService = new FakeUserProfileService();
    }

    // This is a synchronous method that calls the Task-based GetUserProfileAsync method.
    public Task<UserProfile> GetUserProfileAsync()
    {
        return _userProfileService.GetUserProfileAsync();
    }
}

For libraries that don't provide asynchronous versions of operations, it may be possible to create asynchronous
wrappers around selected synchronous methods. Follow this approach with caution. While it may improve
responsiveness on the thread that invokes the asynchronous wrapper, it actually consumes more resources. An
extra thread may be created, and there is overhead associated with synchronizing the work done by this thread.
Some tradeoffs are discussed in this blog post: Should I expose asynchronous wrappers for synchronous methods?
Here is an example of an asynchronous wrapper around a synchronous method.

// Asynchronous wrapper around synchronous library method
private async Task<int> LibraryIOOperationAsync()
{
    return await Task.Run(() => LibraryIOOperation());
}

Now the calling code can await on the wrapper:

// Invoke the asynchronous wrapper using a task
await LibraryIOOperationAsync();

Considerations
I/O operations that are expected to be very short lived and are unlikely to cause contention might be more
performant as synchronous operations. An example might be reading small files on an SSD drive. The
overhead of dispatching a task to another thread, and synchronizing with that thread when the task
completes, might outweigh the benefits of asynchronous I/O. However, these cases are relatively rare, and
most I/O operations should be done asynchronously.
Improving I/O performance may cause other parts of the system to become bottlenecks. For example,
unblocking threads might result in a higher volume of concurrent requests to shared resources, leading in
turn to resource starvation or throttling. If that becomes a problem, you might need to scale out the number
of web servers or partition data stores to reduce contention.

How to detect the problem


For users, the application may seem unresponsive periodically. The application might fail with timeout exceptions.
These failures could also return HTTP 500 (Internal Server Error) responses. On the server, incoming client requests might be
blocked until a thread becomes available, resulting in excessive request queue lengths, manifested as HTTP 503
(Service Unavailable) errors.
You can perform the following steps to help identify the problem:
1. Monitor the production system and determine whether blocked worker threads are constraining
throughput.
2. If requests are being blocked due to lack of threads, review the application to determine which operations
may be performing I/O synchronously.
3. Perform controlled load testing of each operation that is performing synchronous I/O, to find out whether
those operations are affecting system performance.

Example diagnosis
The following sections apply these steps to the sample application described earlier.
Monitor web server performance
For Azure web applications and web roles, it's worth monitoring the performance of the IIS web server. In
particular, pay attention to the request queue length to establish whether requests are being blocked waiting for
available threads during periods of high activity. You can gather this information by enabling Azure diagnostics. For
more information, see:
Monitor Apps in Azure App Service
Create and use performance counters in an Azure application
Instrument the application to see how requests are handled once they have been accepted. Tracing the flow of a
request can help to identify whether it is performing slow-running calls and blocking the current thread. Thread
profiling can also highlight requests that are being blocked.
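One lightweight way to spot thread starvation during this kind of investigation is to periodically sample the number of available thread-pool threads. The following sketch uses ThreadPool.GetAvailableThreads; the Log helper is a placeholder for whatever logging you already have.

using System;
using System.Threading;

// Sketch only: sample the available worker and I/O completion threads every 30 seconds.
// A sustained drop toward zero suggests that requests are blocking thread-pool threads
// on synchronous I/O. Keep a reference to the timer so it is not garbage collected.
// "Log" is a placeholder for your logging helper.
var threadPoolSampler = new Timer(_ =>
{
    int workerThreads, completionPortThreads;
    ThreadPool.GetAvailableThreads(out workerThreads, out completionPortThreads);
    Log.Info($"Available worker threads: {workerThreads}, I/O completion threads: {completionPortThreads}");
}, null, TimeSpan.Zero, TimeSpan.FromSeconds(30));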
Load test the application
The following graph shows the performance of the synchronous GetUserProfile method shown earlier, under
varying loads of up to 4000 concurrent users. The application is an ASP.NET application running in an Azure Cloud
Service web role.

The synchronous operation is hard-coded to sleep for 2 seconds, to simulate synchronous I/O, so the minimum
response time is slightly over 2 seconds. When the load reaches approximately 2500 concurrent users, the average
response time reaches a plateau, although the volume of requests per second continues to increase. Note that the
scale for these two measures is logarithmic. The number of requests per second doubles between this point and
the end of the test.
In isolation, it's not necessarily clear from this test whether the synchronous I/O is a problem. Under heavier load,
the application may reach a tipping point where the web server can no longer process requests in a timely manner,
causing client applications to receive time-out exceptions.
Incoming requests are queued by the IIS web server and handed to a thread running in the ASP.NET thread pool.
Because each operation performs I/O synchronously, the thread is blocked until the operation completes. As the
workload increases, eventually all of the ASP.NET threads in the thread pool are allocated and blocked. At that
point, any further incoming requests must wait in the queue for an available thread. As the queue length grows,
requests start to time out.
Implement the solution and verify the result
The next graph shows the results from load testing the asynchronous version of the code.

Throughput is far higher. Over the same duration as the previous test, the system successfully handles a nearly
tenfold increase in throughput, as measured in requests per second. Moreover, the average response time is
relatively constant and remains approximately 25 times smaller than the previous test.
Responsible Innovation: A Best Practices Toolkit

This toolkit provides developers with a set of practices in development, for anticipating and addressing the
potential negative impacts of technology on people. We are sharing this as an early stage practice for feedback and
learning.

Judgment Call
Judgment Call is an award-winning game and team-based activity that puts Microsoft’s AI principles of fairness,
privacy and security, reliability and safety, transparency, inclusion, and accountability into action. The game provides
an easy-to-use method for cultivating stakeholder empathy through scenario-imagining. Game participants write
product reviews from the perspective of a particular stakeholder, describing what kind of impact and harms the
technology could produce from their point of view.

Harms Modeling
Harms Modeling is a framework for product teams, grounded in four core pillars that examine how people's lives
can be negatively impacted by technology: injuries, denial of consequential services, infringement on human rights,
and erosion of democratic & societal structures. Similar to Security Threat Modeling, Harms Modeling enables
product teams to anticipate potential real-world impacts of technology, which is a cornerstone of responsible
development.

Community Jury
Community Jury is a technique that brings together diverse stakeholders impacted by a technology. It is an
adaptation of the citizen jury. The stakeholders are provided an opportunity to learn from experts about a project,
deliberate together, and give feedback on use cases and product design. This technique allows project teams to
collaborate with researchers to identify stakeholder values, and understand the perceptions and concerns of
impacted stakeholders.
Judgment Call

Judgment Call is an award-winning game and team-based activity that puts Microsoft’s AI principles of fairness,
privacy and security, reliability and safety, transparency, inclusion, and accountability into action. The game
provides an easy-to-use method for cultivating stakeholder empathy by imagining their scenarios. Game
participants write product reviews from the perspective of a particular stakeholder, describing what kind of impact
and harms the technology could produce from their point of view.

Benefits
Technology builders need practical methods to incorporate ethics in product development, by considering the
values of diverse stakeholders and how technology may uphold or not uphold those values. The goal of the game
is to imagine potential outcomes of a product or platform by gaining a better understanding of stakeholders, and
what they need and expect.
The game helps people discuss challenging topics in a safe space within the context of gameplay, and gives
technology creators a vocabulary to facilitate ethics discussions during product development. It gives managers and
designers an interactive tool to lead ethical dialogue with their teams to incorporate ethical design in their
products.
The theoretical basis and initial outcomes of the Judgment Call game were presented at the 2019 ACM Design
Interactive Systems conference and the game was a finalist in the Fast Company 2019 Innovation by Design
Awards (Social Goods category). The game has been presented to thousands of people in the US and
internationally. While the largest group facilitated in game play has been over 100, each card deck can involve 1-10
players. This game is not intended to be a substitute for performing robust user research. Rather, it's an exercise to
help tech builders learn how to exercise their moral imagination.

Preparation
Judgment Call uses role play to surface ethical concerns in product development so players will anticipate societal
impact, write product reviews, and explore mitigations. Players think of what kind of harms and impacts the
technology might produce by writing product reviews from the point of view of a stakeholder.

During the Game


The players are expected to go through the following steps in this game:
1. Identify the product
Start by identifying a product that your team is building, or a scenario you are working on. For example, if
your team is developing synthetic voice technology, your scenario could be developing a voice assistant for
online learning.
2. Pass the cards
Once you have decided on the scenario, pass out the cards. Each player receives a stakeholder card, an
ethical principle card, a rating card, and a review form. The following is a description of these cards:
Stakeholder - This represents the group of people impacted by the technology or scenario you’ve
chosen. Stakeholders can also include the makers (those proposing and working on the technology)
as well as the internal project sponsors (managers and executives who are supportive of the project).
Principle - This includes Microsoft’s ethical principles of fairness, privacy and security, reliability and
safety, transparency, inclusion, and accountability.
Rating - Star cards give a rating: 1-star is poor, 3-star is mediocre, 5-star is excellent.
3. Brainstorm and identify stakeholders
As a group, brainstorm all the stakeholders that could be directly or indirectly affected by the technology.
Then assign each player a stakeholder from the list. In some cases, the Judgment Call decks may have
pre-populated stakeholder cards, but players can add additional cards relevant to their situation.
The deck provides the following categories of stakeholders:
Direct - These use the technology and/or the technology is used on them.
Indirect - These feel the effects of the technology.
Special populations - A wide range of categories that includes those who cannot use the
technology or choose not to use it, and organizations that may be interested in its deployment like
advocacy groups or the government.
There is a range of harms that can be felt by or perpetuated onto each group of stakeholders. Types of
stakeholders can include end users of the product, administrators, internal teams, advocacy groups,
manufacturers, someone who might be excluded from using the product. Stakeholders can also include the
makers who propose and work on the technology, as well as the internal project sponsors, that is, the
managers and executives who are supportive of the project.
4. Moderator selection
The moderator describes scenarios for gameplay, guides discussion, and collects comment cards at the end.
They will also give an overview of the ethical principles considered in the deck.
5. Presentation of a product
The moderator introduces a product, real or imaginary, for the players to role play. They can start with
questions about the product with prompts such as:
Who will use the product?
What features will it have?
When will it be used?
Where will it be used?
Why is it useful?
How will it be used?
6. Writing reviews
Players each write a review of the product or scenario based on the cards they’ve been dealt. Players take
turns reading their reviews aloud. Listen carefully for any issues or themes that might come up. Take note of
any changes you might make to your product based on insights discussed.
The stakeholder card tells them who they are for the round.
The ethical principle card is one of the six Microsoft AI principles; it tells them what area they’re
focusing on in their review.
The rating card tells them how much their stakeholder likes the technology (1, 3, or 5 stars).
The player then writes a review of the technology, from the perspective of their stakeholder, of why they like
or dislike the product from the perspective of the ethical principle they received.

Discussion
The moderator has each player read their reviews. Everyone is invited to discuss the different perspectives and
other considerations that may not have come up.
Potential moderator questions include:
Who is the most impacted?
What features are problematic?
What are the potential harms?

Harms mitigation
Finally, the group picks a thread from the discussion to begin identifying design or process changes that can
mitigate a particular harm. At the end of each round, the decks are shuffled, and another round can begin with the
same or a different scenario.

Next steps
Once you have enough understanding of potential harms or negative impact your product or scenario may cause,
proceed to learn how to model these harms so you can devise effective mitigations.
Foundations of assessing harm

Harms Modeling is a practice designed to help you anticipate the potential for harm, identify gaps in a product that
could put people at risk, and ultimately create approaches that proactively address harm.

Why harms modeling?


As technology builders, your work is global. Designing AI to be trustworthy requires creating solutions that reflect
ethical principles deeply rooted in important and timeless values. In the process of designing and building your
technologies, it is essential to evaluate not only ideal outcomes, but possible negative ones as well.
Technology and human rights
It is as important as ever to be aware of how digital technology could impact human rights. In addition to
continuing to protect privacy and security, we must address the risks of AI and other emerging technologies, such
as facial recognition. History teaches us that human rights violations not only result from the nefarious use of
technology, but also from a lack of awareness amongst those with good intentions. As a part of our company's
dedication to the protection of human rights, Microsoft forged a partnership with important stakeholders outside
of our industry, including the United Nations (UN).
An important set of UN principles that our company firmly supports, which was ratified by over 250 nations, is the
Universal Declaration of Human Rights (UDHR). The UDHR is a milestone document. Drafted by a diverse global
group of legal and cultural experts, the Declaration was proclaimed by the United Nations General Assembly in
1948 as a common standard of achievements for all peoples and all nations. It sets out, for the first time,
fundamental human rights to be universally protected. It has been translated into over 500 languages. Additionally,
Microsoft is one of 4,700 corporate signatories to the UN Global Compact, an international business initiative
designed to promote responsible corporate citizenship.


Human understanding
In addition to appreciating the importance of human rights, building trustworthy systems requires us to consider
many people's perspectives. Asking who the stakeholders are, what they value, how they could benefit, and how
they could be hurt by our technology, is a powerful step that allows us to design and build better products.
Who does the technology impact?
Who are the customers?
What do they value?
How should they benefit?
How could tech harm them?
Who are the non-customer stakeholders?
What do they value?
How should they benefit?
How could tech harm them?
Asking these questions is a practice in Value Sensitive Design and is the beginning of a better understanding of what is
important to stakeholders, and how it plays into their relationship with the product.
Types of stakeholders
Project sponsors
Backers, decision makers, and owners make up this category. Their values are articulated in the project strategy
and goals.
Tech builders
Designers, developers, project managers, and those working directly on designing systems make up this group.
They bring their own ethical standards and profession-specific values to the system.
Direct & indirect stakeholders
These stakeholders are significantly impacted by the system. This includes end users, software staff, clients,
bystanders, interfacing institutions, and even past or future generations. Non-human factors such as places, for
example, historic buildings or sacred spaces, may also be included.
Marginalized populations
This category is made up of the population frequently considered a minority, vulnerable, or stigmatized. This
includes children, the elderly, members of the LGBTQ+ community, ethnic minorities, and other populations who
often experience unique and disproportionate consequences.

Assessing Harm
Once you have defined the technology purpose, use cases, and stakeholders, conduct a Harms Modeling exercise
to evaluate potential ways the use of a technology you are building could result in negative outcomes for people
and society.



The diagram above is an example of a harms evaluation. This is a qualitative approach used to understand
potential magnitude of harm.
You can complete this ideation activity individually, but ideally it is conducted as collaboration between developers,
data scientists, designers, user researchers, business decision-makers, and other disciplines that are involved in
building the technology.
Suggestions for harm description statements:
Intended use: If [feature] was used for [use case], [stakeholder] could experience [harm description].
Unintended use: If [user] tried to use [feature] for [use case], [stakeholder] could experience [harm description].
System error: If [feature] failed to function properly when used for [use case], [stakeholder] could experience
[harm description].
Misuse: [Malicious actor] could potentially use [feature], to cause [harm description] to [stakeholder].
Use the categories, questions, and examples described in Types of harm to generate specific ideas for how
harm could occur. The article lists categories of harms that are based on common negative impact areas. Adapt
and adopt additional categories that are relevant to you.

Next Steps
Read Types of harm for further harms analysis.
Types of harm

This article creates awareness for the different types of harms, so that appropriate mitigation steps can be
implemented.

Risk of injury
Physical injury
Consider how technology could hurt people or create dangerous environments.

Harm: Overreliance on safety features
Description: This points out the dependence on technology to make decisions without adequate human oversight.
Consideration(s): How might people rely on this technology to keep them safe? How could this technology reduce appropriate human oversight?
Example: A healthcare agent could misdiagnose illness, leading to unnecessary treatment.

Harm: Inadequate fail-safes
Description: Real-world testing may insufficiently consider a diverse set of users and scenarios.
Consideration(s): If this technology fails or is misused, how would people be impacted? At what point could a human intervene? Are there alternative uses that have not been tested for? How would users be impacted by a system failure?
Example: If an automatic door failed to detect a wheelchair during an emergency evacuation, a person could be trapped if there isn't an accessible override button.

Harm: Exposure to unhealthy agents
Description: Manufacturing, as well as disposal of technology, can jeopardize the health and well-being of workers and nearby inhabitants.
Consideration(s): What negative outcomes could come from the manufacturing of your components or devices?
Example: Inadequate safety measures could cause workers to be exposed to toxins during digital component manufacturing.

Emotional or psychological injury


Misused technology can lead to serious emotional and psychological distress.

Harm: Overreliance on automation
Description: Misguided beliefs can lead users to trust the reliability of a digital agent over that of a human.
Consideration(s): How could this technology reduce direct interpersonal feedback? How might this technology interface with trusted sources of information? How could sole dependence on an artificial agent impact a person?
Example: A chat bot could be relied upon for relationship advice or mental health counseling instead of a trained professional.

Harm: Distortion of reality or gaslighting
Description: When intentionally misused, technology can undermine trust and distort someone's sense of reality.
Consideration(s): Could this be used to modify digital media or physical environments?
Example: An IoT device could enable monitoring and controlling of an ex-intimate partner from afar.

Harm: Reduced self-esteem/reputation damage
Description: Some shared content can be harmful, false, misleading, or denigrating.
Consideration(s): How could this technology be used to inappropriately share personal information? How could it be manipulated to misuse information and misrepresent people?
Example: Synthetic media "revenge porn" can swap faces, creating the illusion of a person participating in a video, who did not.

Harm: Addiction/attention hijacking
Description: Technology could be designed for prolonged interaction, without regard for well-being.
Consideration(s): In what ways might this technology reward or encourage continued interaction beyond delivering user value?
Example: Variable drop rates in video game loot boxes could cause players to keep playing and neglect self-care.

Harm: Identity theft
Description: Identity theft may lead to loss of control over personal credentials, reputation, and/or representation.
Consideration(s): How might an individual be impersonated with this technology? How might this technology mistakenly recognize the wrong individual as an authentic user?
Example: Synthetic voice font could mimic the sound of a person's voice and be used to access a bank account.

Harm: Misattribution
Description: This includes crediting a person with an action or content that they are not responsible for.
Consideration(s): In what ways might this technology attribute an action to an individual or group? How could someone be affected if an action was incorrectly attributed to them?
Example: Facial recognition can misidentify an individual during a police investigation.

Denial of consequential services


Opportunity loss
Automated decisions could limit access to resources, services, and opportunities essential to well-being.

Harm: Employment discrimination
Description: Some people may be denied access to apply for or secure a job based on characteristics unrelated to merit.
Consideration(s): Are there ways in which this technology could impact recommendations or decisions related to employment?
Example: Hiring AI could recommend fewer candidates with female-sounding names for interviews.

Harm: Housing discrimination
Description: This includes denying people access to housing or the ability to apply for housing.
Consideration(s): How could this technology impact recommendations or decisions related to housing?
Example: A public housing queuing algorithm could cause people with international-sounding names to wait longer for vouchers.

Harm: Insurance and benefit discrimination
Description: This includes denying people insurance, social assistance, or access to a medical trial due to biased standards.
Consideration(s): Could this technology be used to determine access, cost, or allocation of insurance or social benefits?
Example: An insurance company might charge higher rates for drivers working night shifts due to algorithmic predictions suggesting increased drunk driving risk.

Harm: Educational discrimination
Description: Access to education may be denied because of an unchangeable characteristic.
Consideration(s): How might this technology be used to determine access, cost, accommodations, or other outcomes related to education?
Example: An emotion classifier could incorrectly report that students of color are less engaged than their white counterparts, leading to lower grades.

Harm: Digital divide/technological discrimination
Description: Disproportionate access to the benefits of technology may leave some people less informed or less equipped to participate in society.
Consideration(s): What prerequisite skills, equipment, or connectivity are necessary to get the most out of this technology? What might be the impact of select people gaining earlier access to this technology than others, in terms of equipment, connectivity, or other product functionality?
Example: Content throttling could prevent rural students from accessing the classroom instruction video feed.

Harm: Loss of choice/network and filter bubble
Description: Presenting people with only information that conforms to and reinforces their own beliefs.
Consideration(s): How might this technology affect which choices and information are made available to people? What past behaviors or preferences might this technology rely on to predict future behaviors or preferences?
Example: A news feed could present only information that confirms existing beliefs.

Economic loss
Automating decisions related to financial instruments, economic opportunity, and resources can amplify existing
societal inequities and obstruct well-being.

Harm: Credit discrimination
Description: People may be denied access to financial instruments based on characteristics unrelated to economic merit.
Consideration(s): How might this technology rely on existing credit structures to make decisions? How might this technology affect the ability of an individual or group to obtain or maintain a credit score?
Example: Higher introductory rate offers could be sent only to homes in lower socioeconomic postal codes.

Harm: Differential pricing of goods and services
Description: Goods or services may be offered at different prices for reasons unrelated to the cost of production or delivery.
Consideration(s): How could this technology be used to determine pricing of goods or services? What are the criteria for determining the cost to people for use of this technology?
Example: More could be charged for products based on designation for men or women.

Harm: Economic exploitation
Description: People might be compelled or misled to work on something that impacts their dignity or wellbeing.
Consideration(s): What role did human labor play in producing training data for this technology? How was this workforce acquired? What role does human labor play in supporting this technology? Where is this workforce expected to come from?
Example: Paying financially destitute people for their biometric data to train AI systems.

Harm: Devaluation of individual expertise
Description: Technology may supplant the use of paid human expertise or labor.
Consideration(s): How might this technology impact the need to employ an existing workforce?
Example: AI agents replace doctors/radiographers for evaluation of medical imaging.

Infringement on human rights


Dignity loss
Technology can influence how people perceive the world, and how they recognize, engage, and value one another.
The exchange of honor and respect between people can be interfered with.

Harm: Dehumanization
Description: Removing, reducing, or obscuring visibility of a person's humanity.
Consideration(s): How might this technology be used to simplify or abstract the way a person is represented? How might this technology reduce the distinction between humans and the digital world?
Example: Entity recognition and virtual overlays in drone surveillance could reduce the perceived accountability of human actions.

Harm: Public shaming
Description: This may mean exposing people's private, sensitive, or socially inappropriate material.
Consideration(s): How might movements or actions be revealed through data aggregation?
Example: A fitness app could reveal a user's GPS location on social media, indicating attendance at an Alcoholics Anonymous meeting.

Liberty loss
Automation of legal, judicial, and social systems can reinforce biases and lead to detrimental consequences.

Harm: Predictive policing
Description: Inferring suspicious behavior and/or criminal intent based on historical records.
Consideration(s): How could this support or replace human policing or criminal justice decision-making?
Example: An algorithm can predict a number of area arrests, so police make sure they match, or exceed, that number.

Harm: Social control
Description: Conformity may be reinforced or encouraged by publicly designating human behaviors as positive or negative.
Consideration(s): What types of personal or behavioral data might feed this technology? How would it be obtained? What outputs would be derived from this data? Is this technology likely to be used to encourage or discourage certain behaviors?
Example: An authoritarian government uses social media and e-commerce data to determine a "trustworthy" score based on where people shop and who they spend time with.

Harm: Loss of effective remedy
Description: This means an inability to explain the rationale or lack of opportunity to contest a decision.
Consideration(s): How might people understand the reasoning for decisions made by this technology? How might an individual that relies on this technology explain the decisions it makes? How could people contest or question a decision this technology makes?
Example: An automated prison sentence or pre-trial release decision is not explained to the accused.

Privacy loss
Information generated by our use of technology can be used to determine facts or make assumptions about
someone without their knowledge.

Harm: Interference with private life
Description: Revealing information a person has not chosen to share.
Consideration(s): How could this technology use information to infer portions of a private life? How could decisions based upon these inferences expose things that a person does not want made public?
Example: Task-tracking AI could monitor personal patterns from which it infers an extramarital affair.

Harm: Forced association
Description: Requiring participation in the use of technology or surveillance to take part in society.
Consideration(s): How might use of this technology be required for participation in society or organization membership?
Example: Biometric enrollment in a company's meeting room transcription AI is a stipulated requirement in a job offer letter.

Harm: Inability to freely and fully develop personality
Description: This may mean restriction of one's ability to truthfully express themselves or explore external avenues for self-development.
Consideration(s): In what way does the system or product ascribe positive vs negative connotations toward particular personality traits? In what way can using the product or system reveal information to entities such as the government or employer that inhibits free expression?
Example: An intelligent meeting system could record all discussions between colleagues, including personal coaching and mentorship sessions.

Harm: Never forgiven
Description: Digital files or records may never be deleted.
Consideration(s): What and where is data being stored from this product, and who can access it? How long is user data stored after technology interaction? How is user data updated or deleted?
Example: A teenager's social media history could remain searchable long after they have outgrown the platform.

Harm: Loss of freedom of movement or assembly
Description: This means an inability to navigate the physical or virtual world with desired anonymity.
Consideration(s): In what ways might this technology monitor people across physical and virtual space?
Example: A real name could be required in order to sign up for a video game, enabling real-world stalking.

Environmental impact
The environment can be impacted by every decision in a system or product life cycle, from the amount of cloud
computing needed to retail packaging. Environmental changes can impact entire communities.

Harm: Exploitation or depletion of resources
Description: Obtaining the raw materials for a technology, including how it's powered, leads to negative consequences to the environment and its inhabitants.
Consideration(s): What materials are needed to build or run this technology? What energy requirements are needed to build or run this technology?
Example: A local community could be displaced due to the harvesting of rare earth minerals and metals required for some electronic manufacturing.

Harm: Electronic waste
Description: Reduced quality of collective well-being because of the inability to repair, recycle, or otherwise responsibly dispose of electronics.
Consideration(s): How might this technology reduce electronic waste by recycling materials or allowing users to self-repair? How might this technology contribute to electronic waste when new versions are released or when current/past versions stop working?
Example: Toxic materials inside discarded electronic devices could leach into the water supply, making local populations ill.

Erosion of social & democratic structures


Manipulation
The ability for technology to be used to create highly personalized and manipulative experiences can undermine an
informed citizenry and trust in societal structures.
Harm: Misinformation
Description: Disguising fake information as legitimate or credible information.
Consideration(s): How might this technology be used to generate misinformation? How could it be used to spread misinformation that appears credible?
Example: Generation of synthetic speech of a political leader sways an election.

Harm: Behavioral exploitation
Description: This means exploiting personal preferences or patterns of behavior to induce a desired reaction.
Consideration(s): How might this technology be used to observe patterns of behavior? How could this technology be used to encourage dysfunctional or maladaptive behaviors?
Example: Monitoring shopping habits in a connected retail environment leads to personalized incentives for impulse shoppers and hoarders.

Social detriment
At scale, the way technology impacts people shapes social and economic structures within communities. It can
further ingrain elements that include or benefit some, at the exclusion of others.

Harm: Amplification of power inequality
Description: This may perpetuate existing class or privilege disparities.
Consideration(s): How might this technology be used in contexts where there are existing social, economic, or class disparities? How might people with more power or privilege disproportionately influence the technology?
Example: Requiring a residential address and phone number to register on a job website could prevent a homeless person from applying for jobs.

Harm: Stereotype reinforcement
Description: This may perpetuate uninformed "conventional wisdom" about historically or statistically underrepresented people.
Consideration(s): How might this technology be used to reinforce or amplify existing social norms or cultural stereotypes? How might the data used by this technology cause it to reflect biases or stereotypes?
Example: Results of an image search for "CEO" could primarily show photos of Caucasian men.

Harm: Loss of individuality
Description: This may be an inability to express a unique perspective.
Consideration(s): How might this technology amplify majority opinions or "group-think"? Conversely, how might unique forms of expression be suppressed? In what ways might the data gathered by this technology be used in feedback to people?
Example: Limited customization options in designing a video game avatar inhibit self-expression of a player's diversity.

Harm: Loss of representation
Description: Broad categories of generalization obscure, diminish, or erase real identities.
Consideration(s): How could this technology constrain identity options? Could it be used to automatically label or categorize people?
Example: An automatic photo caption assigns incorrect gender identity and age to the subject.

Harm: Skill degradation and complacency
Description: Overreliance on automation leads to atrophy of manual skills.
Consideration(s): In what ways might this technology reduce the accessibility and ability to use manual controls?
Example: Overreliance on automation could lead to an inability to gauge an airplane's true orientation, because the pilots have been trained to rely on instruments only.

Evaluate harms
Once you have generated a broad list of potential harms, you should complete your Harms Model by evaluating
the potential magnitude for each category of harm. This will allow you to prioritize your areas of focus. See the
following example harms model for reference:

Severity: How acutely could an individual or group's well-being be impacted by the technology?
Scale: How broadly could the impact to well-being be experienced across populations or groups?
Probability: How likely is it that an individual or group's well-being will be impacted by the technology?
Frequency: How often would an individual or group experience an impact to their well-being from the technology?
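
To make the prioritization concrete, the contributing factors can be turned into a simple scoring rubric. The following Python sketch is illustrative only; the 1-to-4 scale, the equal weighting, and the harm names are assumptions, not part of the harms modeling guidance.

```python
# Illustrative only: rate each harm on the four contributing factors
# (1 = low, 4 = high) and sort by the combined score to prioritize work.
from dataclasses import dataclass


@dataclass
class HarmAssessment:
    name: str
    severity: int      # how acutely well-being could be impacted
    scale: int         # how broadly the impact could be experienced
    probability: int   # how likely the impact is
    frequency: int     # how often the impact would be experienced

    def score(self) -> int:
        return self.severity + self.scale + self.probability + self.frequency


harms = [
    HarmAssessment("Overreliance on safety features", severity=4, scale=2, probability=2, frequency=2),
    HarmAssessment("Loss of effective remedy", severity=3, scale=3, probability=2, frequency=1),
]

for harm in sorted(harms, key=HarmAssessment.score, reverse=True):
    print(f"{harm.score():>2}  {harm.name}")
```

A weighted sum, or separate thresholds per factor, would work equally well; the point is only to record the assessment so that it can be revisited as the product evolves.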

Next Steps
Use the Harms Model you developed to guide your product development work:
Seek more information from stakeholders that you identified as potentially experiencing harm.
Develop and validate hypotheses for addressing the areas you identified as having the highest potential for harm.
Integrate the insights into your decisions throughout the technology development process: data collection and model training, system architecture, user experience design, product documentation, feedback loops, and communication of the capabilities and limitations of the technology.
Explore Community Jury.
Assess and mitigate unfairness using Azure Machine Learning and the open-source Fairlearn package (a short assessment sketch follows this list).
Other Responsible AI tools:
Responsible AI resource center
Guidelines for Human AI interaction
Conversational AI guidelines
Inclusive Design guidelines
AI Fairness checklist
Additional references:
Value Sensitive Design
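
As a hedged illustration of the fairness-assessment step mentioned in the list above, the sketch below uses Fairlearn's MetricFrame to compare a model's accuracy and selection rate across groups. The toy labels, predictions, and group values are placeholders; consult the Fairlearn documentation for the current API.

```python
# A minimal sketch, assuming the fairlearn and scikit-learn packages are installed.
import numpy as np
from fairlearn.metrics import MetricFrame, selection_rate
from sklearn.metrics import accuracy_score

# Toy data standing in for real model output: true labels, predictions, and a
# sensitive feature value per example (all values are illustrative only).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
group = np.array(["A", "A", "A", "B", "B", "B", "B", "A"])

mf = MetricFrame(
    metrics={"accuracy": accuracy_score, "selection_rate": selection_rate},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=group,
)

print(mf.overall)       # metrics computed over all examples
print(mf.by_group)      # the same metrics broken out per group
print(mf.difference())  # largest between-group gap for each metric
```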
Community Jury

Community jury, an adaptation of the citizen jury, is a technique where diverse stakeholders impacted by a
technology are provided an opportunity to learn about a project, deliberate together, and give feedback on use
cases and product design. This technique allows project teams to understand the perceptions and concerns of
impacted stakeholders.
A community jury is different from a focus group or market research; it allows the impacted stakeholders to hear
directly from the subject matter experts in the product team, and cocreate solutions to challenging problems with
them.

Benefits
This section discusses some important benefits of community jury.
Expert Testimony
Members of a community jury hear details about the technology under consideration, directly from experts on the
product team. These experts help highlight the capabilities in particular applications.
Proximity
A community jury allows decision-makers to hear directly from the community, and to learn about their values,
concerns, and ideas regarding a particular issue or problem. It also provides a valuable opportunity to better
understand the reasons for their conclusions.
Consensus
By bringing impacted individuals together and providing an opportunity for them to learn and discuss key aspects
of the technology, a community jury can identify areas of agreement and build common ground solutions to
challenging problems.

Community jury contributors


Product Team
This group comprises owners who will bring the product to market, representative project managers, engineers,
designers, data scientists, and others involved in the making of a product. Additionally, subject matter experts are
included who can answer in-depth questions about the product.
Preparation
Effective deliberation and cocreation require ready-to-use answers to technical questions. As product owners, it is
important to ensure technical details are collected prior to the start of the jury session.
Relevant artifacts could include:
Documentation of existing protections, planned or in place.
Data flows: for example, what data types are collected, who will have access, for how long, with what protections,
and so on.
Mockups or prototypes of proposed solutions.
During the Session
Along with providing an accessible presentation of technical details, product team witnesses should describe
certain applications of the technology.
Relevant artifacts could include:
Customer benefits
Harms assessment and socio-technical impact
Storyboards
Competitive analyses
Academic or popular media reports
Moderator
Bring on a neutral user researcher to ensure everyone is heard, avoiding domination of the communications by
any one member. The moderator will facilitate brainstorms and deliberations, and educate jury members on
uncovering bias and on ways to ask difficult questions. If a user researcher is not available, choose a moderator who
is skilled at facilitating group discussions. Following the session, the moderator is responsible for the following:
Ensure that the agreed-upon compensation is provided to the jury members.
Produce a report that describes key insights, concerns, and recommendations.
Share key insights and next steps with the jury, and thank them for their participation.
Preparation
Structure the sessions so that there is ample time for learning, deliberation, and cocreation. This could mean
having multiple sessions that go in-depth on different topics or having longer sessions.
Pilot the jury with a smaller sample of community members to work out the procedural and content issues
prior to the actual sessions.
During the Session
Ensure that all perspectives are heard, including those of the project team and those of the jury members. This
minimizes group-thinking as well as one or two individuals dominating the discussion.
Reinforce the value of the juror participation by communicating the plan for integrating jury feedback into the
product planning process. Ensure that the project team is prepared to hear criticisms from the jury.
Jury members
These are direct and indirect stakeholders impacted by the technology, representative of the diverse community in
which the technology will be deployed.
Sample size
A jury should be large enough to represent the diversity and collective will of the community, but not so large that
the voices of quieter jurors are drowned out. It is ideal to get feedback from at least 16-20 individuals total who
meet the criteria below. You can split the community jury over multiple sessions so that no more than 8-10
individuals are part of one session.
Recruitment criteria
Stratify a randomly selected participant pool so that the jury includes the demographic diversity of the community.
Based on the project objectives, relevant recruitment criteria may include balancing across gender identity, age,
and privacy index. Recruitment should follow good user research recruitment procedures and reach a wider
community.

Session structure
Sessions typically last 2-3 hours. Add more or longer deep dive sessions, as needed, if aspects of the project
require more learning, deliberation, and cocreation.
1. Overview, introduction, and Q&A - The moderator provides a session overview, then introduces the
project team and explains the product's purpose, along with potential use cases, benefits, and harms.
Questions are then accepted from community members. This session should be between 30 to 60 minutes
long.
2. Discussion of key themes - Jury members ask in-depth questions about aspects of the project, fielded by
the moderator. This session should also be between 30 to 60 minutes in length.
3. This step can be any one of the following options:
Deliberation and cocreation - This option is preferable. Jury members deliberate and co-create
solutions with the project team. This is typically 30 to 60 minutes long.
Individual anonymous survey - Alternatively, conduct an anonymous survey of jury members.
Anonymity may allow issues to be raised that would not otherwise be expressed. Use this 30-minute
session if there are time constraints.
4. Following the session - The moderator produces a study report that describes key insights, concerns, and
potential solutions to the concerns.
If the values of different stakeholders were in conflict with each other during the session and the value tensions
were left unresolved, the product team would need to brainstorm solutions, and conduct a follow-up session with
the jury to determine if the solutions adequately resolve their concerns.

Tips to run a successful jury


Ensure alignment of goals and outcomes with the project team before planning begins, including deliverables,
recruitment strategy, and timeline. Consider including additional subject-matter experts relevant to the project,
to address open questions/concerns.
The consent to audio and video recording of the jury should follow your company's standard procedures for
non-disclosure and consent that is obtained from participants during user research studies.
Provide fair compensation to participants for the time they devote to participation. Whenever possible,
participants should also be reimbursed for costs incurred as a result of study participation, for example, parking
and transportation costs. Experienced user research recruiters can help determine fair gratuities for various
participant profiles.
Ensure that all perspectives are heard, minimizing group-thinking as well as one or two individuals dominating
the discussion. This should include those of the project team and the jury members.
Structure the sessions so that there is ample time for learning, deliberation, and cocreation. This could mean
having multiple sessions going in-depth on different topics or having longer sessions.
Pilot the jury with a smaller sample of community members to work out the procedural and content issues
prior to the actual sessions.
If the topic being discussed is related to personal data privacy, aim to balance the composition of the
community jury to include individuals with different privacy indices.
Consider including session breaks and providing snacks/refreshments for sessions that are two hours or longer.
Reinforce the value of the juror participation by communicating the plan for integrating jury feedback into the
product planning process. Also ensure that the project team is prepared to hear criticisms from the jury.
After the jury, in addition to publishing the research report, send out a thank you note to the volunteer
participants of the jury, along with an executive summary of the key insights.

Additional information
Privacy Index
The Privacy Index is an approximate measure for an individual's concern about personal data privacy, and is
gauged using the following:
1. Consumers have lost all control over how personal information is collected and used by companies.
2. Most businesses handle the personal information they collect about consumers in a proper and confidential
way.
3. Existing laws and organizational practices provide a reasonable level of protection for consumer privacy today.
Participants are asked to respond to each of the above statements using the scale: 1 - Strongly Agree, 2 -
Somewhat Agree, 3 - Somewhat Disagree, 4 - Strongly Disagree, and are classified into the categories below, where
A, B, and C refer to statements 1, 2, and 3 above.
High/Fundamentalist => IF A = 1 or 2 AND B & C = 3 or 4
Low/Unconcerned => IF A = 3 or 4 AND B & C = 1 or 2
Medium/Pragmatist => All other responses
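
As a small worked example, the classification rules above can be expressed directly in code. This sketch is only an illustration of the stated rules; the function name and input format are assumptions.

```python
def classify_privacy_index(a: int, b: int, c: int) -> str:
    """Classify a participant from their responses to statements 1 (A), 2 (B),
    and 3 (C), each on the scale 1 = Strongly Agree ... 4 = Strongly Disagree."""
    if a in (1, 2) and b in (3, 4) and c in (3, 4):
        return "High/Fundamentalist"
    if a in (3, 4) and b in (1, 2) and c in (1, 2):
        return "Low/Unconcerned"
    return "Medium/Pragmatist"


# Example: agrees that consumers have lost control (A=1), disagrees that
# businesses handle data properly (B=4) and that existing laws are adequate (C=3).
print(classify_privacy_index(1, 4, 3))  # -> High/Fundamentalist
```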

Next steps
Explore the following references for detailed information on this method:
Jefferson center: creator of the Citizen's Jury method https://jefferson-center.org/about-us/how-we-work/
Citizen's jury handbook http://www.rachel.org/files/document/Citizens_Jury_Handbook.pdf
Case study: Connected Health Cities (UK)
Project page
Final report
Jury specification
Juror's report
Case study: Community Jury at Microsoft
Azure for AWS Professionals

This article helps Amazon Web Services (AWS) experts understand the basics of Microsoft Azure accounts,
platform, and services. It also covers key similarities and differences between the AWS and Azure platforms.
You'll learn:
How accounts and resources are organized in Azure.
How available solutions are structured in Azure.
How the major Azure services differ from AWS services.
Azure and AWS built their capabilities independently over time so that each has important implementation and
design differences.

Overview
Like AWS, Microsoft Azure is built around a core set of compute, storage, database, and networking services. In
many cases, both platforms offer a basic equivalence between the products and services they offer. Both AWS and
Azure allow you to build highly available solutions based on Windows or Linux hosts. So, if you're used to
development using Linux and OSS technology, both platforms can do the job.
While the capabilities of both platforms are similar, the resources that provide those capabilities are often
organized differently. Exact one-to-one relationships between the services required to build a solution are not
always clear. In other cases, a particular service might be offered on one platform, but not the other. See charts of
comparable Azure and AWS services.

Services
For a listing of how services map between platforms, see AWS to Azure services comparison.
Not all Azure products and services are available in all regions. Consult the Products by Region page for details. You
can find the uptime guarantees and downtime credit policies for each Azure product or service on the Service Level
Agreements page.

Components
A number of core components on Azure and AWS have similar functionality. To review the differences, visit the
component page for the topic you're interested in:
Accounts
Compute
Databases
Messaging
Networking
Regions and Zones
Resources
Security & Identity
Storage
Azure and AWS accounts and subscriptions

Azure services can be purchased using several pricing options, depending on your organization's size and needs.
See the pricing overview page for details.
Azure subscriptions are a grouping of resources with an assigned owner responsible for billing and permissions
management. Unlike AWS, where any resources created under the AWS account are tied to that account,
subscriptions exist independently of their owner accounts, and can be reassigned to new owners as needed.

Comparison of structure and ownership of AWS accounts and Azure subscriptions

Subscriptions are assigned three types of administrator accounts:


Account Administrator. The subscription owner and the account billed for the resources used in the
subscription. The account administrator can only be changed by transferring ownership of the subscription.
Service Administrator. This account has rights to create and manage resources in the subscription, but is
not responsible for billing. By default, the account administrator and service administrator are assigned to
the same account. The account administrator can assign a separate user to the service administrator account
for managing the technical and operational aspects of a subscription. Only one service administrator is
assigned per subscription.
Co-administrator. There can be multiple co-administrator accounts assigned to a subscription. Co-
administrators cannot change the service administrator, but otherwise have full control over subscription
resources and users.
Below the subscription level, user roles and individual permissions can also be assigned to specific resources,
similar to how permissions are granted to IAM users and groups in AWS. In Azure, all user accounts are
associated with either a Microsoft Account or Organizational Account (an account managed through an Azure
Active Directory).
Like AWS accounts, subscriptions have default service quotas and limits. For a full list of these limits, see Azure
subscription and service limits, quotas, and constraints. These limits can be increased up to the maximum by filing
a support request in the management portal.
See also
How to add or change Azure administrator roles
How to download your Azure billing invoice and daily usage data
Compute services on Azure and AWS

EC2 Instances and Azure virtual machines


Although AWS instance types and Azure virtual machine sizes are categorized similarly, the RAM, CPU, and storage
capabilities differ.
Amazon EC2 Instance Types
Sizes for virtual machines in Azure (Windows)
Sizes for virtual machines in Azure (Linux)
Similar to AWS' per second billing, Azure on-demand VMs are billed per second.

EBS and Azure Storage for VM disks


Durable data storage for Azure VMs is provided by data disks residing in blob storage. This is similar to how EC2
instances store disk volumes on Elastic Block Store (EBS). Azure temporary storage also provides VMs the same
low-latency temporary read-write storage as EC2 Instance Storage (also called ephemeral storage).
Higher performance disk I/O is supported using Azure premium storage. This is similar to the Provisioned IOPS
storage options provided by AWS.

Lambda, Azure Functions, Azure Web-Jobs, and Azure Logic Apps


Azure Functions is the primary equivalent of AWS Lambda in providing serverless, on-demand code; a minimal
function sketch follows the list below. However, Lambda functionality also overlaps with other Azure services:
WebJobs allow you to create scheduled or continuously running background tasks.
Logic Apps provides communications, integration, and business rule management services.
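
As a point of comparison with a Lambda handler, here is a minimal HTTP-triggered Azure Functions handler using the Python v1 programming model. The function app scaffolding (the function.json bindings and host configuration) is assumed and not shown.

```python
# __init__.py of an HTTP-triggered function (Python v1 programming model).
# The HTTP trigger and output bindings are assumed to be declared in the
# accompanying function.json, which is not shown here.
import azure.functions as func


def main(req: func.HttpRequest) -> func.HttpResponse:
    # Read an optional "name" query-string parameter and echo a greeting.
    name = req.params.get("name", "world")
    return func.HttpResponse(f"Hello, {name}!", status_code=200)
```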

Autoscaling, Azure VM scaling, and Azure App Service Autoscale


Autoscaling in Azure is handled by two services:
Virtual machine scale sets allow you to deploy and manage an identical set of VMs. The number of instances
can autoscale based on performance needs.
App Service Autoscale provides the capability to autoscale Azure App Service solutions.

Container Service
The Azure Kubernetes Service supports Docker containers managed through Kubernetes.

Distributed Systems Platform


Service Fabric is a platform for developing and hosting scalable microservices-based solutions.

Batch Processing
Azure Batch allows you to manage compute-intensive work across a scalable collection of virtual machines.
Service Comparison
Virtual servers
AWS: Elastic Compute Cloud (EC2) Instances | Azure: Virtual Machines
Virtual servers allow users to deploy, manage, and maintain OS and server software. Instance types provide combinations of CPU/RAM. Users pay for what they use with the flexibility to change sizes.

AWS: Batch | Azure: Batch
Run large-scale parallel and high-performance computing applications efficiently in the cloud.

AWS: Auto Scaling | Azure: Virtual Machine Scale Sets
Allows you to automatically change the number of VM instances. You set defined metrics and thresholds that determine if the platform adds or removes instances.

AWS: VMware Cloud on AWS | Azure: Azure VMware Solution
Seamlessly move VMware-based workloads from your datacenter to Azure and integrate your VMware environment with Azure. Keep managing your existing environments with the same VMware tools you already know while you modernize your applications with Azure native services. Azure VMware Solution is a Microsoft service, verified by VMware, that runs on Azure infrastructure.

AWS: Parallel Cluster | Azure: CycleCloud
Create, manage, operate, and optimize HPC and big compute clusters of any scale.

Containers and container orchestrators


AWS: Elastic Container Service (ECS), Fargate | Azure: Container Instances
Azure Container Instances is the fastest and simplest way to run a container in Azure, without having to provision any virtual machines or adopt a higher-level orchestration service.

AWS: Elastic Container Registry | Azure: Container Registry
Allows customers to store Docker formatted images. Used to create all types of container deployments on Azure.

AWS: Elastic Kubernetes Service (EKS) | Azure: Kubernetes Service (AKS)
Deploy orchestrated containerized applications with Kubernetes. Simplify monitoring and cluster management through auto upgrades and a built-in operations console. See AKS solution journey.

AWS: App Mesh | Azure: Service Fabric Mesh
Fully managed service that enables developers to deploy microservices applications without managing virtual machines, storage, or networking.

Container architectures

Baseline architecture on Azure Kubernetes Service (AKS)
Deploy a baseline infrastructure that deploys an AKS cluster with focus on security.

Microservices architecture on Azure Kubernetes Service (AKS)
Deploy a microservices architecture on Azure Kubernetes Service (AKS).

CI/CD pipeline for container-based workloads
Build a DevOps pipeline for a Node.js web app with Jenkins, Azure Container Registry, Azure Kubernetes Service, Cosmos DB, and Grafana.
Serverless
AWS: Lambda | Azure: Functions
Integrate systems and run backend processes in response to events or schedules without provisioning or managing servers.

Serverless architectures
Social App for Mobile and Web with Authentication
View a detailed, step-by-step diagram depicting the build process and implementation of the mobile client app architecture that offers social image sharing with a companion web app and authentication abilities, even while offline.

HIPAA and HITRUST compliant health data AI
Manage HIPAA and HITRUST compliant health data and medical records with the highest level of built-in security.

Cross Cloud Scaling Architecture
Learn how to improve cross cloud scalability with solution architecture that includes Azure Stack. A step-by-step flowchart details instructions for implementation.

See also
Create a Linux VM on Azure using the portal
Azure Reference Architecture: Running a Linux VM on Azure
Get started with Node.js web apps in Azure App Service
Azure Reference Architecture: Basic web application
Create your first Azure Function
Database technologies on Azure and AWS

RDS and Azure relational database services


Azure provides several different relational database services that are the equivalent of AWS' Relational Database
Service (RDS).
SQL Database
Azure Database for MySQL
Azure Database for PostgreSQL
Other database engines such as SQL Server, Oracle, and MySQL can be deployed using Azure VM Instances.
Costs for AWS RDS are determined by the amount of hardware resources that your instance uses, like CPU, RAM,
storage, and network bandwidth. In the Azure database services, cost depends on your database size, concurrent
connections, and throughput levels.
See also
Azure SQL Database Tutorials
Configure geo-replication for Azure SQL Database with the Azure portal
Introduction to Cosmos DB: A NoSQL JSON Database
How to use Azure Table storage from Node.js

Analytics and big data


The Cortana Intelligence Suite is Azure's package of products and services designed to capture, organize, analyze,
and visualize large amounts of data. The Cortana suite consists of the following services:
HDInsight: managed Apache distribution that includes Hadoop, Spark, Storm, or HBase.
Data Factory: provides data orchestration and data pipeline functionality.
SQL Data Warehouse: large-scale relational data storage.
Data Lake Store: large-scale storage optimized for big data analytics workloads.
Machine Learning: used to build and apply predictive analytics on data.
Stream Analytics: real-time data analysis.
Data Lake Analytics: large-scale analytics service optimized to work with Data Lake Store
Power BI: used to power data visualization.

Service comparison
Type: Relational database | AWS: RDS | Azure: SQL Database, Database for MySQL, Database for PostgreSQL
Managed relational database service where resiliency, scale, and maintenance are primarily handled by the platform.

Type: NoSQL / Document | AWS: DynamoDB, SimpleDB, Amazon DocumentDB | Azure: Cosmos DB
A globally distributed, multi-model database that natively supports multiple data models: key-value, documents, graphs, and columnar.

Type: Caching | AWS: ElastiCache | Azure: Cache for Redis
An in-memory-based, distributed caching service that provides a high-performance store typically used to offload nontransactional work from a database.

Type: Database migration | AWS: Database Migration Service | Azure: Database Migration Service
Migration of database schema and data from one database format to a specific database technology in the cloud.

Database architectures

Gaming using Cosmos DB
Elastically scale your database to accommodate unpredictable bursts of traffic and deliver low-latency multi-player experiences on a global scale.

Oracle Database Migration to Azure
Oracle DB migrations can be accomplished in multiple ways. This architecture covers one of these options wherein Oracle Active Data Guard is used to migrate the database.

Retail and e-commerce using Azure MySQL
Build secure and scalable e-commerce solutions that meet the demands of both customers and business using Azure Database for MySQL.
See also
Cortana Intelligence Gallery
Understanding Microsoft big data solutions
Azure Data Lake and Azure HDInsight Blog
Messaging services on Azure and AWS

Simple Email Service


AWS provides the Simple Email Service (SES) for sending notification, transactional, or marketing emails. In Azure,
third-party solutions like SendGrid provide email services.

Simple Queue Service


AWS Simple Queue Service (SQS) provides a messaging system for connecting applications, services, and
devices within the AWS platform. Azure has two services that provide similar functionality; a minimal Queue storage sketch follows this list:
Queue storage: a cloud messaging service that allows communication between application components
within Azure.
Service Bus: a more robust messaging system for connecting applications, services, and devices. Using the
related Service Bus relay, Service Bus can also connect to remotely hosted applications and services.
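
To illustrate the Queue storage option described in the list above, the following sketch sends and receives a message with the azure-storage-queue client library (an assumption, since this article does not name an SDK); the connection string and queue name are placeholders.

```python
# A minimal sketch, assuming the azure-storage-queue package and an existing
# storage account; the connection string and queue name are placeholders.
from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string(
    conn_str="<storage-account-connection-string>",
    queue_name="orders",
)

queue.create_queue()  # creates the queue; raises if it already exists
queue.send_message("order-12345 received")

for message in queue.receive_messages():
    print(message.content)
    queue.delete_message(message)  # remove the message once it has been processed
```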

Messaging components
AWS: Simple Queue Service (SQS) | Azure: Queue Storage
Provides a managed message queueing service for communicating between decoupled application components.

AWS: Simple Queue Service (SQS) | Azure: Service Bus
Supports a set of cloud-based, message-oriented middleware technologies including reliable message queuing and durable publish/subscribe messaging.

AWS: Amazon EventBridge | Azure: Event Grid
A fully managed event routing service that allows for uniform event consumption using a publish/subscribe model.

Messaging architectures

Anomaly Detector Process
Learn more about Anomaly Detector with a step-by-step flowchart that details the process. See how anomaly detection models are selected with time-series data.
Networking on Azure and AWS

Elastic Load Balancing, Azure Load Balancer, and Azure Application Gateway


The Azure equivalents of the two Elastic Load Balancing services are:
Load Balancer: provides the same capabilities as the AWS Classic Load Balancer, allowing you to distribute
traffic for multiple VMs at the network level. It also provides failover capability.
Application Gateway: offers application-level rule-based routing comparable to the AWS Application Load
Balancer.

Route 53, Azure DNS, and Azure Traffic Manager


In AWS, Route 53 provides both DNS name management and DNS-level traffic routing and failover services. In
Azure this is handled through two services:
Azure DNS provides domain and DNS management.
Traffic Manager provides DNS level traffic routing, load balancing, and failover capabilities.

Direct Connect and Azure ExpressRoute


Azure provides similar site-to-site dedicated connections through its ExpressRoute service. ExpressRoute allows
you to connect your local network directly to Azure resources using a dedicated private network connection. Azure
also offers more conventional site-to-site VPN connections at a lower cost.

Network service comparison


Area: Cloud virtual networking | AWS: Virtual Private Cloud (VPC) | Azure: Virtual Network
Provides an isolated, private environment in the cloud. Users have control over their virtual networking environment, including selection of their own IP address range, creation of subnets, and configuration of route tables and network gateways.

Area: Cross-premises connectivity | AWS: VPN Gateway | Azure: VPN Gateway
Connects Azure virtual networks to other Azure virtual networks, or customer on-premises networks (Site To Site). Allows end users to connect to Azure services through VPN tunneling (Point To Site).

Area: DNS management | AWS: Route 53 | Azure: DNS
Manage your DNS records using the same credentials and billing and support contract as your other Azure services.

Area: DNS management | AWS: Route 53 | Azure: Traffic Manager
A service that hosts domain names, plus routes users to Internet applications, connects user requests to datacenters, manages traffic to apps, and improves app availability with automatic failover.

Area: Dedicated network | AWS: Direct Connect | Azure: ExpressRoute
Establishes a dedicated, private network connection from a location to the cloud provider (not over the Internet).

Area: Load balancing | AWS: Network Load Balancer | Azure: Load Balancer
Azure Load Balancer load-balances traffic at layer 4 (TCP or UDP).

Area: Load balancing | AWS: Application Load Balancer | Azure: Application Gateway
Application Gateway is a layer 7 load balancer. It supports SSL termination, cookie-based session affinity, and round robin for load-balancing traffic.

Networking architectures

Deploy highly available NVAs
Learn how to deploy network virtual appliances for high availability in Azure. This article includes example architectures for ingress, egress, and both.

Hub-spoke network topology in Azure
Learn how to implement a hub-spoke topology in Azure, where the hub is a virtual network and the spokes are virtual networks that peer with the hub.

Implement a secure hybrid network
See a secure hybrid network that extends an on-premises network to Azure with a perimeter network between the on-premises network and an Azure virtual network.

See also
Create a virtual network using the Azure portal
Plan and design Azure Virtual Networks
Azure Network Security Best Practices
Regions and zones on Azure and AWS

Failures can vary in the scope of their impact. Some hardware failures, such as a failed disk, may affect a single host
machine. A failed network switch could affect a whole server rack. Less common are failures that disrupt a whole
datacenter, such as loss of power in a datacenter. Rarely, an entire region could become unavailable.
One of the main ways to make an application resilient is through redundancy. But you need to plan for this
redundancy when you design the application. Also, the level of redundancy that you need depends on your
business requirements—not every application needs redundancy across regions to guard against a regional
outage. In general, a tradeoff exists between greater redundancy and reliability versus higher cost and complexity.
In Azure, a region is divided into two or more Availability Zones. An Availability Zone corresponds with a physically
isolated datacenter in the geographic region. Azure has numerous features for providing application redundancy at
every level of potential failure, including availability sets, availability zones, and paired regions.

(Diagram: VMs in an availability set within a virtual network; an availability zone with two availability sets in a virtual network; and regional pairs with resources in each region.)
The following table summarizes each option.

                     AVAILABILITY SET    AVAILABILITY ZONE           PAIRED REGION
Scope of failure     Rack                Datacenter                  Region
Request routing      Load Balancer       Cross-zone Load Balancer    Traffic Manager
Network latency      Very low            Low                         Mid to high
Virtual networking   VNet                VNet                        Cross-region VNet peering

Availability sets
To protect against localized hardware failures, such as a disk or network switch failing, deploy two or more VMs in
an availability set. An availability set consists of two or more fault domains that share a common power source and
network switch. VMs in an availability set are distributed across the fault domains, so if a hardware failure affects
one fault domain, network traffic can still be routed to the VMs in the other fault domains. For more information
about Availability Sets, see Manage the availability of Windows virtual machines in Azure.
When VM instances are added to availability sets, they are also assigned an update domain. An update domain is a
group of VMs that are set for planned maintenance events at the same time. Distributing VMs across multiple
update domains ensures that planned update and patching events affect only a subset of these VMs at any given
time.
Availability sets should be organized by the instance's role in your application to ensure one instance in each role is
operational. For example, in a three-tier web application, create separate availability sets for the front-end,
application, and data tiers.
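
As a hedged sketch of the guidance above, the following uses the Azure SDK for Python (the azure-identity and azure-mgmt-compute packages, which are assumptions not named in this article) to create an availability set with explicit fault and update domain counts. The subscription ID, resource group, names, and counts are illustrative placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

# Placeholders: subscription, resource group, and region are illustrative.
compute = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

avset = compute.availability_sets.create_or_update(
    "my-resource-group",
    "frontend-avset",  # one availability set per application role
    {
        "location": "eastus",
        "platform_fault_domain_count": 2,    # spread VMs across 2 fault domains
        "platform_update_domain_count": 5,   # and across 5 update domains
        "sku": {"name": "Aligned"},          # required when the VMs use managed disks
    },
)
print(avset.id)
```

Each VM then references this availability set at creation time, which is what causes the platform to distribute the instances across the fault and update domains.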
Availability zones
An Availability Zone is a physically separate zone within an Azure region. Each Availability Zone has a distinct
power source, network, and cooling. Deploying VMs across availability zones helps to protect an application
against datacenter-wide failures.

Paired regions
To protect an application against a regional outage, you can deploy the application across multiple regions, using
Azure Traffic Manager to distribute internet traffic to the different regions. Each Azure region is paired with another
region. Together, these form a regional pair. With the exception of Brazil South, regional pairs are located within the
same geography in order to meet data residency requirements for tax and law enforcement jurisdiction purposes.
Unlike Availability Zones, which are physically separate datacenters but may be in relatively nearby geographic
areas, paired regions are typically separated by at least 300 miles. This design ensures that large-scale disasters
only affect one of the regions in the pair. Neighboring pairs can be set to sync database and storage service data,
and are configured so that platform updates are rolled out to only one region in the pair at a time.
Azure geo-redundant storage is automatically backed up to the appropriate paired region. For all other resources,
creating a fully redundant solution using paired regions means creating a full copy of your solution in both regions.

See also
Regions for virtual machines in Azure
Availability options for virtual machines in Azure
High availability for Azure applications
Failure and disaster recovery for Azure applications
Planned maintenance for Linux virtual machines in Azure
Resource management on Azure and AWS

The term "resource" in Azure is used in the same way as in AWS, meaning any compute instance, storage object,
networking device, or other entity you can create or configure within the platform.
Azure resources are deployed and managed using one of two models: Azure Resource Manager, or the older Azure
classic deployment model. Any new resources are created using the Resource Manager model.

Resource groups
Both Azure and AWS have entities called "resource groups" that organize resources such as VMs, storage, and
virtual networking devices. However, Azure resource groups are not directly comparable to AWS resource groups.
While AWS allows a resource to be tagged into multiple resource groups, an Azure resource is always associated
with one resource group. A resource created in one resource group can be moved to another group, but can only
be in one resource group at a time. Resource groups are the fundamental grouping used by Azure Resource
Manager.
Resources can also be organized using tags. Tags are key-value pairs that allow you to group resources across your
subscription irrespective of resource group membership.

Management interfaces
Azure offers several ways to manage your resources:
Web interface. Like the AWS Dashboard, the Azure portal provides a full web-based management interface
for Azure resources.
REST API. The Azure Resource Manager REST API provides programmatic access to most of the features
available in the Azure portal.
Command Line. The Azure CLI provides a command-line interface capable of creating and managing Azure
resources. The Azure CLI is available for Windows, Linux, and Mac OS.
PowerShell. The Azure modules for PowerShell allow you to execute automated management tasks using a
script. PowerShell is available for Windows, Linux, and Mac OS.
Templates. Azure Resource Manager templates provide similar JSON template-based resource management
capabilities to the AWS CloudFormation service.
In each of these interfaces, the resource group is central to how Azure resources get created, deployed, or modified.
This is similar to the role a "stack" plays in grouping AWS resources during CloudFormation deployments.
The syntax and structure of these interfaces are different from their AWS equivalents, but they provide comparable
capabilities. In addition, many third-party management tools used on AWS, like Hashicorp's Terraform and Netflix
Spinnaker, are also available on Azure.
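
In addition to the interfaces listed above, the Azure SDKs wrap the same Resource Manager REST API. The following Python sketch assumes the azure-identity and azure-mgmt-resource packages (not named in this article) and uses placeholder names; it creates and tags a resource group, the grouping that every deployment in the interfaces above ultimately targets.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Create (or update) a resource group and tag it so the resources it contains
# can also be grouped across the subscription, independent of group membership.
group = client.resource_groups.create_or_update(
    "my-app-rg",
    {"location": "eastus", "tags": {"environment": "dev", "costCenter": "1234"}},
)
print(group.id)
```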

See also
Azure resource group guidelines
Security and identity on Azure and AWS

Directory service and Azure Active Directory


Azure splits up directory services into the following offerings:
Azure Active Directory: cloud-based directory and identity management service.
Azure Active Directory B2B: enables access to your corporate applications from partner-managed identities.
Azure Active Directory B2C: service offering support for single sign-on and user management for
consumer-facing applications.
Azure Active Directory Domain Services: hosted domain controller service, allowing Active Directory
compatible domain join and user management functionality.

Web application firewall


In addition to the Application Gateway Web Application Firewall, you can also use web application firewalls from
third-party vendors like Barracuda Networks.

See also
Getting started with Microsoft Azure security
Azure Identity Management and access control security best practices
Comparing storage on Azure and AWS

S3/EBS/EFS and Azure Storage


In the AWS platform, cloud storage is primarily broken down into three services:
Simple Storage Ser vice (S3) . Basic object storage that makes data available through an Internet
accessible API.
Elastic Block Storage (EBS) . Block level storage intended for access by a single VM.
Elastic File System (EFS) . File storage meant for use as shared storage for up to thousands of EC2
instances.
In Azure Storage, subscription-bound storage accounts allow you to create and manage the following storage
services:
Blob storage stores any type of text or binary data, such as a document, media file, or application installer
(a minimal upload sketch follows this list). You can set Blob storage for private access or share contents publicly
to the Internet. Blob storage serves the same purpose as both AWS S3 and EBS.
Table storage stores structured datasets. Table storage is a NoSQL key-attribute data store that allows for
rapid development and fast access to large quantities of data. Similar to AWS' SimpleDB and DynamoDB
services.
Queue storage provides messaging for workflow processing and for communication between components
of cloud services.
File storage offers shared storage for legacy applications using the standard server message block (SMB)
protocol. File storage is used in a similar manner to EFS in the AWS platform.
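
As a minimal illustration of the Blob storage service described in the list above (comparable to putting an object in an S3 bucket), the following sketch uploads and reads back a block blob with the azure-storage-blob package, which is an assumption since this article names no SDK. The connection string, container, and blob names are placeholders, and the container is assumed to already exist.

```python
from azure.storage.blob import BlobServiceClient

# Placeholders: connection string, container, and blob names are illustrative.
service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")
blob = service.get_blob_client(container="docs", blob="notes/readme.txt")

# Upload text as a block blob, overwriting any existing blob with the same
# name, then download it again and print its contents.
blob.upload_blob(b"hello from Azure Blob storage", overwrite=True)
print(blob.download_blob().readall())
```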

Glacier and Azure Storage


Azure Archive Blob Storage is comparable to AWS Glacier storage service. It is intended for rarely accessed data
that is stored for at least 180 days and can tolerate several hours of retrieval latency.
For data that is infrequently accessed but must be available immediately when accessed, Azure Cool Blob Storage
tier provides cheaper storage than standard blob storage. This storage tier is comparable to AWS S3 - Infrequent
Access storage service.

Storage comparison
Object storage
AWS: Simple Storage Services (S3) | Azure: Blob storage
Object storage service, for use cases including cloud applications, content distribution, backup, archiving, disaster recovery, and big data analytics.

Virtual server disks

AWS: Elastic Block Store (EBS) | Azure: Managed disks
SSD storage optimized for I/O intensive read/write operations. For use as high-performance Azure virtual machine storage.

Shared files

AWS: Elastic File System | Azure: Files
Provides a simple interface to create and configure file systems quickly, and share common files. Can be used with traditional protocols that access files over a network.

Archiving and backup

AWS: S3 Infrequent Access (IA) | Azure: Storage cool tier
Cool storage is a lower-cost tier for storing data that is infrequently accessed and long-lived.

AWS: S3 Glacier, Deep Archive | Azure: Storage archive access tier
Archive storage has the lowest storage cost and higher data retrieval costs compared to hot and cool storage.

AWS: Backup | Azure: Backup
Back up and recover files and folders from the cloud, and provide offsite protection against data loss.

Hybrid storage

AWS: Storage Gateway | Azure: StorSimple
Integrates on-premises IT environments with cloud storage. Automates data management and storage, plus supports disaster recovery.

AWS: DataSync | Azure: File Sync
Azure Files can be deployed in two main ways: by directly mounting the serverless Azure file shares or by caching Azure file shares on-premises using Azure File Sync.

Bulk data transfer

AWS: Import/Export Disk | Azure: Import/Export
A data transport solution that uses secure disks and appliances to transfer large amounts of data. Also offers data protection during transit.

AWS: Import/Export Snowball, Snowball Edge, Snowmobile | Azure: Data Box
Petabyte- to exabyte-scale data transport solution that uses secure data storage devices to transfer large amounts of data to and from Azure.

Storage architectures

HIPAA and HITRUST compliant health data AI
Manage HIPAA and HITRUST compliant health data and medical records with the highest level of built-in security.

HPC Media Rendering
Optimize the media rendering process with a step-by-step HPC solution architecture from Azure that combines Azure CycleCloud and HPC Cache.

Medical Data Storage Solutions
Store healthcare data effectively and affordably with cloud-based solutions from Azure. Manage medical records with the highest level of built-in security.

See also
Microsoft Azure Storage Performance and Scalability Checklist
Azure Storage security guide
Best practices for using content delivery networks (CDNs)
AWS to Azure services comparison
11/13/2020 • 21 minutes to read • Edit Online

This article helps you understand how Microsoft Azure services compare to Amazon Web Services (AWS). Whether
you are planning a multicloud solution with Azure and AWS, or migrating to Azure, you can compare the IT
capabilities of Azure and AWS services in all categories.
This article compares services that are roughly comparable. Not every AWS service or Azure service is listed, and
not every matched service has exact feature-for-feature parity.

Azure and AWS for multicloud solutions


As the leading public cloud platforms, Azure and AWS each offer a broad and deep set of capabilities with global
coverage. Yet many organizations choose to use both platforms together for greater choice and flexibility, as well
as to spread their risk and dependencies with a multicloud approach. Consulting companies and software vendors
might also build on and use both Azure and AWS, as these platforms represent most of the cloud market demand.
For an overview of Azure for AWS users, see Introduction to Azure for AWS professionals.

Marketplace
AWS SERVICE | AZURE SERVICE | DESCRIPTION

AWS Marketplace Azure Marketplace Easy-to-deploy and automatically


configured third-party applications,
including single virtual machine or
multiple virtual machine solutions.

AI and machine learning


AWS SERVICE | AZURE SERVICE | DESCRIPTION

SageMaker Machine Learning A cloud service to train, deploy,


automate, and manage machine
learning models.

Alexa Skills Kit Bot Framework Build and connect intelligent bots that
interact with your users using text/SMS,
Skype, Teams, Slack, Microsoft 365 mail,
Twitter, and other popular services.

Lex Speech Services API capable of converting speech to


text, understanding intent, and
converting text back to speech for
natural responsiveness.

Lex Language Understanding (LUIS) Allows your applications to understand


user commands contextually.

Polly, Transcribe Speech Services Enables both Speech to Text, and Text
into Speech capabilities.

Rekognition Cognitive Services Computer Vision: Extract information


from images to categorize and process
visual data.

Face: Detect, identify, and analyze faces


in photos.

Emotions: Recognize emotions in


images.

Skills Kit Virtual Assistant The Virtual Assistant Template brings


together a number of best practices
we've identified through the building of
conversational experiences and
automates integration of components
that we've found to be highly beneficial
to Bot Framework developers.

AI and machine learning architectures

Image classification on Azure


7/05/2018
4 min read
Learn how to build image processing into your applications by using Azure services such as the Computer
Vision API and Azure Functions.

Predictive Marketing with Machine Learning


12/16/2019
2 min read
Learn how to build a machine-learning model with Microsoft R Server on Azure HDInsight Spark clusters to
recommend actions to maximize the purchase rate.

Scalable personalization on Azure


5/31/2019
6 min read
Use machine learning to automate content-based personalization for customers.
view all

Big data and analytics


Data warehouse
AWS SERVICE | AZURE SERVICE | DESCRIPTION

Redshift Synapse Analytics Cloud-based Enterprise Data


Warehouse (EDW) that uses Massively
Parallel Processing (MPP) to quickly run
complex queries across petabytes of
data.

Lake Formation Data Share A simple and safe service for sharing
big data

Data warehouse architectures

Modern Data Warehouse Architecture


12/16/2019
2 min read
Explore a cloud data warehouse that uses big data. Modern data warehouse brings together all your data
and scales easily as your data grows.

Automated enterprise BI
6/03/2020
13 min read
Automate an extract, load, and transform (ELT) workflow in Azure using Azure Data Factory with Azure
Synapse Analytics.
view all
Big data processing
AWS SERVICE | AZURE SERVICE | DESCRIPTION

EMR Databricks Apache Spark-based analytics platform.

EMR HDInsight Managed Hadoop service. Deploy and


manage Hadoop clusters in Azure.

EMR Data Lake Storage Massively scalable, secure data lake


functionality built on Azure Blob
Storage.

Big data architectures

Azure data platform end-to-end


1/31/2020
7 min read
Use Azure services to ingest, process, store, serve, and visualize data from different sources.

Campaign Optimization with Azure HDInsight Spark Clusters


12/16/2019
4 min read
This solution demonstrates how to build and deploy a machine learning model with Microsoft R Server on
Azure HDInsight Spark clusters to recommend actions to maximize the purchase rate of leads targeted by a
campaign. This solution enables efficient handling of big data on Spark with Microsoft R Server.
view all
Data orchestration / ETL
AWS SERVICE | AZURE SERVICE | DESCRIPTION

Data Pipeline, Glue Data Factory Processes and moves data between
different compute and storage services,
as well as on-premises data sources at
specified intervals. Create, schedule,
orchestrate, and manage data pipelines.

Glue Data Catalog A fully managed service that serves as a


system of registration and system of
discovery for enterprise data sources

Dynamo DB Table Storage, Cosmos DB NoSQL key-value store for rapid


development using massive semi-
structured datasets.

Analytics and visualization


AWS SERVICE | AZURE SERVICE | DESCRIPTION
Kinesis Analytics | Stream Analytics, Data Lake Analytics, Data Lake Store | Storage and analysis platforms that create insights from large quantities of data, or data that originates from many sources.
QuickSight | Power BI | Business intelligence tools that build visualizations, perform ad hoc analysis, and develop business insights from data.
CloudSearch | Cognitive Search | Delivers full-text search and related search analytics and capabilities.
Athena | Data Lake Analytics | Provides a serverless interactive query service that uses standard SQL for analyzing databases.

Analytics architectures
Advanced Analytics Architecture
12/16/2019
2 min read
Get near real-time data analytics on streaming services. This big data architecture allows you to combine
any data at any scale with custom machine learning.

Automated enterprise BI
6/03/2020
13 min read
Automate an extract, load, and transform (ELT) workflow in Azure using Azure Data Factory with Azure
Synapse Analytics.

Mass ingestion and analysis of news feeds on Azure


2/01/2019
5 min read
Create a pipeline for ingesting and analyzing text, images, sentiment, and other data from RSS news feeds
using only Azure services, including Azure Cosmos DB and Azure Cognitive Services.
view all

Compute
Virtual servers
AWS SERVICE | AZURE SERVICE | DESCRIPTION

Elastic Compute Cloud (EC2) Instances Virtual Machines Virtual servers allow users to deploy,
manage, and maintain OS and server
software. Instance types provide
combinations of CPU/RAM. Users pay
for what they use with the flexibility to
change sizes.

Batch Batch Run large-scale parallel and high-


performance computing applications
efficiently in the cloud.

Auto Scaling Virtual Machine Scale Sets Allows you to automatically change the
number of VM instances. You set
defined metric and thresholds that
determine if the platform adds or
removes instances.

VMware Cloud on AWS Azure VMware Solution Seamlessly move VMware-based


workloads from your datacenter to
Azure and integrate your VMware
environment with Azure. Keep
managing your existing environments
with the same VMware tools you
already know while you modernize your
applications with Azure native services.
Azure VMware Solution is a Microsoft
service, verified by VMware, that runs
on Azure infrastructure.

Parallel Cluster CycleCloud Create, manage, operate, and optimize


HPC and big compute clusters of any
scale

Containers and container orchestrators


AWS SERVICE | AZURE SERVICE | DESCRIPTION

Elastic Container Service (ECS), Fargate | Container Instances | Azure Container Instances is the fastest and simplest way to run a container in Azure, without having to provision any virtual machines or adopt a higher-level orchestration service.

Elastic Container Registry Container Registry Allows customers to store Docker


formatted images. Used to create all
types of container deployments on
Azure.

Elastic Kubernetes Service (EKS) Kubernetes Service (AKS) Deploy orchestrated containerized
applications with Kubernetes. Simplify
monitoring and cluster management
through auto upgrades and a built-in
operations console. See AKS solution
journey.

App Mesh Service Fabric Mesh Fully managed service that enables
developers to deploy microservices
applications without managing virtual
machines, storage, or networking.

Container architectures

Baseline architecture on Azure Kubernetes Service (AKS)


07/20/2020
37 min read
Deploy a baseline infrastructure that deploys an AKS cluster with focus on security.

Microservices architecture on Azure Kubernetes Service (AKS)


5/07/2020
17 min read
Deploy a microservices architecture on Azure Kubernetes Service (AKS)

CI/CD pipeline for container-based workloads


7/05/2018
7 min read
Build a DevOps pipeline for a Node.js web app with Jenkins, Azure Container Registry, Azure Kubernetes
Service, Cosmos DB, and Grafana.
view all
Serverless
AWS SERVICE | AZURE SERVICE | DESCRIPTION

Lambda Functions Integrate systems and run backend


processes in response to events or
schedules without provisioning or
managing servers.

Serverless architectures

Social App for Mobile and Web with Authentication


12/16/2019
3 min read
View a detailed, step-by-step diagram depicting the build process and implementation of the mobile client
app architecture that offers social image sharing with a companion web app and authentication abilities,
even while offline.

HIPAA and HITRUST compliant health data AI


12/16/2019
2 min read
Manage HIPAA and HITRUST compliant health data and medical records with the highest level of built-in
security.

Cross Cloud Scaling Architecture


12/16/2019
1 min read
Learn how to improve cross cloud scalability with solution architecture that includes Azure Stack. A step-by-
step flowchart details instructions for implementation.
Database
TYPE | AWS SERVICE | AZURE SERVICE | DESCRIPTION
Relational database | RDS | SQL Database, Database for MySQL, Database for PostgreSQL | Managed relational database service where resiliency, scale, and maintenance are primarily handled by the platform.
NoSQL / Document | DynamoDB, SimpleDB, Amazon DocumentDB | Cosmos DB | A globally distributed, multi-model database that natively supports multiple data models: key-value, documents, graphs, and columnar.
Caching | ElastiCache | Cache for Redis | An in-memory–based, distributed caching service that provides a high-performance store typically used to offload nontransactional work from a database.
Database migration | Database Migration Service | Database Migration Service | Migration of database schema and data from one database format to a specific database technology in the cloud.

Database architectures

Gaming using Cosmos DB


12/16/2019
1 min read
Elastically scale your database to accommodate unpredictable bursts of traffic and deliver low-latency multi-
player experiences on a global scale.

Oracle Database Migration to Azure


12/16/2019
2 min read
Oracle DB migrations can be accomplished in multiple ways. This architecture covers one of these options
wherein Oracle Active Data Guard is used to migrate the Database.

Retail and e-commerce using Azure MySQL


12/16/2019
1 min read
Build secure and scalable e-commerce solutions that meet the demands of both customers and business
using Azure Database for MySQL.
view all

DevOps and application monitoring


AWS SERVICE | AZURE SERVICE | DESCRIPTION
CloudWatch, X-Ray | Monitor | Comprehensive solution for collecting, analyzing, and acting on telemetry from your cloud and on-premises environments.
CodeDeploy, CodeCommit, CodePipeline | DevOps | A cloud service for collaborating on code development.
Developer Tools | Developer Tools | Collection of tools for building, debugging, deploying, diagnosing, and managing multiplatform scalable apps and services.
CodeBuild | DevOps | Fully managed build service that supports continuous integration and deployment.
Command Line Interface | CLI, PowerShell | Built on top of the native REST API across all cloud services, various programming language-specific wrappers provide easier ways to create solutions.
OpsWorks (Chef-based) | Automation | Configures and operates applications of all shapes and sizes, and provides templates to create and manage a collection of resources.
CloudFormation | Resource Manager, VM extensions, Azure Automation | Provides a way for users to automate the manual, long-running, error-prone, and frequently repeated IT tasks.

DevOps architectures

Container CI/CD using Jenkins and Kubernetes on Azure Kubernetes Service (AKS)
12/16/2019
2 min read
Containers make it easy for you to continuously build and deploy applications. By orchestrating the
deployment of those containers using Azure Kubernetes Service (AKS), you can achieve replicable,
manageable clusters of containers.

Run a Jenkins server on Azure


4/30/2018
10 min read
Recommended architecture that shows how to deploy and operate a scalable, enterprise-grade Jenkins
server on Azure secured with single sign-on (SSO).

DevOps in a hybrid environment


12/16/2019
3 min read
The tools provided in Azure allow for the implementation of a DevOps strategy that capably manages both
cloud and on-premises environments in tandem.
view all

Internet of things (IoT)


AWS SERVICE | AZURE SERVICE | DESCRIPTION

IoT IoT Hub A cloud gateway for managing


bidirectional communication with
billions of IoT devices, securely and at
scale.

Greengrass IoT Edge Deploy cloud intelligence directly on IoT


devices to run in on-premises scenarios.

Kinesis Firehose, Kinesis Streams Event Hubs Services that allow the mass ingestion
of small data inputs, typically from
devices and sensors, to process and
route the data.

IoT Things Graph Digital Twins Azure Digital Twins is an IoT service that
helps you create comprehensive models
of physical environments. Create spatial
intelligence graphs to model the
relationships and interactions between
people, places, and devices. Query data
from a physical space rather than
disparate sensors.

IoT architectures

IoT Architecture – Azure IoT Subsystems


12/16/2019
1 min read
Learn about our recommended IoT application architecture that supports hybrid cloud and edge computing.
A flowchart details how the subsystems function within the IoT application.

Azure IoT reference architecture


9/10/2020
12 min read
Recommended architecture for IoT applications on Azure using PaaS (platform-as-a-service) components

Process real-time vehicle data using IoT


3/10/2020
7 min read
This example builds a real-time data ingestion/processing pipeline to ingest and process messages from IoT
devices into a big data analytic platform in Azure.
view all

Management
AWS SERVICE | AZURE SERVICE | DESCRIPTION

Trusted Advisor Advisor Provides analysis of cloud resource


configuration and security so
subscribers can ensure they're making
use of best practices and optimum
configurations.

Usage and Billing Report Billing API Services to help generate, monitor,
forecast, and share billing data for
resource usage by time, organization, or
product resources.

Management Console Portal A unified management console that


simplifies building, deploying, and
operating your cloud resources.

Application Discovery Service Migrate Assesses on-premises workloads for


migration to Azure, performs
performance-based sizing, and provides
cost estimations.

EC2 Systems Manager Monitor Comprehensive solution for collecting,


analyzing, and acting on telemetry from
your cloud and on-premises
environments.

Personal Health Dashboard Resource Health Provides detailed information about the
health of resources as well as
recommended actions for maintaining
resource health.

CloudTrail Monitor Comprehensive solution for collecting,


analyzing, and acting on telemetry from
your cloud and on-premises
environments.

CloudWatch | Application Insights | Application Insights is an extensible Application Performance Management (APM) service for developers and DevOps professionals.

Cost Explorer Cost Management Optimize cloud costs while maximizing


cloud potential.

Messaging and eventing


AWS SERVICE | AZURE SERVICE | DESCRIPTION

Simple Queue Service (SQS) Queue Storage Provides a managed message queueing
service for communicating between
decoupled application components.

Simple Queue Service (SQS) Service Bus Supports a set of cloud-based,


message-oriented middleware
technologies including reliable message
queuing and durable publish/subscribe
messaging.

Amazon EventBridge Event Grid A fully managed event routing service


that allows for uniform event
consumption using a publish/subscribe
model.

Messaging architectures

Anomaly Detector Process


12/16/2019
1 min read
Learn more about Anomaly Detector with a step-by-step flowchart that details the process. See how
anomaly detection models are selected with time-series data.

Mobile services
AWS SERVICE | AZURE SERVICE | DESCRIPTION
Mobile Hub | App Center, Xamarin Apps | Provides backend mobile services for rapid development of mobile solutions, identity management, data synchronization, and storage and notifications across devices.
Mobile SDK | App Center | Provides the technology to rapidly build cross-platform and native apps for mobile devices.
Cognito | App Center | Provides authentication capabilities for mobile applications.
Device Farm | App Center | Provides services to support testing mobile applications.
Mobile Analytics | App Center | Supports monitoring and feedback collection for debugging and analyzing the service quality of a mobile application.

Device Farm
The AWS Device Farm provides cross-device testing services. In Azure, Visual Studio App Center provides similar
cross-device front-end testing for mobile devices.
In addition to front-end testing, the Azure DevTest Labs provides back-end testing resources for Linux and
Windows environments.
Mobile architectures

Scalable web and mobile applications using Azure Database for PostgreSQL
12/16/2019
1 min read
Use Azure Database for PostgreSQL to rapidly build engaging, performant, and scalable cross-platform and
native apps for iOS, Android, Windows, or Mac.

Social App for Mobile and Web with Authentication


12/16/2019
3 min read
View a detailed, step-by-step diagram depicting the build process and implementation of the mobile client
app architecture that offers social image sharing with a companion web app and authentication abilities,
even while offline.

Task-Based Consumer Mobile App


12/16/2019
3 min read
Learn how the task-based consumer mobile app architecture is created with a step-by-step flow chart that
shows the integration with Azure App Service Mobile Apps, Visual Studio, and Xamarin to simplify the build
process.
view all

Networking
AREA | AWS SERVICE | AZURE SERVICE | DESCRIPTION

Cloud virtual networking Virtual Private Cloud (VPC) Virtual Network Provides an isolated, private
environment in the cloud.
Users have control over their
virtual networking
environment, including
selection of their own IP
address range, creation of
subnets, and configuration
of route tables and network
gateways.

Cross-premises connectivity VPN Gateway VPN Gateway Connects Azure virtual


networks to other Azure
virtual networks, or
customer on-premises
networks (Site To Site).
Allows end users to connect
to Azure services through
VPN tunneling (Point To
Site).

DNS management Route 53 DNS Manage your DNS records


using the same credentials
and billing and support
contract as your other Azure
services

DNS management | Route 53 | Traffic Manager | A service that hosts domain names, plus routes users to Internet applications, connects user requests to datacenters, manages traffic to apps, and improves app availability with automatic failover.

Dedicated network Direct Connect ExpressRoute Establishes a dedicated,


private network connection
from a location to the cloud
provider (not over the
Internet).

Load balancing Network Load Balancer Load Balancer Azure Load Balancer load-
balances traffic at layer 4
(TCP or UDP).

Application Load Balancer Application Gateway Application Gateway is a


layer 7 load balancer. It
supports SSL termination,
cookie-based session affinity,
and round robin for load-
balancing traffic.

Networking architectures

Deploy highly available NVAs


12/08/2018
7 min read
Learn how to deploy network virtual appliances for high availability in Azure. This article includes example
architectures for ingress, egress, and both.

Hub-spoke network topology in Azure


9/30/2020
7 min read
Learn how to implement a hub-spoke topology in Azure, where the hub is a virtual network and the spokes
are virtual networks that peer with the hub.

Implement a secure hybrid network


1/07/2020
9 min read
See a secure hybrid network that extends an on-premises network to Azure with a perimeter network
between the on-premises network and an Azure virtual network.
view all

Security, identity, and access


Authentication and authorization
AWS SERVICE | AZURE SERVICE | DESCRIPTION

Identity and Access Management (IAM) Azure Active Directory Allows users to securely control access
to services and resources while offering
data security and protection. Create
and manage users and groups, and use
permissions to allow and deny access to
resources.

Identity and Access Management (IAM) Role Based Access Control Role-based access control (RBAC) helps
you manage who has access to Azure
resources, what they can do with those
resources, and what areas they have
access to.

Organizations Subscription Management + RBAC Security policy and role management


for working with multiple accounts.

Multi-Factor Authentication Multi-Factor Authentication Safeguard access to data and


applications while meeting user demand
for a simple sign-in process.

Directory Service Azure Active Directory Domain Services Provides managed domain services
such as domain join, group policy, LDAP,
and Kerberos/NTLM authentication that
are fully compatible with Windows
Server Active Directory.

Cognito Azure Active Directory B2C A highly available, global, identity


management service for consumer-
facing applications that scales to
hundreds of millions of identities.

Organizations Policy Azure Policy is a service in Azure that


you use to create, assign, and manage
policies. These policies enforce different
rules and effects over your resources, so
those resources stay compliant with
your corporate standards and service
level agreements.

Organizations Management Groups Azure management groups provide a


level of scope above subscriptions. You
organize subscriptions into containers
called "management groups" and apply
your governance conditions to the
management groups. All subscriptions
within a management group
automatically inherit the conditions
applied to the management group.
Management groups give you
enterprise-grade management at a
large scale, no matter what type of
subscriptions you have.

Encryption
AWS SERVICE | AZURE SERVICE | DESCRIPTION
Server-side encryption with Amazon S3 Key Management Service | Azure Storage Service Encryption | Helps you protect and safeguard your data and meet your organizational security and compliance commitments.
Key Management Service (KMS), CloudHSM | Key Vault | Provides a security solution that works with other services by offering a way to manage, create, and control encryption keys stored in hardware security modules (HSM).

Firewall
AWS SERVICE | AZURE SERVICE | DESCRIPTION

Web Application Firewall Web Application Firewall A firewall that protects web applications
from common web exploits.

Web Application Firewall Firewall Provides inbound protection for non-


HTTP/S protocols, outbound network-
level protection for all ports and
protocols, and application-level
protection for outbound HTTP/S.

Security
AWS SERVICE | AZURE SERVICE | DESCRIPTION

Inspector Security Center An automated security assessment


service that improves the security and
compliance of applications.
Automatically assess applications for
vulnerabilities or deviations from best
practices.

Certificate Manager | App Service Certificates available on the Portal | Service that allows customers to create, manage, and consume certificates seamlessly in the cloud.

GuardDuty Advanced Threat Protection Detect and investigate advanced


attacks on-premises and in the cloud.

Artifact Service Trust Portal Provides access to audit reports,


compliance guides, and trust
documents from across cloud services.

Shield | DDoS Protection Service | Provides cloud services with protection from distributed denial of service (DDoS) attacks.

Security architectures

Real-time fraud detection


7/05/2018
4 min read
Detect fraudulent activity in real-time using Azure Event Hubs and Stream Analytics.
Securely managed web applications
5/09/2019
8 min read
Learn about deploying secure applications using the Azure App Service Environment, the Azure Application
Gateway service, and Web Application Firewall.

Threat indicators for cyber threat intelligence in Azure Sentinel


4/13/2020
13 min read
Import threat indicators, view logs, create rules to generate security alerts and incidents, and visualize threat
intelligence data with Azure Sentinel.
view all

Storage
Object storage
AWS SERVICE | AZURE SERVICE | DESCRIPTION

Simple Storage Services (S3) Blob storage Object storage service, for use cases
including cloud applications, content
distribution, backup, archiving, disaster
recovery, and big data analytics.

Virtual server disks


AWS SERVICE | AZURE SERVICE | DESCRIPTION

Elastic Block Store (EBS) managed disks SSD storage optimized for I/O intensive
read/write operations. For use as high-
performance Azure virtual machine
storage.

Shared files
AWS SERVICE | AZURE SERVICE | DESCRIPTION

Elastic File System Files Provides a simple interface to create


and configure file systems quickly, and
share common files. Can be used with
traditional protocols that access files
over a network.

Archiving and backup


AWS SERVICE | AZURE SERVICE | DESCRIPTION

S3 Infrequent Access (IA) Storage cool tier Cool storage is a lower-cost tier for
storing data that is infrequently
accessed and long-lived.

S3 Glacier, Deep Archive Storage archive access tier Archive storage has the lowest storage
cost and higher data retrieval costs
compared to hot and cool storage.

Backup Backup Back up and recover files and folders


from the cloud, and provide offsite
protection against data loss.

Hybrid storage
AWS SERVICE | AZURE SERVICE | DESCRIPTION

Storage Gateway StorSimple Integrates on-premises IT environments


with cloud storage. Automates data
management and storage, plus
supports disaster recovery.

DataSync File Sync Azure Files can be deployed in two main


ways: by directly mounting the
serverless Azure file shares or by
caching Azure file shares on-premises
using Azure File Sync.

Bulk data transfer


AWS SERVICE | AZURE SERVICE | DESCRIPTION

Import/Export Disk Import/Export A data transport solution that uses


secure disks and appliances to transfer
large amounts of data. Also offers data
protection during transit.

Import/Export Snowball, Snowball Edge, Snowmobile | Data Box | Petabyte- to exabyte-scale data transport solution that uses secure data storage devices to transfer large amounts of data to and from Azure.

Storage architectures

HIPAA and HITRUST compliant health data AI


12/16/2019
2 min read
Manage HIPAA and HITRUST compliant health data and medical records with the highest level of built-in
security.

HPC Media Rendering


11/04/2020
2 min read
Optimize the media rendering process with a step-by-step HPC solution architecture from Azure that
combines Azure CycleCloud and HPC Cache.

Medical Data Storage Solutions


12/16/2019
2 min read
Store healthcare data effectively and affordably with cloud-based solutions from Azure. Manage medical
records with the highest level of built-in security.
view all

Web applications
AWS SERVICE | AZURE SERVICE | DESCRIPTION

Elastic Beanstalk App Service Managed hosting platform providing


easy to use services for deploying and
scaling web applications and services.

API Gateway API Management A turnkey solution for publishing APIs


to external and internal consumers.

CloudFront Content Delivery Network A global content delivery network that


delivers audio, video, applications,
images, and other files.

Global Accelerator Front Door Easily join your distributed


microservices architectures into a single
global application using HTTP load
balancing and path-based routing rules.
Automate turning up new regions and
scale-out with API-driven global
actions, and independent fault-
tolerance to your back end
microservices in Azure-or anywhere.

LightSail App Service Build, deploy, and scale web apps on a


fully managed platform.

Web architectures

Architect scalable e-commerce web app


12/16/2019
1 min read
The e-commerce website includes simple order processing workflows with the help of Azure services. Using
Azure Functions and Web Apps, developers can focus on building personalized experiences and let Azure
take care of the infrastructure.
Multi-region N-tier application
6/18/2019
10 min read
Deploy an application on Azure virtual machines in multiple regions for high availability and resiliency.

Serverless web application


5/28/2019
16 min read
This reference architecture shows a serverless web application, which serves static content from Azure Blob
Storage and implements an API using Azure Functions.
view all

Miscellaneous
AREA | AWS SERVICE | AZURE SERVICE | DESCRIPTION
Backend process logic | Step Functions | Logic Apps | Cloud technology to build distributed applications using out-of-the-box connectors to reduce integration challenges. Connect apps, data, and devices on-premises or in the cloud.
Enterprise application services | WorkMail, WorkDocs | Microsoft 365 | Fully integrated cloud service providing communications, email, and document management in the cloud, available on a wide variety of devices.
Gaming | GameLift, GameSparks | PlayFab | Managed services for hosting dedicated game servers.
Media transcoding | Elastic Transcoder | Media Services | Services that offer broadcast-quality video streaming services, including various transcoding technologies.
Workflow | Simple Workflow Service (SWF) | Logic Apps | Serverless technology for connecting apps, data, and devices anywhere, whether on-premises or in the cloud, for large ecosystems of SaaS and cloud-based connectors.
Hybrid | Outposts | Stack | Azure Stack is a hybrid cloud platform that enables you to run Azure services in your company's or service provider's datacenter. As a developer, you can build apps on Azure Stack. You can then deploy them to either Azure Stack or Azure, or you can build truly hybrid apps that take advantage of connectivity between an Azure Stack cloud and Azure.
Media | Elemental MediaConvert | Media Services | Cloud-based media workflow platform to index, package, protect, and stream video at scale.

More learning
If you are new to Azure, review the interactive Core Cloud Services - Introduction to Azure module on Microsoft
Learn.
Azure for GCP Professionals
11/2/2020 • 8 minutes to read • Edit Online

This article helps Google Cloud Platform (GCP) experts understand the basics of Microsoft Azure accounts,
platform, and services. It also covers key similarities and differences between the GCP and Azure platforms.
You'll learn:
How accounts and resources are organized in Azure.
How available solutions are structured in Azure.
How the major Azure services differ from GCP services.
Azure and GCP built their capabilities independently over time, so each platform has important implementation and design differences.

Overview
Like GCP, Microsoft Azure is built around a core set of compute, storage, database, and networking services. In
many cases, both platforms offer a basic equivalence between the products and services they offer. Both GCP and
Azure allow you to build highly available solutions based on Linux or Windows hosts. So, if you're used to
development using Linux and OSS technology, both platforms can do the job.
While the capabilities of both platforms are similar, the resources that provide those capabilities are often
organized differently. Exact one-to-one relationships between the services required to build a solution are not
always clear. In other cases, a particular service might be offered on one platform, but not the other.

Managing accounts and subscription


Azure has a hierarchy of management groups, subscriptions, and resource groups to manage resources effectively. This is similar to the folder and project hierarchy for resources in Google Cloud.

Azure levels of management scope


Management groups: These groups are containers that help you manage access, policy, and compliance for
multiple subscriptions. All subscriptions in a management group automatically inherit the conditions applied to
the management group.
Subscriptions: A subscription logically associates user accounts and the resources that were created by those
user accounts. Each subscription has limits or quotas on the amount of resources you can create and use.
Organizations can use subscriptions to manage costs and the resources that are created by users, teams, or
projects.
Resource groups: A resource group is a logical container into which Azure resources like web apps, databases,
and storage accounts are deployed and managed.
Resources: Resources are instances of services that you create, like virtual machines, storage, or SQL
databases.
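As a concrete illustration, the upper levels of this hierarchy can be assembled with the Azure CLI. This is a minimal sketch; the management group name and subscription ID are placeholders.

```
# Create a management group and place a subscription under it;
# the subscription then inherits any policy and access assignments
# made at the management-group scope.
az account management-group create --name corp-platform
az account management-group subscription add \
    --name corp-platform \
    --subscription "00000000-0000-0000-0000-000000000000"
```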
Azure services can be purchased using several pricing options, depending on your organization's size and needs.
See the pricing overview page for details.
Azure subscriptions are a grouping of resources with an assigned owner responsible for billing and permissions
management.
Subscriptions are assigned three types of administrator accounts:
Account Administrator. The subscription owner and the account billed for the resources used in the subscription. The
account administrator can only be changed by transferring ownership of the subscription.
Service Administrator. This account has rights to create and manage resources in the subscription but is not
responsible for billing. By default, the account administrator and service administrator are assigned to the same
account. The account administrator can assign a separate user to the service administrator account for
managing the technical and operational aspects of a subscription. Only one service administrator is assigned
per subscription.
Co-administrator. There can be multiple co-administrator accounts assigned to a subscription. Co-administrators
cannot change the service administrator, but otherwise have full control over subscription resources and users.
Below the subscription level, user roles and individual permissions can also be assigned to specific resources. In
Azure, all user accounts are associated with either a Microsoft Account or an Organizational Account (an account
managed through an Azure Active Directory).
Subscriptions have default service quotas and limits. For a full list of these limits, see Azure subscription and
service limits, quotas, and constraints. These limits can be increased up to the maximum by filing a support request
in the management portal.
See also
How to add or change Azure administrator roles
How to download your Azure billing invoice and daily usage data

Resource management
The term "resource" in Azure means any compute instance, storage object, networking device, or other entity you
can create or configure within the platform.
Azure resources are deployed and managed using one of two models: Azure Resource Manager, or the older Azure
classic deployment model. Any new resources are created using the Resource Manager model.
Resource groups
Azure additionally has an entity called "resource groups" that organize resources such as VMs, storage, and virtual
networking devices. An Azure resource is always associated with one resource group. A resource created in one
resource group can be moved to another group but can only be in one resource group at a time. Resource groups
are the fundamental grouping used by Azure Resource Manager.
Resources can also be organized using tags. Tags are key-value pairs that allow you to group resources across your
subscription irrespective of resource group membership.
Management interfaces
Azure offers several ways to manage your resources:
Web interface. The Azure portal provides a full web-based management interface for Azure resources.
REST API. The Azure Resource Manager REST API provides programmatic access to most of the features
available in the Azure portal.
Command Line. The Azure CLI provides a command-line interface capable of creating and managing Azure
resources. The Azure CLI is available for Windows, Linux, and Mac OS.
PowerShell. The Azure modules for PowerShell allow you to execute automated management tasks using a
script. PowerShell is available for Windows, Linux, and Mac OS.
Templates. Azure Resource Manager templates provide JSON template-based resource management
capabilities.
In each of these interfaces, the resource group is central to how Azure resources get created, deployed, or modified.
In addition, many third-party management tools like Hashicorp's Terraform and Netflix Spinnaker, are also
available on Azure.
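A minimal sketch of the CLI and template workflows described above; the resource group name, tags, and template file are placeholders.

```
# Create a resource group and tag it for cost tracking.
az group create \
    --name rg-demo-eastus \
    --location eastus \
    --tags environment=dev costCenter=1234

# Deploy an Azure Resource Manager (ARM) template into that group.
az deployment group create \
    --resource-group rg-demo-eastus \
    --template-file azuredeploy.json \
    --parameters siteName=contoso-demo
```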
See also
Azure resource group guidelines

Regions and zones (high availability)


Failures can vary in the scope of their impact. Some hardware failures, such as a failed disk, may affect a single host
machine. A failed network switch could affect a whole server rack. Less common are failures that disrupt a whole
datacenter, such as loss of power in a datacenter. Rarely, an entire region could become unavailable.
One of the main ways to make an application resilient is through redundancy. But you need to plan for this
redundancy when you design the application. Also, the level of redundancy that you need depends on your
business requirements. Not every application needs redundancy across regions to guard against a regional outage.
In general, a tradeoff exists between greater redundancy and reliability versus higher cost and complexity.
In GCP, a region has two or more Availability Zones, and an Availability Zone corresponds to a physically isolated
datacenter in the geographic region. Azure has numerous features for providing application redundancy at every
level of potential failure, including availability sets, availability zones, and paired regions.
The following table summarizes each option.

                   | AVAILABILITY SET | AVAILABILITY ZONE        | PAIRED REGION
Scope of failure   | Rack             | Datacenter               | Region
Request routing    | Load Balancer    | Cross-zone Load Balancer | Traffic Manager
Network latency    | Very low         | Low                      | Mid to high
Virtual networking | VNet             | VNet                     | Cross-region VNet peering

Availability sets
To protect against localized hardware failures, such as a disk or network switch failing, deploy two or more VMs in
an availability set. An availability set consists of two or more fault domains that share a common power source and
network switch. VMs in an availability set are distributed across the fault domains, so if a hardware failure affects
one fault domain, network traffic can still be routed to the VMs in the other fault domains. For more information
about availability sets, see Manage the availability of Windows virtual machines in Azure.
When VM instances are added to availability sets, they are also assigned an update domain. An update domain is a
group of VMs that are set for planned maintenance events at the same time. Distributing VMs across multiple
update domains ensures that planned update and patching events affect only a subset of these VMs at any given
time.
Availability sets should be organized by the instance's role in your application to ensure one instance in each role is
operational. For example, in a three-tier web application, create separate availability sets for the front-end,
application, and data tiers.

Availability sets
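A minimal CLI sketch of the pattern described above, creating one availability set for the front-end role and spreading two VMs across its fault and update domains (resource names are placeholders):

```
az vm availability-set create \
    --resource-group rg-demo-eastus \
    --name avset-frontend \
    --platform-fault-domain-count 2 \
    --platform-update-domain-count 5

# Each VM placed in the set is automatically assigned a fault domain
# and an update domain.
for i in 1 2; do
  az vm create \
      --resource-group rg-demo-eastus \
      --name vm-frontend-$i \
      --image UbuntuLTS \
      --size Standard_DS1_v2 \
      --availability-set avset-frontend \
      --generate-ssh-keys
done
```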
Availability zones
Like GCP, Azure regions can have Availability zones. An Availability Zone is a physically separate zone within an
Azure region. Each Availability Zone has a distinct power source, network, and cooling. Deploying VMs across
availability zones helps to protect an application against datacenter-wide failures.

Zone redundant VM deployment on Azure
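Pinning a VM to a zone is a per-resource choice made at creation time. A minimal sketch with placeholder names; repeating the command with a different zone number behind a load balancer gives the tier zone redundancy.

```
# Place the VM (and its managed disks) in availability zone 1 of the region.
az vm create \
    --resource-group rg-demo-eastus \
    --name vm-zonal-1 \
    --image UbuntuLTS \
    --zone 1 \
    --generate-ssh-keys
```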


Paired regions
To protect an application against a regional outage, you can deploy the application across multiple regions, using
Azure Traffic Manager to distribute internet traffic to the different regions. Each Azure region is paired with another
region. Together, these form a regional pair. With the exception of Brazil South, regional pairs are located within the
same geography in order to meet data residency requirements for tax and law enforcement jurisdiction purposes.
Unlike Availability Zones, which are physically separate datacenters but may be in relatively nearby geographic
areas, paired regions are typically separated by at least 300 miles. This design ensures that large-scale disasters
only affect one of the regions in the pair. Neighboring pairs can be set to sync database and storage service data,
and are configured so that platform updates are rolled out to only one region in the pair at a time.
Azure geo-redundant storage is automatically backed up to the appropriate paired region. For all other resources,
creating a fully redundant solution using paired regions means creating a full copy of your solution in both regions.

Region Pairs in Azure
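For the multi-region pattern described above, a Traffic Manager profile fronts one endpoint per regional deployment. A minimal sketch; the resource group, DNS name, and the endpoint's target resource ID are placeholders.

```
az network traffic-manager profile create \
    --resource-group rg-demo-global \
    --name tm-demo \
    --routing-method Performance \
    --unique-dns-name contoso-tm-demo

# Add one endpoint per region; Traffic Manager health-probes each endpoint
# and steers DNS responses away from an unhealthy region.
az network traffic-manager endpoint create \
    --resource-group rg-demo-global \
    --profile-name tm-demo \
    --name eastus2-web \
    --type azureEndpoints \
    --target-resource-id "/subscriptions/<sub-id>/resourceGroups/rg-eastus2/providers/Microsoft.Web/sites/contoso-eastus2"
```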


See also
Regions for virtual machines in Azure
Availability options for virtual machines in Azure
High availability for Azure applications
Failure and disaster recovery for Azure applications
Planned maintenance for Linux virtual machines in Azure

Services
For a listing of how services map between platforms, see GCP to Azure services comparison.
Not all Azure products and services are available in all regions. Consult the Products by Region page for details. You
can find the uptime guarantees and downtime credit policies for each Azure product or service on the Service Level
Agreements page.

Next steps
Get started with Azure
Azure solution architectures
Azure Reference Architectures
GCP to Azure services comparison
11/13/2020 • 20 minutes to read • Edit Online

This article helps you understand how Microsoft Azure services compare to Google Cloud Platform (GCP).
Whether you are planning a multi-cloud solution with Azure and GCP, or migrating to Azure, you can compare the
IT capabilities of Azure and GCP services in all categories.
This article compares services that are roughly comparable. Not every GCP service or Azure service is listed, and
not every matched service has exact feature-for-feature parity.
For an overview of Azure for GCP users, see Introduction to Azure for GCP Professionals.

Marketplace
GCP SERVICE | AZURE SERVICE | DESCRIPTION

GCP Marketplace Azure Marketplace Easy-to-deploy and automatically


configured third-party applications,
including single virtual machine or
multiple virtual machine solutions.

AI and Machine Learning


GCP SERVICE | AZURE SERVICE | DESCRIPTION

AI Hub Azure Machine Learning A cloud service to train, deploy,


automate, and manage machine
learning models.

TensorFlow ML.NET ML.NET is an open source and cross-


platform machine learning framework
for both machine learning & AI.

TensorFlow | ONNX (Open Neural Network Exchange) | ONNX is an open format built to represent machine learning models that facilitates maximum compatibility and increased inference performance.

AI Building blocks - Sight Azure Cognitive Services Computer Use visual data processing to label
Vision content, from objects to concepts,
extract printed and handwritten text,
recognize familiar subjects like brands
and landmarks, and moderate content.
No machine learning expertise is
required.

AI Building blocks - Language Azure Cognitive Services Text Analytics Cloud-based services that provides
advanced natural language processing
over raw text, and includes four main
functions: sentiment analysis, key
phrase extraction, language detection,
and named entity recognition.

AI Building blocks - Language Azure Cognitive Services Language A machine learning-based service to
Understanding (LUIS) build natural language understanding
into apps, bots, and IoT devices. Quickly
create enterprise-ready, custom models
that continuously improve.

AI Building blocks - Conversation Azure Cognitive Services Speech To Text Swiftly convert audio into text from a
variety of sources. Customize models to
overcome common speech recognition
barriers, such as unique vocabularies,
speaking styles, or background noise.

AI Building blocks – Structured Data | Azure ML - Automated Machine Learning | Empower professional and non-professional data scientists to build machine learning models rapidly. Automate time-consuming and iterative tasks of model development using breakthrough research, and accelerate time to market. Available in Azure Machine Learning, Power BI, ML.NET, and Visual Studio.

AI Building blocks – Structured Data ML.NET Model Builder ML.NET Model Builder provides an easy
to understand visual interface to build,
train, and deploy custom machine
learning models. Prior machine learning
expertise is not required. Model Builder
supports AutoML, which automatically
explores different machine learning
algorithms and settings to help you find
the one that best suits your scenario.

AI Building blocks – Cloud AutoML Azure Cognitive Services Custom Vision Customize and embed state-of-the-art
computer vision for specific domains.
Build frictionless customer experiences,
optimize manufacturing processes,
accelerate digital marketing campaigns-
and more. No machine learning
expertise is required.

AI Building blocks – Cloud AutoML Video Indexer Easily extract insights from your videos
and quickly enrich your applications to
enhance discovery and engagement.

AI Building blocks – Cloud AutoML Azure Cognitive Services QnA Maker Build, train and publish a sophisticated
bot using FAQ pages, support websites,
product manuals, SharePoint
documents or editorial content through
an easy-to-use UI or via REST APIs.

AI Platform Notebooks Azure Notebooks Develop and run code from anywhere
with Jupyter notebooks on Azure.

Deep Learning VM Image Data Science Virtual Machines Pre-Configured environments in the
cloud for Data Science and AI
Development.

Deep Learning Containers | GPU support on Azure Kubernetes Service (AKS) | Graphical processing units (GPUs) are often used for compute-intensive workloads such as graphics and visualization workloads. AKS supports the creation of GPU-enabled node pools to run these compute-intensive workloads in Kubernetes.

Data Labeling Service Azure ML - Data Labeling A central place to create, manage, and
monitor labeling projects (public
preview). Use it to coordinate data,
labels, and team members to efficiently
manage labeling tasks. Machine
Learning supports image classification,
either multi-label or multi-class, and
object identification with bounded
boxes.

AI Platform Training Azure ML – Compute Targets Designated compute


resource/environment where you run
your training script or host your service
deployment. This location may be your
local machine or a cloud-based
compute resource. Using compute
targets make it easy for you to later
change your compute environment
without having to change your code.

AI Platform Predictions Azure ML - Deployments Deploy your machine learning model as


a web service in the Azure cloud or to
Azure IoT Edge devices. Leverage
serverless Azure Functions for model
inference for dynamic scale.

Continuous Evaluation Azure ML – Data Drift Monitor for data drift between the
training dataset and inference data of a
deployed model. In the context of
machine learning, trained machine
learning models may experience
degraded prediction performance
because of drift. With Azure Machine
Learning, you can monitor data drift
and the service can send an email alert
to you when drift is detected.

What-If Tool Azure ML – Model Interpretability Ensure machine learning model


compliance with company policies,
industry standards, and government
regulations.

Cloud TPU | Azure ML – FPGA (Field Programmable Gate Arrays) | FPGAs contain an array of programmable logic blocks, and a hierarchy of reconfigurable interconnects. The interconnects allow these blocks to be configured in various ways after manufacturing. Compared to other chips, FPGAs provide a combination of programmability and performance.

Kubeflow Machine Learning Operations (MLOps) MLOps, or DevOps for machine


learning, enables data science and IT
teams to collaborate and increase the
pace of model development and
deployment via monitoring, validation,
and governance of machine learning
models.

Dialogflow Microsoft Bot Framework Build and connect intelligent bots that
interact with your users using text/SMS,
Skype, Teams, Slack, Microsoft 365 mail,
Twitter, and other popular services.

Big data and analytics


Data warehouse
GCP SERVICE | AZURE SERVICE | DESCRIPTION

BigQuery Azure Synapse Analytics Cloud-based Enterprise Data


Warehouse (EDW) that uses Massively
Parallel Processing (MPP) to quickly run
complex queries across petabytes of
data.

BigQuery SQL Server Big Data Clusters Allow you to deploy scalable clusters of
SQL Server, Spark, and HDFS containers
running on Kubernetes. These
components are running side by side to
enable you to read, write, and process
big data from Transact-SQL or Spark,
allowing you to easily combine and
analyze your high-value relational data
with high-volume big data.

Big data processing


GCP SERVICE | AZURE SERVICE | DESCRIPTION

Dataproc Azure HDInsight Managed Apache Spark-based analytics


platform.

Data Orchestration and ETL


GCP SERVICE | AZURE SERVICE | DESCRIPTION

Cloud Data Fusion Azure Data Factory Processes and moves data between
different compute and storage services,
as well as on-premises data sources at
specified intervals. Create, schedule,
orchestrate, and manage data pipelines.

Cloud Data Catalog Azure Data Catalog A fully managed service that serves as a
system of registration and system of
discovery for enterprise data sources
Analytics and visualization
GCP SERVICE | AZURE SERVICE | DESCRIPTION
Cloud Dataflow | Azure Databricks | Managed platform for streaming batch data based on Open Source Apache products.
Datastudio, Looker | Power BI | Business intelligence tools that build visualizations, perform ad hoc analysis, and develop business insights from data.
Cloud Search | Azure Search | Delivers full-text search and related search analytics and capabilities.
BigQuery | SQL Server – ML Services, Big Data Clusters (Spark), SQL Server Analysis Services | Provides a serverless non-cloud interactive query service that uses standard SQL for analyzing databases.

Compute
Virtual servers
GCP SERVICE | AZURE SERVICE | DESCRIPTION

Compute Engine Azure Virtual Machines Virtual servers allow users to deploy,
manage, and maintain OS and server
software. Instance types provide
combinations of CPU/RAM. Users pay
for what they use with the flexibility to
change sizes.

Batch Azure Batch Run large-scale parallel and high-


performance computing applications
efficiently in the cloud.

Compute Engine Managed Instance Groups | Azure Virtual Machine Scale Sets | Allows you to automatically change the number of VM instances. You set a defined metric and thresholds that determine if the platform adds or removes instances.

VMware as a service Azure VMware by CloudSimple Redeploy and extend your VMware-
based enterprise workloads to Azure
with Azure VMware Solution by
CloudSimple. Keep using the VMware
tools you already know to manage
workloads on Azure without disrupting
network, security, or data protection
policies.

Containers and container orchestrators


GCP SERVICE | AZURE SERVICE | DESCRIPTION

Cloud Run Azure Container Instances Azure Container Instances is the fastest
and simplest way to run a container in
Azure, without having to provision any
virtual machines or adopt a higher-level
orchestration service.

Artifact Registry (beta), Container Registry | Azure Container Registry | Allows customers to store Docker formatted images. Used to create all types of container deployments on Azure.

Kubernetes Engine (GKE) Azure Kubernetes Service (AKS) Deploy orchestrated containerized
applications with Kubernetes. Simplify
cluster management and monitoring
through automatic upgrades and a
built-in operations console. See AKS
solution journey.

Kubernetes Engine Monitoring Azure Monitor for containers Azure Monitor for containers is a
feature designed to monitor the
performance of container workloads
deployed to: Managed Kubernetes
clusters hosted on Azure Kubernetes
Service (AKS); Self-managed Kubernetes
clusters hosted on Azure using AKS
Engine; Azure Container Instances, Self-
managed Kubernetes clusters hosted
on Azure Stack or on-premises; or
Azure Red Hat OpenShift.

Anthos Service Mesh Service Fabric Mesh Fully managed service that enables
developers to deploy microservices
applications without managing virtual
machines, storage, or networking.

Container architectures
Here are some architectures that use AKS as the orchestrator.

Baseline architecture on Azure Kubernetes Service (AKS)


07/20/2020
37 min read
Deploy a baseline infrastructure that deploys an AKS cluster with focus on security.

Microservices architecture on Azure Kubernetes Service (AKS)


5/07/2020
17 min read
Deploy a microservices architecture on Azure Kubernetes Service (AKS)
CI/CD pipeline for container-based workloads
7/05/2018
7 min read
Build a DevOps pipeline for a Node.js web app with Jenkins, Azure Container Registry, Azure Kubernetes
Service, Cosmos DB, and Grafana.
view all
Functions
GCP SERVICE | AZURE SERVICE | DESCRIPTION

Cloud Functions Azure Functions Integrate systems and run backend


processes in response to events or
schedules without provisioning or
managing servers.

Database
TYPE | GCP SERVICE | AZURE SERVICE | DESCRIPTION
Relational database (PaaS) | Cloud SQL (SQL Server, MySQL, PostgreSQL) | SQL Server Options, Azure Database for MySQL, Azure Database for PostgreSQL | Managed relational database service where resiliency, security, scale, and maintenance are primarily handled by the platform.
Relational database (PaaS) | Cloud Spanner | Azure Cosmos DB | Managed relational database service where resiliency, security, scale, and maintenance are primarily handled by the platform.
NoSQL (PaaS) | Cloud Bigtable, Cloud Firestore, Firebase Realtime Database | Azure Cosmos DB | Globally distributed, multi-model database that natively supports multiple data models: key-value, documents, graphs, and columnar.
Caching | Cloud Memorystore, Redis Enterprise Cloud | Azure Cache for Redis | An in-memory–based, distributed caching service that provides a high-performance store typically used to offload non-transactional work from a database.

DevOps and application monitoring


GCP SERVICE | AZURE SERVICE | DESCRIPTION

Operations (formerly Stackdriver) Azure Monitor Maximizes the availability and


performance of your applications and
services by delivering a comprehensive
solution for collecting, analyzing, and
acting on telemetry from your cloud
and on-premises environments. It helps
you understand how your applications
are performing and proactively
identifies issues affecting them and the
resources on which they depend.

Cloud Trace Azure Monitor Maximizes the availability and


performance of your applications and
services by delivering a comprehensive
solution for collecting, analyzing, and
acting on telemetry from your cloud
and on-premises environments. It helps
you understand how your applications
are performing and proactively
identifies issues affecting them and the
resources on which they depend.

Cloud Debugger Azure Monitor Maximizes the availability and


performance of your applications and
services by delivering a comprehensive
solution for collecting, analyzing, and
acting on telemetry from your cloud
and on-premises environments. It helps
you understand how your applications
are performing and proactively
identifies issues affecting them and the
resources on which they depend.

Cloud Profiler Azure Monitor Maximizes the availability and


performance of your applications and
services by delivering a comprehensive
solution for collecting, analyzing, and
acting on telemetry from your cloud
and on-premises environments. It helps
you understand how your applications
are performing and proactively
identifies issues affecting them and the
resources on which they depend.

Cloud Source Repositories Azure DevOps Repos, GitHub Repos A cloud service for collaborating on
code development.

Cloud Build Azure DevOps Pipelines, GitHub Actions Fully managed build service that
supports continuous integration and
deployment.

Artifact Registry Azure DevOps Artifacts, GitHub Add fully integrated package
Packages management to your continuous
integration/continuous delivery (CI/CD)
pipelines with a single click. Create and
share Maven, npm, NuGet, and Python
package feeds from public and private
sources with teams of any size.
GC P SERVIC E A Z URE SERVIC E DESC RIP T IO N

Cloud Developer Tools (including Cloud Azure Developer Tools Collection of tools for building,
Code) debugging, deploying, diagnosing, and
managing multiplatform scalable apps
and services.

Gcloud SDK Azure CLI The Azure command-line interface


(Azure CLI) is a set of commands used
to create and manage Azure resources.
The Azure CLI is available across Azure
services and is designed to get you
working quickly with Azure, with an
emphasis on automation.

Cloud Shell Azure Cloud Shell Azure Cloud Shell is an interactive,


authenticated, browser-accessible shell
for managing Azure resources. It
provides the flexibility of choosing the
shell experience that best suits the way
you work, either Bash or PowerShell.

PowerShell on GCP Azure PowerShell Azure PowerShell is a set of cmdlets for


managing Azure resources directly from
the PowerShell command line. Azure
PowerShell is designed to make it easy
to learn and get started with, but
provides powerful features for
automation. Written in .NET Standard,
Azure PowerShell works with PowerShell
5.1 on Windows, and PowerShell 6.x
and higher on all platforms.

Cloud Deployment Manager Azure Automation Delivers a cloud-based automation and


configuration service that supports
consistent management across your
Azure and non-Azure environments. It
comprises process automation,
configuration management, update
management, shared capabilities, and
heterogeneous features. Automation
gives you complete control during
deployment, operations, and
decommissioning of workloads and
resources.

Cloud Deployment Manager Azure Resource Manager Provides a way for users to automate
the manual, long-running, error-prone,
and frequently repeated IT tasks.

Internet of things (IoT)


GCP service | Azure service | Description
Cloud IoT Core | Azure IoT Hub, Azure Event Hubs | A cloud gateway for managing bidirectional communication with billions of IoT devices, securely and at scale.
Cloud Pub/Sub | Azure Stream Analytics, HDInsight Kafka | Process and route streaming data to a subsequent processing engine or to a storage or database platform.
Edge TPU | Azure IoT Edge | Deploy cloud intelligence directly on IoT devices to run in on-premises scenarios.

Management
GCP service | Azure service | Description
Cloud Billing | Azure Billing API | Services to help generate, monitor, forecast, and share billing data for resource usage by time, organization, or product resources.
Cloud Console | Azure portal | A unified management console that simplifies building, deploying, and operating your cloud resources.
Operations (formerly Stackdriver) | Azure Monitor | Comprehensive solution for collecting, analyzing, and acting on telemetry from your cloud and on-premises environments.

Messaging and eventing


GCP service | Azure service | Description
Cloud Pub/Sub | Azure Service Bus | Supports a set of cloud-based, message-oriented middleware technologies including reliable message queuing and durable publish/subscribe messaging.
Cloud Pub/Sub | Azure Event Grid | A fully managed event routing service that allows for uniform event consumption using a publish/subscribe model.

Networking
Area | GCP service | Azure service | Description
Cloud virtual networking | Virtual Private Network (VPC) | Azure Virtual Network (VNet) | Provides an isolated, private environment in the cloud. Users have control over their virtual networking environment, including selection of their own IP address range, adding/updating address ranges, creation of subnets, and configuration of route tables and network gateways.
DNS management | Cloud DNS | Azure DNS | Manage your DNS records using the same credentials and billing and support contract as your other Azure services.
DNS management | Cloud DNS | Azure Traffic Manager | Azure Traffic Manager is a DNS-based load balancer that enables you to distribute traffic optimally to services across global Azure regions, while providing high availability and responsiveness.
Hybrid connectivity | Cloud Interconnect | Azure ExpressRoute | Establishes a private network connection from a location to the cloud provider (not over the Internet).
Hybrid connectivity | Cloud VPN Gateway | Azure Virtual Network Gateway | Connects Azure virtual networks to other Azure virtual networks, or customer on-premises networks (site-to-site). Allows end users to connect to Azure services through VPN tunneling (point-to-site).
Hybrid connectivity | Cloud VPN Gateway | Azure Virtual WAN | Azure Virtual WAN simplifies large-scale branch connectivity with VPN and ExpressRoute.
Hybrid connectivity | Cloud Router | Azure Virtual Network Gateway | Enables dynamic route exchange using BGP.
Load balancing | Network Load Balancing | Azure Load Balancer | Azure Load Balancer load-balances traffic at layer 4 (all TCP or UDP).
Load balancing | Global load balancing | Azure Front Door | Azure Front Door enables global load balancing across regions using a single anycast IP.
Load balancing | Global load balancing | Azure Application Gateway | Application Gateway is a layer 7 load balancer. It takes backends with any IP that is reachable. It supports SSL termination, cookie-based session affinity, and round robin for load-balancing traffic.
Load balancing | Global load balancing | Azure Traffic Manager | Azure Traffic Manager is a DNS-based load balancer that enables you to distribute traffic optimally to services across global Azure regions, while providing high availability and responsiveness.
Content delivery network | Cloud CDN | Azure CDN | A content delivery network (CDN) is a distributed network of servers that can efficiently deliver web content to users.
Firewall | Firewall rules | Application security groups | Azure application security groups allow you to group virtual machines and define network security policies based on those groups.
Firewall | Firewall rules | Network security groups | An Azure network security group filters network traffic to and from Azure resources in an Azure virtual network.
Firewall | Firewall rules | Azure Firewall | Azure Firewall is a managed, cloud-based network security service that protects your Azure Virtual Network resources. It's a fully stateful firewall as a service with built-in high availability and unrestricted cloud scalability.
Web Application Firewall | Cloud Armor | Application Gateway - Web Application Firewall | Azure Web Application Firewall (WAF) provides centralized protection of your web applications from common exploits and vulnerabilities.
Web Application Firewall | Cloud Armor | Front Door - Azure Web Application Firewall | Azure Web Application Firewall (WAF) on Azure Front Door provides centralized protection for your web applications.
Web Application Firewall | Cloud Armor | CDN - Azure Web Application Firewall | Azure Web Application Firewall (WAF) on Azure Content Delivery Network (CDN) from Microsoft provides centralized protection for your web content.
NAT gateway | Cloud NAT | Azure Virtual Network NAT | Virtual Network NAT (network address translation) provides outbound NAT translations for internet connectivity for virtual networks.
Private connectivity to PaaS | VPC Service Controls | Azure Private Link | Azure Private Link enables you to access Azure PaaS services and Azure-hosted customer-owned/partner services over a private endpoint in your virtual network.
Telemetry | VPC Flow Logs | NSG flow logs | Network security group (NSG) flow logs are a feature of Network Watcher that allows you to view information about ingress and egress IP traffic through an NSG.
Telemetry | Firewall Rules Logging | NSG flow logs | Network security group (NSG) flow logs are a feature of Network Watcher that allows you to view information about ingress and egress IP traffic through an NSG.
Telemetry | Operations (formerly Stackdriver) | Azure Monitor | Azure Monitor delivers a comprehensive solution for collecting, analyzing, and acting on telemetry from your cloud and on-premises environments. Log queries help you to fully leverage the value of the data collected in Azure Monitor Logs.
Telemetry | Network Intelligence Center | Azure Network Watcher | Azure Network Watcher provides tools to monitor, diagnose, view metrics, and enable or disable logs for resources in an Azure virtual network.
Other connectivity options | Direct Interconnect, Partner Interconnect, Carrier Peering | S2S, P2S | Point to Site lets you create a secure connection to your virtual network from an individual client computer. Site to Site is a connection between two or more networks, such as a corporate network and a branch office network.

Security
Area | GCP service | Azure service | Description
Authentication and authorization | Cloud IAM | Azure Active Directory | Allows users to securely control access to services and resources while offering data security and protection. Create and manage users and groups and use permissions to allow and deny access to resources.
Authentication and authorization | Cloud IAM | Azure role-based access control (RBAC) | Role-based access control helps you manage who has access to Azure resources, what they can do with those resources, and what areas they have access to.
Authentication and authorization | Resource Manager | Azure Subscription Management | Structure to organize and manage assets in Azure.
Authentication and authorization | Multi-factor Authentication | Azure Active Directory Multi-Factor Authentication | Safeguard access to data and applications while meeting user demand for a simple sign-in process.
Authentication and authorization | Firebase Authentication | Azure Active Directory B2C | A highly available, global identity management service for consumer-facing applications that scales to hundreds of millions of identities.
Encryption | Cloud KMS, Secret Manager | Azure Key Vault | Provides a security solution and works with other services by providing a way to manage, create, and control encryption keys stored in hardware security modules (HSM).
Data at rest encryption | Encryption by default | Azure Storage Service Encryption (encryption by default) | Azure Storage Service Encryption helps you protect and safeguard your data and meet your organizational security and compliance commitments.
Security | Security Command Center | Azure Security Center | An automated security assessment service that improves the security and compliance of applications. Automatically assess applications for vulnerabilities or deviations from best practices.
Security | Web Security Scanner | Azure Security Center | An automated security assessment service that improves the security and compliance of applications. Automatically assess applications for vulnerabilities or deviations from best practices.
Security | Event Threat Detection | Azure Advanced Threat Protection | Detect and investigate advanced attacks on-premises and in the cloud.

Storage
Object storage
GCP service | Azure service | Description
Cloud Storage, Cloud Storage for Firebase | Azure Blob storage | Object storage service, for use cases including cloud applications, content distribution, backup, archiving, disaster recovery, and big data analytics.

Virtual server disks


GCP service | Azure service | Description
Persistent Disk, Local SSD | Azure managed disks | SSD storage optimized for I/O-intensive read/write operations. For use as high-performance Azure virtual machine storage.
File Storage
GCP service | Azure service | Description
Filestore | Azure Files, Azure NetApp Files | File-based storage and hosted NetApp Appliance storage.
Drive Enterprise | OneDrive for Business | Share files from personal devices to the cloud.

Bulk data transfer


GCP service | Azure service | Description
Transfer Appliance | Import/Export | A data transport solution that uses secure disks and appliances to transfer large amounts of data. Also offers data protection during transit.
Transfer Appliance | Azure Data Box | Petabyte- to exabyte-scale data transport solution that uses secure data storage devices to transfer large amounts of data to and from Azure.

Application services
GCP service | Azure service | Description
App Engine | Azure App Service | Managed hosting platform providing easy to use services for deploying and scaling web applications and services.
Apigee | Azure API Management | A turnkey solution for publishing APIs to external and internal consumers.

Miscellaneous
Area | GCP service | Azure service | Description
Workflow | Composer | Azure Logic Apps | Serverless technology for connecting apps, data, and devices anywhere, whether on-premises or in the cloud, for large ecosystems of SaaS and cloud-based connectors.
Enterprise application services | G Suite | Microsoft 365 | Fully integrated cloud service providing communications, email, and document management in the cloud, available on a wide variety of devices.
Gaming | Game Servers | Azure PlayFab | Managed services for hosting dedicated game servers.
Hybrid | Anthos | Azure Arc | For customers who want to simplify complex and distributed environments across on-premises, edge, and multi-cloud, Azure Arc enables deployment of Azure services anywhere and extends Azure management to any infrastructure.
Blockchain | Digital Asset | Azure Blockchain Service | Azure Blockchain Service is a fully managed ledger service that gives users the ability to grow and operate blockchain networks at scale in Azure.
Monitoring | Cloud Monitoring | Application Insights | Service that provides visibility into the performance, uptime, and overall health of cloud-powered applications.
Logging | Cloud Logging | Log Analytics | Service for real-time log management and analysis.

Azure Migration Tools


Type | Azure service | Description
Generic database migration guide (SQL or open-source database) | Database migration guide | Migration of database schema and data from one database format to a specific database technology in the cloud.
SQL Server Database Migration Assessment Tool | Data Migration Assistant | Assessment of a SQL Server database before migration.
SQL Server Database Migration Tool | Data Migration Service | Actual migration of the database to Azure.
Open-source database migration tool, like mysqldump or pg_dump | Database native tool | Actual migration of an open-source database to Azure.
Assessment and migration tool | Azure Migrate | Assesses on-premises workloads for migration to Azure, performs performance-based sizing, and provides cost estimations.
Migration tool | Movere | Movere is a discovery solution that provides the data and insights needed to plan cloud migrations and continuously optimize, monitor, and analyze IT environments with confidence.

More learning
If you are new to Azure, review the interactive Core Cloud Services - Introduction to Azure module on Microsoft
Learn.
Microsoft Azure Well-Architected Framework
11/2/2020 • 9 minutes to read

The Azure Well-Architected Framework is a set of guiding tenets that can be used to improve the quality of a
workload. The framework consists of five pillars of architecture excellence: Cost Optimization, Operational
Excellence, Performance Efficiency, Reliability, and Security.
To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, see the
Microsoft Azure Well-Architected Review.

Pillar | Description
Cost Optimization | Managing costs to maximize the value delivered.
Operational Excellence | Operations processes that keep a system running in production.
Performance Efficiency | The ability of a system to adapt to changes in load.
Reliability | The ability of a system to recover from failures and continue to function.
Security | Protecting applications and data from threats.

Cost Optimization
When you are designing a cloud solution, focus on generating incremental value early. Apply the principles of
Build-Measure-Learn, to accelerate your time to market while avoiding capital-intensive solutions. Use the pay-as-you-go strategy for your architecture, and invest in scaling out rather than delivering a large-investment first version. Consider opportunity costs in your architecture, and the balance between first-mover advantage
versus "fast follow". Use the cost calculators to estimate the initial cost and operational costs. Finally, establish
policies, budgets, and controls that set cost limits for your solution.
Cost guidance
Review cost principles
Develop a cost model
Create budgets and alerts
Review the cost optimization checklist

Operational Excellence
This pillar covers the operations processes that keep an application running in production. Deployments must be
reliable and predictable. They should be automated to reduce the chance of human error. They should be a fast
and routine process, so they don't slow down the release of new features or bug fixes. Equally important, you
must be able to quickly roll back or roll forward if an update has problems.
Monitoring and diagnostics are crucial. Cloud applications run in a remote data-center where you do not have full
control of the infrastructure or, in some cases, the operating system. In a large application, it's not practical to log
into VMs to troubleshoot an issue or sift through log files. With PaaS services, there may not even be a dedicated
VM to log into. Monitoring and diagnostics give insight into the system, so that you know when and where
failures occur. All systems must be observable. Use a common and consistent logging schema that lets you
correlate events across systems.
The monitoring and diagnostics process has several distinct phases:
Instrumentation. Generating the raw data, from application logs, web server logs, diagnostics built into the
Azure platform, and other sources.
Collection and storage. Consolidating the data into one place.
Analysis and diagnosis. To troubleshoot issues and see the overall health.
Visualization and alerts. Using telemetry data to spot trends or alert the operations team.
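As a minimal sketch of a common, consistent logging schema that supports cross-system correlation, the example below emits one JSON object per event using only the Python standard library; the service name and field names are assumptions for illustration.

```python
import json
import logging
import sys
import uuid
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per event so logs from every service share a schema."""

    def format(self, record):
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "orders-api",  # hypothetical service name
            "correlationId": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        })


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The correlation ID would normally flow in from the incoming request.
logger.info("order accepted", extra={"correlation_id": str(uuid.uuid4())})
```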
Use the DevOps checklist to review your design from a management and DevOps standpoint.
Operational excellence guidance
Design patterns for management and monitoring
Best practices: Monitoring and diagnostics

Performance efficiency
Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an
efficient manner. The main ways to achieve this are by using scaling appropriately and implementing PaaS
offerings that have scaling built in.
There are two main ways that an application can scale. Vertical scaling (scaling up) means increasing the capacity
of a resource, for example by using a larger VM size. Horizontal scaling (scaling out) is adding new instances of a
resource, such as VMs or database replicas.
Horizontal scaling has significant advantages over vertical scaling:
True cloud scale. Applications can be designed to run on hundreds or even thousands of nodes, reaching
scales that are not possible on a single node.
Horizontal scale is elastic. You can add more instances if load increases, or remove them during quieter
periods.
Scaling out can be triggered automatically, either on a schedule or in response to changes in load.
Scaling out may be cheaper than scaling up. Running several small VMs can cost less than a single large VM.
Horizontal scaling can also improve resiliency, by adding redundancy. If an instance goes down, the
application keeps running.
An advantage of vertical scaling is that you can do it without making any changes to the application. But at some
point you'll hit a limit, where you can't scale up any more. At that point, any further scaling must be
horizontal.
Horizontal scale must be designed into the system. For example, you can scale out VMs by placing them behind a
load balancer. But each VM in the pool must be able to handle any client request, so the application must be
stateless or store state externally (say, in a distributed cache). Managed PaaS services often have horizontal
scaling and autoscaling built in. The ease of scaling these services is a major advantage of using PaaS services.
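For example, here is a minimal sketch of keeping session state outside the web tier in a distributed cache, so that any instance behind the load balancer can serve any request. It assumes the redis Python package and an Azure Cache for Redis endpoint; the host name, key names, and TTL are placeholders.

```python
import json
from typing import Optional

import redis  # assumes the 'redis' package is installed

# Placeholder endpoint for an Azure Cache for Redis instance (SSL port 6380).
cache = redis.Redis(
    host="my-cache.redis.cache.windows.net",
    port=6380,
    password="<access-key>",
    ssl=True,
)


def save_session(session_id: str, data: dict, ttl_seconds: int = 1800) -> None:
    # Stored centrally, so any stateless web instance can read it later.
    cache.setex(f"session:{session_id}", ttl_seconds, json.dumps(data))


def load_session(session_id: str) -> Optional[dict]:
    raw = cache.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```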
Just adding more instances doesn't mean an application will scale, however. It might simply push the bottleneck
somewhere else. For example, if you scale a web front end to handle more client requests, that might trigger lock
contentions in the database. You would then need to consider additional measures, such as optimistic
concurrency or data partitioning, to enable more throughput to the database.
Always conduct performance and load testing to find these potential bottlenecks. The stateful parts of a system,
such as databases, are the most common cause of bottlenecks, and require careful design to scale horizontally.
Resolving one bottleneck may reveal other bottlenecks elsewhere.
Use the Performance efficiency checklist to review your design from a scalability standpoint.
Performance efficiency guidance
Design patterns for scalability and performance
Best practices: Autoscaling, Background jobs, Caching, CDN, Data partitioning

Reliability
A reliable workload is one that is both resilient and available. Resiliency is the ability of the system to recover
from failures and continue to function. The goal of resiliency is to return the application to a fully functioning
state after a failure occurs. Availability is whether your users can access your workload when they need to.
In traditional application development, there has been a focus on increasing the mean time between failures
(MTBF). Effort was spent trying to prevent the system from failing. In cloud computing, a different mindset is
required, due to several factors:
Distributed systems are complex, and a failure at one point can potentially cascade throughout the system.
Costs for cloud environments are kept low through the use of commodity hardware, so occasional hardware
failures must be expected.
Applications often depend on external services, which may become temporarily unavailable or throttle high-
volume users.
Today's users expect an application to be available 24/7 without ever going offline.
All of these factors mean that cloud applications must be designed to expect occasional failures and recover from
them. Azure has many resiliency features already built into the platform. For example:
Azure Storage, SQL Database, and Cosmos DB all provide built-in data replication, both within a region and
across regions.
Azure managed disks are automatically placed in different storage scale units to limit the effects of hardware
failures.
VMs in an availability set are spread across several fault domains. A fault domain is a group of VMs that share
a common power source and network switch. Spreading VMs across fault domains limits the impact of
physical hardware failures, network outages, or power interruptions.
That said, you still need to build resiliency into your application. Resiliency strategies can be applied at all levels of
the architecture. Some mitigations are more tactical in nature — for example, retrying a remote call after a
transient network failure. Other mitigations are more strategic, such as failing over the entire application to a
secondary region. Tactical mitigations can make a big difference. While it's rare for an entire region to experience
a disruption, transient problems such as network congestion are more common — so target these first. Having
the right monitoring and diagnostics is also important, both to detect failures when they happen, and to find the
root causes.
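As an illustration of the tactical mitigation mentioned above, the sketch below retries a remote call with exponential backoff and jitter; the operation, exception types, and delay values are assumptions rather than a prescribed policy.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.5):
    """Retry a remote call on transient errors with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):  # treat these as transient
            if attempt == max_attempts:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(delay)


# Usage with a hypothetical remote call:
# orders = call_with_retries(lambda: fetch_orders_from_api())
```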
When designing an application to be resilient, you must understand your availability requirements. How much
downtime is acceptable? This is partly a function of cost. How much will potential downtime cost your business?
How much should you invest in making the application highly available?
Reliability guidance
Designing reliable Azure applications
Design patterns for resiliency
Best practices: Transient fault handling, Retry guidance for specific services

Security
Think about security throughout the entire lifecycle of an application, from design and implementation to
deployment and operations. The Azure platform provides protections against a variety of threats, such as network
intrusion and DDoS attacks. But you still need to build security into your application and into your DevOps
processes.
Here are some broad security areas to consider.
Identity management
Consider using Azure Active Directory (Azure AD) to authenticate and authorize users. Azure AD is a fully
managed identity and access management service. You can use it to create domains that exist purely on Azure, or
integrate with your on-premises Active Directory identities. Azure AD also integrates with Office365, Dynamics
CRM Online, and many third-party SaaS applications. For consumer-facing applications, Azure Active Directory
B2C lets users authenticate with their existing social accounts (such as Facebook, Google, or LinkedIn), or create a
new user account that is managed by Azure AD.
If you want to integrate an on-premises Active Directory environment with an Azure network, several approaches
are possible, depending on your requirements. For more information, see our Identity Management reference
architectures.
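As a minimal, hedged sketch of authenticating an application against Azure AD, the example below uses the Microsoft Authentication Library (MSAL) for Python with the client credentials flow; the tenant, client ID, secret, and target scope are placeholders for a registered application.

```python
import msal

# Placeholder identifiers for an application registered in Azure AD.
app = msal.ConfidentialClientApplication(
    client_id="<app-client-id>",
    authority="https://login.microsoftonline.com/<tenant-id>",
    client_credential="<client-secret>",
)

# Client credentials flow: the app authenticates as itself to call a protected API.
result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
if "access_token" in result:
    token = result["access_token"]  # send as a bearer token on API requests
else:
    raise RuntimeError(result.get("error_description", "Token acquisition failed"))
```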
Protecting your infrastructure
Control access to the Azure resources that you deploy. Every Azure subscription has a trust relationship with an
Azure AD tenant. Use role-based access control (RBAC) to grant users within your organization the correct
permissions to Azure resources. Grant access by assigning RBAC role to users or groups at a certain scope. The
scope can be a subscription, a resource group, or a single resource. Audit all changes to infrastructure.
Application security
In general, the security best practices for application development still apply in the cloud. These include things
like using SSL everywhere, protecting against CSRF and XSS attacks, preventing SQL injection attacks, and so on.
Cloud applications often use managed services that have access keys. Never check these into source control.
Consider storing application secrets in Azure Key Vault.
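For example, here is a minimal sketch of reading a secret from Key Vault at runtime instead of embedding it in source control, assuming the azure-identity and azure-keyvault-secrets packages; the vault and secret names are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Access is granted through Azure AD, so no key or password lives in the codebase.
client = SecretClient(
    vault_url="https://<your-vault-name>.vault.azure.net",
    credential=DefaultAzureCredential(),
)

db_connection_string = client.get_secret("db-connection-string").value
```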
Data sovereignty and encryption
Make sure that your data remains in the correct geopolitical zone when using Azure's highly available storage features. Azure's geo-replicated storage uses the concept of a paired region in the same geopolitical region.
Use Key Vault to safeguard cryptographic keys and secrets. By using Key Vault, you can encrypt keys and secrets
by using keys that are protected by hardware security modules (HSMs). Many Azure storage and DB services
support data encryption at rest, including Azure Storage, Azure SQL Database, Azure Synapse Analytics, and
Cosmos DB.
Security resources
Azure Security Center provides integrated security monitoring and policy management across your Azure
subscriptions.
Azure Security Documentation
Microsoft Trust Center
Principles of cost optimization
11/2/2020 • 2 minutes to read

A cost-effective workload is driven by business goals and the return on investment (ROI) while staying within
a given budget. The principles of cost optimization are a series of important considerations that can help
achieve both business objectives and cost justification.
To assess your workload using the tenets found in the Azure Well-Architected Framework, see the Microsoft
Azure Well-Architected Review.
Keep within the cost constraints
Every design choice has cost implications. Before choosing an architectural pattern, Azure service, or a price
model for the service, consider the budget constraints set by the company. As part of design, identify
acceptable boundaries on scale, redundancy, and performance against cost. After estimating the initial cost,
set budgets and alerts at different scopes to measure the cost. One cost driver can be unrestricted resources, which typically need to scale and consume more cost to meet demand.
Aim for scalable costs
A key benefit of the cloud is the ability to scale dynamically. The workload cost should scale linearly with
demand. You can save cost through automatic scaling. Consider the usage metrics and performance to
determine the number of instances. Choose smaller instances for a highly variable workload and scale out to
get the required level of performance, rather than up. This choice will enable you to make your cost
calculations and estimates granular.
Pay for consumption
Adopt a leasing model instead of owning infrastructure. Azure offers many SaaS and PaaS resources that
simplify the overall architecture. The cost of hardware, software, development, operations, security, and data center space is included in the pricing model.
Also, choose pay-as-you-go over fixed pricing. That way, as a consumer, you're charged for only what you
use.
Right resources, right size
Choose the right resources that are aligned with business goals and can handle the performance of the
workload. An inappropriate or misconfigured service can impact cost. For example, building a multi-region
service when the service levels don't require high-availability or geo-redundancy will increase cost without
any reasonable business justification.
Certain infrastructure resources are delivered as fixed-size building blocks. Ensure that these blocks are adequately sized to meet capacity demand and deliver the expected performance without wasting resources.
Monitor and optimize
Treat cost monitoring and optimization as a process, rather than a point-in-time activity. Conduct regular cost
reviews and measure and forecast the capacity needs so that you can provision resources dynamically and
scale with demand. Review the cost management recommendations and take action.
If you're just starting in this process, review Enable success during a cloud adoption journey.
Checklist - Design for cost
11/2/2020 • 2 minutes to read

Use this checklist when designing a cost-effective workload.

Cost model
Capture clear requirements. Gather detailed information about the business workflow and about regulatory, security, and availability requirements.
Capture requirements
Estimate the initial cost. Use tools such as the Azure pricing calculator to assess the cost of the services you plan to use in the workload. Use Azure Migrate and the Microsoft Azure Total Cost of Ownership (TCO) Calculator for migration projects. Accurately reflect the cost associated with the right storage type. Add hidden costs, such as
networking cost for large data download.
Estimate the initial cost
Define policies for the cost constraints defined by the organization. Understand the constraints and define acceptable boundaries for the quality pillars of scale, availability, and security.
Consider the cost constraints
Identify shared assets. Evaluate the business areas where you can use shared resources. Review the billing meters and build chargeback reports per consumer to identify metered costs for shared cloud services.
Create a structured view of the organization in the cloud
Plan a governance strategy. Plan for cost controls through Azure Policy. Use resource tags so that custom cost reports can be created. Define budgets and alerts to send notifications when certain thresholds are
reached.
Governance

Architecture
Check the cost of resources in various Azure geographic regions. Check your egress and ingress costs, within regions and across regions. Only deploy to multiple regions if your service levels require it for
either availability or geo-distribution.
Azure regions
Choose a subscription that is appropriate for the workload. Azure Dev/Test subscription types are
suitable for experimental or non-production workloads and have lower prices on some Azure services such
as specific VM sizes. If you can commit to one or three years, consider subscriptions and offer types that
support Azure Reservations.
Subscription and offer type
Choose the right resources to handle the performance of the workload. Understand the usage meters and the
number of meters for each resource in the workload. Consider tradeoffs over time. For example, cheaper
virtual machines may initially indicate a lower cost but can be more expensive over time to maintain a
certain performance level. Be clear about the billing model of third-party services.
Azure resources
Use cost alerts to monitor usage and spending
Compare consumption-based pricing with pre-provisioned cost. Establish baseline cost by
considering the peaks and the frequency of peaks when analyzing performance.
Consumption and fixed cost models
Use proof-of-concept deployments. The Azure Architecture Center has many reference architectures and
implementations that can serve as a starting point. The Azure Tech Community has architecture and services
forums.
Choose managed services when possible. With PaaS and SaaS options, the cost of running and
maintaining the infrastructure is included in the service price.
Managed services
Develop a cost model
11/2/2020 • 6 minutes to read

Cost modeling is an exercise where you create logical groups of cloud resources that are mapped to the
organization's hierarchy and then estimate costs for those groups. The goal of cost modeling is to estimate the
overall cost of the organization in the cloud.
1. Understand how your responsibilities align with your organization
Map the organization's needs to logical groupings offered by cloud services. This way the business leaders
of the company get a clear view of the cloud services and how they're controlled.
2. Capture clear requirements
Start your planning with a careful enumeration of requirements. From the high-level requirements, narrow
down each requirement before starting on the design of the solution.
3. Consider the cost constraints
Evaluate the budget constraints on each business unit and determine the governance policies in Azure to
lower cost by reducing wastage, overprovisioning, or expensive provisioning of resources.
4. Consider tradeoffs
Optimal design doesn't equate to a lowest-cost design.
As requirements are prioritized, cost can be adjusted. Expect a series of tradeoffs in the areas that you want
to optimize, such as security, scalability, resilience, and operability. If the cost to address the challenges in
those areas is high, stakeholders will look for alternate options to reduce cost. There might be risky choices made in favor of a cheaper solution.
5. Derive functional requirements from high-level goals
Break down the high-level goals into functional requirements for the components of the solution. Each
requirement must be based on realistic metrics to estimate the actual cost of the workload.
6. Consider the billing model for Azure resources
Azure services are offered at consumption-based prices, where you're charged only for what you use. There are also fixed-price options, where you're charged for provisioned resources.
Most services are priced based on units of size, amount of data, or operations. Understand the meters that
are used to track usage. For more information, see Azure resources.
At the end of this exercise, you should have identified the lower and upper limits on cost and set budgets for the
workload. Azure allows you to create and manage budgets in Azure Cost Management. For information, see
Quickstart: Create a budget with an Azure Resource Manager template.
Have frequent and clear communication with the stakeholders

In the initial stages, communication between stakeholders is vital. The overall team must align on the requirements so that
overall business goals are met. If not, the entire solution might be at risk.
For instance, the development team indicates that the resilience of a monthly batch-processing job is low. They might request
the job to work as a single node without scaling capabilities. This request opposes the architect’s recommendation to
automatically scale out, and route requests to worker nodes.
This type of disagreement can introduce a point of failure into the system, risking the Service Level Agreement, and cause an
increase in operational cost.

Organization structure
Map the organization's needs to logical groupings offered by cloud services. This way the business leaders of the
company get a clear view of the cloud services and how they're controlled.
1. Understand how your workload fits into cost optimization across the portfolio of cloud workloads.
If you are working on a workload that fits into a broader portfolio of workloads, see the CAF get started guide to document foundational decisions. That guide will help your team capture the broader portfolio view of business units, resource organization, responsibilities, and a view of the long-term portfolio.
If cost optimization is being executed by a central team across the portfolio, see the CAF guide to get started managing enterprise costs.
2. Encourage a culture of democratized cost optimization decisions
As a workload owner, you can have a measurable impact on cost optimization. There are other roles in the
organization which can help improve cost management activities. To help embed the pillar of cost optimization into
your organization beyond your workload team, see the CAF article: Build a cost-conscious organization.
3. Reduce costs through shared cloud services and landing zones
If your workload has dependencies on shared assets like Active Directory, network connectivity, security devices, or
other services that are also used by other workloads, encourage your central IT organization to provide those
services through a centrally managed landing zone to reduce duplicate costs. See the CAF article: get started with
centralized design & configuration to get started with the development of landing zones.
4. Calculate the ROI by understanding what is included in each grouping and what isn't.
Which aspects of the hierarchy are covered by cloud services?

The Azure pricing model is based on expenses incurred for the service. Expenses include hardware, software,
development, operations, security, and data center space to name a few. Evaluate the cost benefit of shifting
away from owned technology infrastructure to leased technology solutions.
5. Identify scenarios where you can use shared cloud services to lower cost.
Can some services be shared by other consumers?

Identify areas where a service or an application environment can be shared with other business units.
Identify resources that can be used as shared services and review their billing meters. Examples include a
virtual network and its hybrid connectivity or a shared App Service Environment (ASE). If the meter data can't be split across consumers, decide on custom solutions to allocate proportional costs. Move shared
services to dedicated resources for consumers for cost reporting.
Build chargeback reports per consumer to identify metered costs for shared cloud services. Aim
for granular reports to understand which workload is consuming what amount of the shared cloud
service.

Next step
Capture cost requirements

Cost constraints
Here are some considerations for determining the governance policies that can assist with cost management.
What are the budget constraints set by the company for each business unit?
What are policies for the budget alert levels and associated actions?
Identify acceptable boundaries for scale, redundancy, and performance against cost.
Assess the limits for security. Don't compromise on security. Premium cloud security features can drive the cost
up. It’s not necessary to overinvest. Instead use the cost profile to drive a realistic threat profile.
Identify unrestricted resources. These resources typically need to scale and consume more cost to meet
demand.
Next step
Consider tradeoffs

Functional requirements
Break down high-level goals into functional requirements. For each of those requirements, define metrics to
calculate cost estimates accurately. Cloud services are priced based on performance, features, and locations. When
defining these metrics, identify acceptable boundaries of performance, scale, resilience, and security. Start by
expressing your goals in the number of business transactions over time, breaking them down into fine-grained requirements.
What resources are needed for a single transaction, and how many transactions are done per second,
day, year?

Start with a fixed cost of operations and a rough estimate of transaction volume to work out a cost-per-
transaction to establish a baseline. Consider the difference between cost models based on fixed, static provisioning of services and more variable costs based on autoscaling, such as serverless technologies.
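As a purely illustrative sketch of the baseline calculation described above, the numbers below are invented; the point is only to show how a fixed monthly cost and a transaction-volume estimate combine into a cost-per-transaction figure.

```python
# Hypothetical figures purely to illustrate a cost-per-transaction baseline.
monthly_fixed_cost_usd = 1200.00      # e.g. pre-provisioned app tier plus database
transactions_per_second = 50
seconds_per_month = 60 * 60 * 24 * 30

monthly_transactions = transactions_per_second * seconds_per_month
cost_per_transaction = monthly_fixed_cost_usd / monthly_transactions

print(f"{monthly_transactions:,} transactions/month "
      f"-> ${cost_per_transaction:.6f} per transaction")
```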

Use T-shirt sizes for choosing SKUs

When choosing options for services, start with an abstract representation of size. For example, with a T-shirt size approach, small, medium, and large sizes can represent an on-demand virtual machine instead of specific virtual machine or managed disk SKU sizes.
Abstract sizes give you an idea of the expected performance and cost requirements. It sets the tone for various
consumption units that are used to measure compute resources for performance. Also, it helps in understanding
the on-demand consumption model of the services.
For more information, see Estimate the initial cost.
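A minimal sketch of this abstraction is shown below; the vCPU, memory, and hourly figures are invented placeholders, not Azure prices, and exist only to show how T-shirt sizes can feed a rough monthly estimate.

```python
# Hypothetical T-shirt sizes mapped to abstract capacity and invented hourly rates.
TSHIRT_SIZES = {
    "small":  {"vcpus": 2, "memory_gb": 8,  "est_hourly_usd": 0.10},
    "medium": {"vcpus": 4, "memory_gb": 16, "est_hourly_usd": 0.20},
    "large":  {"vcpus": 8, "memory_gb": 32, "est_hourly_usd": 0.40},
}


def estimate_monthly(size: str, instance_count: int, hours: int = 730) -> float:
    """Rough monthly estimate for a given size and instance count."""
    return TSHIRT_SIZES[size]["est_hourly_usd"] * instance_count * hours


print(f"medium x 3 instances: ${estimate_monthly('medium', 3):,.2f}/month")
```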
Next steps
Estimate the initial cost
Capture cost requirements
11/2/2020 • 4 minutes to read

Start your planning with a careful enumeration of requirements. Make sure the needs of the stakeholders are
addressed. For strong alignment with business goals, those areas must be defined by the stakeholders and
shouldn’t be collected from a vendor.
Capture requirements at these levels:
Business workflow
Compliance and regulatory
Security
Availability
What do you aim to achieve by building your architecture in the cloud?

Here are some common answers.


Take advantage of features only available in the cloud, such as intelligent security systems, region footprint, or resiliency features.
Use the on-demand nature of the cloud to meet peak or seasonal requirements, then release that cost investment when it is no longer needed.
Consolidate physical systems.
Retire on-premises infrastructure.
Reduce hardware or data center management costs.
Increase performance or processing capabilities, through services like big data and machine learning.
Meet regulatory considerations, including taking advantage of certified infrastructure.
Narrow down each requirement before you start the design of the workload. Expect the requirements to change
over time as the solution is deployed and optimized.

Landing zone
Consider the cost implications of the geographic region to which the landing zone is deployed.
The landing zone consists of the subscription and resource group, in which your cloud infrastructure components
exist. This zone impacts the overall cost. Consider the tradeoffs. For example, there are additional costs for network
ingress and egress for cross-zonal traffic. For more information, see Azure regions and Azure resources.
For information about landing zone for the entire organization, see CAF: Implement landing zone best practices.

Security
Security is one of the most important aspects of any architecture. Security measures protect the valuable data of
the organization. It provides confidentiality, integrity, and availability assurances against attacks and misuse of the
systems.
Factor in the cost of security controls, such as authentication, MFA, conditional access, information protection,
JIT/PIM, and premium Azure AD features. Those options will drive up the cost.
For security considerations, see the Security Pillar.
Business continuity
Does the application have a Service Level Agreement that it must meet?

Factor in the cost when you create high availability and disaster recovery strategies.
Overall Service Level Agreement (SLA), Recovery Time Objective (RTO), and Recovery Point Objective (RPO) may
drive towards expensive design choices in order to support higher availability requirements. For example, a choice
might be to host the application across regions, which is costlier than single region but supports high availability.
If your service SLAs, RTOs, and RPOs allow, consider cheaper options. For instance, pre-built automation scripts and packages that would redeploy the disaster recovery components of the solution from the ground up if a disaster occurs. Alternatively, use Azure platform-managed replication. Both options can lower cost because
fewer cloud services are pre-deployed and managed, reducing wastage.
In general, if the cost of high availability exceeds the cost of application downtime, then you could be over
engineering the high availability strategy. Conversely, if the cost of high availability is less than the cost of a
reasonable period of downtime, you may need to invest more.
Suppose the downtime costs are relatively low, you can save by using recovery from your backup and disaster
recovery processes. If the downtime is likely to cost a significant amount per hour, then invest more in the high
availability and disaster recovery of the service. It's a three-way tradeoff between cost of service provision, the
availability requirements, and the organization's response to risk.

Application lifespan
Does your service run seasonally or follow long-term patterns?

For long-running applications, consider using Azure Reservations if you can commit to a one-year or three-year term. VM reservations can reduce cost by 60% or more when compared to pay-as-you-go prices.
Reservation is still an operational expense with all the corresponding benefits. Monitor the cost on workloads that
have been running in the cloud for an extended period to forecast the reserved instance sizes that are needed. For
information about optimization, see Reserved instances.
If your application runs intermittently, consider using Azure Functions in a consumption plan so you only pay for
compute resources you use.

Automation opportunities
Is it a business requirement to have the service be available 24x7?

You may not have a business goal to leave the service running all the time. Doing so will incur a consistent cost.
Can you save by shutting down the service or scaling it down outside normal business hours? If so, Azure has a rich set of APIs, SDKs, and automation technology that uses DevOps and traditional automation principles. Those technologies ensure that the workload is available at an appropriate level of scale as needed.
Repurpose some compute and data resources for other tasks that run out of regular business hours. See the
Compute Resource Consolidation pattern and consider containers or elastic pools for more compute and data
cost flexibility.
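For example, here is a minimal sketch of the kind of off-hours automation described above, assuming the azure-identity and azure-mgmt-compute packages; the subscription ID, resource groups, and VM names are placeholders, and in practice the script would run on a schedule (for example from Azure Automation or a timer-triggered function).

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

credential = DefaultAzureCredential()
compute = ComputeManagementClient(credential, "<subscription-id>")

# Hypothetical list of dev/test VMs to stop outside business hours.
OFF_HOURS_VMS = [("rg-dev", "vm-build-agent"), ("rg-dev", "vm-test-runner")]

for resource_group, vm_name in OFF_HOURS_VMS:
    # Deallocated VMs stop accruing compute charges (disk storage is still billed).
    poller = compute.virtual_machines.begin_deallocate(resource_group, vm_name)
    poller.wait()
    print(f"Deallocated {resource_group}/{vm_name}")
```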

Budget for staff education


Keep the technical staff up to date in cloud management skills so that the invested services are optimally used.
Consider using resources such as FastTrack for Azure and Microsoft Learn to onboard the staff. Those resources
provide engineering investments at no cost to customers.
Identify training requirements and costs for cloud migration projects, application development, and
architecture refinement.
Invest in key areas, such as identity management, security configuration, systems monitoring, and automation.
Give the staff access to training and relevant announcements. This way, they can be aware of new cloud
capabilities and updates.
Provide opportunities to get real-world experience of customers across the globe through conferences, specific
cloud training, and passing dedicated Microsoft Exams (AZ, MS, MB, etc.).

Standardization
Ensure that your cloud environments are integrated into any IT operations processes. Those operations include
user or application access provisioning, incident response, and disaster recovery. That mapping may uncover areas
where additional cloud cost is needed.

Next step
Determine the cost constraints
Azure regions
11/2/2020 • 2 minutes to read

The cost of an Azure service can vary between locations based on demand and local infrastructure costs. Consider all
these geographical areas when choosing the location of your resources to estimate costs.

Term | Description
Azure region | A set of datacenters deployed within a latency-defined perimeter and connected through a dedicated regional low-latency network.
Availability zone | A unique physical location within a region with independent power, network, and cooling to be tolerant to datacenter failures through redundancy and logical isolation of services.
Billing zone | A geographic collection of regions that is used for billing.
Location | A region or a zone within a region. Azure has datacenters all over the world and each datacenter is placed in a location.
Landing zone | The ultimate location of your cloud solution or the landing zone, typically consisting of logical containers such as a subscription and resource group, in which your cloud infrastructure components exist.

The complete list of Azure geographies, regions, and locations, is shown in Azure global infrastructure.
To see availability of a product by region, see Products available by region.

Tradeoff
The savings from locating resources in a cheaper region should not be negated by the cost of network ingress and egress or by degraded application performance because of increased latency.
An application hosted in a single region may cost less than an application hosted across regions because of
replication costs or the need for extra nodes.

Compliance and regulatory


Azure also offers differentiated cloud regions for specific security and compliance requirements.
Does your solution need specific levels of security and compliance?

If your solution needs to follow certain government regulations, the cost will be higher. Otherwise, you can meet less rigid compliance requirements through Azure Policy, which is free.
Certain Azure regions are built specifically for high compliance and security needs. For example, with Azure
Government (USA) you're given an isolated instance of Azure. Azure Germany has datacenters that meet privacy
certifications. These specialized regions have higher cost.
Regulatory requirements can dictate restrictions on data residency. These requirements may impact your data
replication options for resiliency and redundancy.
Traffic across billing zones and regions
Cross-regional traffic and cross-zonal traffic incur additional costs.
Is the application critical enough to have the footprint of the resources cross zones and/or cross regions?

Bandwidth refers to data moving in and out of Azure datacenters. Inbound data transfers (data going into Azure
datacenters) are free for most services. For outbound data transfers, (data going out of Azure datacenters) the data
transfer pricing is based on billing zones. For more information, see Bandwidth Pricing Details.
Suppose, you want to build a cost-effective solution by provisioning resources in locations that offer the lowest
prices. The dependent resources and their users are located in different parts of the world. In this case, data
transfer between locations will add cost if there are meters tracking the volume of data moving across locations.
Any savings from choosing the cheapest location could be offset by the additional cost of transferring data.
The cross-regional and cross-zone additional costs do not apply to global services, such as Azure Active
Directory.
Not all Azure services support zones and not all regions in Azure support zones.

Before choosing a location, consider whether the application is important enough to justify the cost of having resources across zones and/or across regions. For non-mission-critical applications, such as developer or test workloads, consider keeping the solution and its dependencies in a single region or single zone to leverage the advantages of choosing the lower-cost region.
Azure resources
11/2/2020 • 3 minutes to read

Just like on-premises equipment, there are several elements that affect monthly costs when using Azure services.

Usage meters for resources


Most services are priced based on units of size, amount of data, or operations. When you provision an Azure
resource, Azure creates metered instances for that resource. The meters track the resources' usage and generate a
usage record that is used to calculate your bill.
For example, you provision a virtual machine in Azure. Some meters that track its usage include: Compute Hours,
IP Address Hours, Data Transfer In, Data Transfer Out, Standard Managed Disk, Standard Managed Disk Operations,
Standard IO-Disk, Standard IO-Block Blob Read, Standard IO-Block Blob Write, Standard IO-Block Blob Delete.
How is the usage tracked for each resource in the workload?

For each Azure resource, have a clear understanding of the meters that track usage and the number of meters
associated with the resource tier. The meters correlate to several billable units. Those units are charged to the
account for each billing period. The rate per billable unit depends on the resource tier.
A resource tier impacts pricing because each tier offers levels of features such as performance or availability. For
example, a Standard HDD hard disk is cheaper than a Premium SSD hard disk.

Start with a lower resource tier then scale the resource up as needed. Growing a service with little to no
downtime is easier when compared to downscaling a service. Downscaling usually requires deprovisioning or
downtime. In general, choose scaling out instead of scaling up.

As part of the requirements, consider the metrics for each resource and build your alerts on baseline thresholds
for each metric. The alerts can be used to fine-tune the resources. For more information, see Respond to cost alert.
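To make the meter idea concrete, the sketch below sums invented per-meter charges for a single VM over a month; the meter names mirror the examples above, but every rate is a placeholder, not an Azure price.

```python
# Illustrative only: meter names mirror the examples above; all rates are invented.
hours_per_month = 730
vm_monthly_meters_usd = {
    "Compute Hours": 0.10 * hours_per_month,
    "Standard Managed Disk": 20.00,
    "Data Transfer Out (100 GB)": 100 * 0.08,
    "IP Address Hours": 0.004 * hours_per_month,
}

for meter, cost in vm_monthly_meters_usd.items():
    print(f"{meter:<28} ${cost:8.2f}")
print(f"{'Estimated monthly total':<28} ${sum(vm_monthly_meters_usd.values()):8.2f}")
```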

Allocated usage for the resource


Another way to look at pricing is the allocated usage.
Suppose you deallocate the virtual machine. You won't be billed for compute hours, I/O reads or writes, and other compute meters because the virtual machine is not running and has no compute resources allocated. However, you'll be charged for storage costs for the disks.
Here are some considerations:
The meters and pricing vary per product and often have different pricing tiers based on the location, size, or
capacity of the resource.
Cost associated with infrastructure is kept low with commodity hardware.
Failures cannot be prevented, but the effects of failure can be minimized through design choices. The resiliency features are factored into the price of a service and its features.
Here are some examples:
Azure Disk
Start with a small size in GB for a managed disk instead of a pay-per-GB model. It's cost-effective because cost is incurred on the allocated storage.
ExpressRoute
Start with a smaller bandwidth circuit and scale up as needed.
Compute infrastructure
Deploy an additional smaller instance of compute alongside a smaller unit in parallel. That approach is more cost
effective in comparison to restarting an instance to scale up.
For details about how billing works, see Azure Cost Management + Billing documentation.

Subscription and offer type


What is the subscription and offer type in which resources are created?

Azure usage rates and billing periods can vary depending on the subscription and offer type. Some subscription
types also include usage allowances or lower prices. For example, Azure Dev/Test subscription offers lower prices
on Azure services such as specific VM sizes, PaaS web apps, and VM images with pre-installed software. Visual Studio subscribers obtain, as part of their benefits, access to Azure subscriptions with monthly allowances.
For information about the subscription offers, see Microsoft Azure Offer Details.
As you decide the offer type, consider the types that support Azure Reservations. With reservations, you prepay
for reserved capacity at a discount. Reservations are suitable for workloads that have a long-term usage pattern.
Combining the offer type with reservations can significantly lower the cost. For information about subscription
and offer types that are eligible for reservations, see Discounted subscription and offer types.

Billing structure for services in Azure Marketplace


Are you considering third-party services through Azure Marketplace?

Azure Marketplace offers both Azure products and services from third-party vendors. Different billing
structures apply to each of those categories. The billing structures range from free and pay-as-you-go to a one-time
purchase fee or a managed offering with monthly support and licensing costs.
Governance

Governance can assist with cost management. This work will benefit your ongoing cost review process and will
offer a level of protection for new resources.

Understand how centralized governance functions can support your team


Centralized governance can relieve some of the burden related to on-going cost monitoring and optimization.
However, that is an augmentation of the workload team's responsibilities, not a replacement. For an understanding
of how centralized cloud governance teams operate, see the Cloud Adoption Framework's Govern methodology.
For more detailed information on cost optimization, see the section on Cost Management discipline.
For an example of the types of guardrails provided by a governance team, see the narrative for improving the
cost management discipline, which includes examples of suggested tags and policies for improving cost
governance.
If your team isn't supported by centralized governance teams, see Cloud governance function to better
understand the types of activities your team may need to consider including in each sprint.

Follow organizational policies that define cost boundaries


Use policies to ensure compliance with the identified cost boundaries. Policies also eliminate the need for manual
resource approval and speed up the total provisioning time.
Azure Policy can set rules on management groups, subscriptions, and resource groups. The policies control which cloud
service resource SKU sizes, replication features, and locations are allowed. Use policies to prevent provisioning
of expensive resources. Identify the built-in Azure policies that can help lower cost. For additional control, create
custom policies.
For more information, see Create management groups for resource organization and management.
To control which groups can manage resources in the subscription, see Built-in roles for Azure resources.
Set limits or quotas to prevent unexpected costs. For more information, see Azure subscription and service limits,
quotas, and constraints.

Enforce resource tagging


Use tags on resources and resource groups to track the incurred costs. Identify the service meters that can't be
tagged or viewed in the cost analysis tool in Azure portal.
The advantages of tagging include:
The cost can be reported to an owner, an application, a business department, or a project initiative. This is
useful because the overall cost can span multiple resources, locations, and subscriptions.
Filter information. This filtering can be used in the cost analysis tool in the Azure portal to get granular
reports.
There are some limitations:
Tags can't be inherited. If you tag a resource group, the resources under that group won't inherit that tag. This
requires additional effort to look up the parent resource group's tags when you want to report on related
resources.
Not all Azure resources can be tagged, and not all taggable resources in Azure are accounted for in the Azure
cost analysis tool.
Estimate the initial cost

It's difficult to attribute costs before deploying a workload to the cloud. If you use methods for on-premises
estimation or directly map on-premises assets to cloud resources, the estimate will be inaccurate. For example, if you
build your own datacenter, your costs may appear comparable to the cloud. Most on-premises estimates don't account
for costs like cooling, electricity, IT and facilities labor, security, and disaster recovery.
Here are some best practices:
Use proof-of-concept deployments to help refine cost estimates.
Choose the right resources that can handle the performance of the workload. For example, cheaper virtual
machines may initially indicate a lower cost but can be more expensive eventually to maintain a certain
performance level.
Accurately reflect the cost associated with the right storage type.
Add hidden costs, such as the networking cost of large data downloads.

Migration workloads
Quantify the cost of running your business in Azure by calculating Total Cost of Ownership (TCO) and Return on
Investment (ROI). Compare those metrics to existing on-premises equivalents.
It’s difficult to attribute costs before migrating to the cloud.
Using on-premises calculation may not accurately reflect the cost of cloud resources. Here are some challenges:
On-premises TCO may not accurately account for hidden expenses. These expenses include under-utilization of
purchased hardware or network maintenance costs including labor and equipment failure.
Cloud TCO may not accurately account for a drop in the organization's operational labor hours. The cloud
provider's infrastructure, platform management services, and additional operational efficiencies are included in
the cloud service pricing. Especially at a smaller scale, the cloud provider's services don't necessarily result in a
reduction of IT labor head count.
ROI may not accurately account for new organizational benefits because of cloud capabilities. It's hard to
quantify improved collaboration, reduced time to service customers, and fast scaling with minimal or no
downtime.
ROI may not accurately account for business process re-engineering needed to fully adopt cloud benefits. In
some cases, this re-engineering may not occur at all, leaving an organization in a state where new technology is
used inefficiently.
Azure provides these tools to determine cost.
Microsoft Azure Total Cost of Ownership (TCO) Calculator to reflect all costs.

For migration projects, the TCO Calculator may assist, as it pre-populates some common cost but allows
you to modify the cost assumptions.

Azure pricing calculator to assess cost of the services you plan to use in your solution.
Azure Migrate to evaluate your organization's current workloads in on-premises datacenters. It suggests Azure
replacement solutions, such as virtual machine sizes, based on your workload. It also provides a cost estimate.
Example estimate for a microservices workload
Let's consider this scenario as an example. We'll use the Azure Pricing calculator to estimate the initial cost before
the workload is deployed. The cost is calculated per month or for 730 hours.
In this example, we've chosen the microservices pattern. As the container orchestrator, one option could be
Azure Kubernetes Service (AKS), which manages a cluster of pods. We choose the NGINX ingress controller because it's a
well-known controller for such workloads.

The example is based on the current price and is subject to change. The calculation shown is for information
purposes only.

Compute
For AKS, there's no charge for cluster management.
For AKS agent nodes, there are many instance sizes and SKU options. Our example workload is expected to follow
a long-running pattern, and we can commit to three years. So, an instance that is eligible for reserved instances
would be a good choice. We can lower the cost by choosing the 3-year reserved plan.
The workload needs two virtual machines: one for the backend services and the other for the utility services.
The B12MS instance with 2 virtual machines is sufficient for this initial estimation. We can lower cost by choosing
reserved instances.
Estimated Total: $327.17 per month ($11,778.17 paid upfront for the 3-year term).
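The monthly figure is the 3-year upfront payment averaged over the 36-month term. A quick sketch of that arithmetic, using the numbers above:

```python
# Spread the 3-year reserved-instance upfront payment over 36 months
# to get the average monthly figure quoted above.
upfront_payment = 11778.17          # $ charged upfront for the 3-year term
term_months = 3 * 12

average_per_month = upfront_payment / term_months
print(f"Average per month: ${average_per_month:.2f}")   # ~ $327.17
```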
Application gateway
For this scenario, we consider the Standard_v2 tier of Azure Application Gateway because of its autoscaling
capabilities and performance benefits. We also choose consumption-based pricing, which is calculated by capacity
units (CUs). Each capacity unit is calculated based on compute, persistent connections, or throughput. For the
Standard_v2 SKU, each compute unit can handle approximately 50 connections per second with an RSA 2048-bit key
TLS certificate. For this workload, we estimate 10 capacity units.
Estimated Total: $248.64 per month.
Load balancer
The NGINX ingress controller deploys a load balancer that routes internet traffic to the ingress. Approximately 15 load
balancer rules are needed. NAT rules are free. The main cost driver is the amount of data processed, inbound and
outbound, independent of the number of rules. We estimate traffic of 1 TB (inbound and outbound).
Estimated Total: $96.37 per month.
Bandwidth
We estimate 2 TB of outbound traffic. The first 5 GB/month are free in Zone 1 (Zone 1 includes North America,
Europe, and Australia). Between 5 GB and 10 TB per month, traffic is charged at $0.087 per GB.
Estimated Total: $177.74 per month.
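A quick sketch of the bandwidth arithmetic, using the rates quoted above (first 5 GB free, then $0.087 per GB):

```python
# Outbound bandwidth estimate: first 5 GB free, then $0.087/GB (Zone 1 rate quoted above).
outbound_gb = 2 * 1024          # 2 TB expressed in GB
free_gb = 5
rate_per_gb = 0.087

billable_gb = max(outbound_gb - free_gb, 0)
print(f"Estimated total: ${billable_gb * rate_per_gb:.2f} per month")  # ~ $177.74
```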
External data source
Because of the schema-on-read nature of the data handled by the workload, we choose Azure Cosmos DB as the
external data store. By using the Cosmos DB capacity calculator, we can calculate the throughput to reserve.
Cost variables
For lower latency, in this scenario we enable geo-replication by using the multi-region writes feature. By
default, Cosmos DB uses one region for writes and the rest for reads.
We keep the default choices: the Session consistency model and the Automatic indexing policy. Automatic indexing
makes Cosmos DB index all properties of all items for flexible and efficient queries. A custom indexing
policy lets you include or exclude properties from the index to lower write RU consumption and storage size, so
uploading a custom indexing policy can reduce cost.
Total data stored isn't a significant cost driver; here it's set to 500 GB.
The throughput is variable, indicating peaks. The percentage of time at peak is set to 10%.
The average item size is 90 KB. In the capacity calculator, you can upload sample JSON files with
your document's data structure, and specify the average document size and the number of reads/writes per second. These
variables have the largest impact on cost because they're used to calculate the throughput. The throughput
values are shown in the image.
Now, we use those values in the Azure Pricing calculator.
The average throughput based on these settings is 20,000 RU/s, which is the minimum throughput required for
a 3-year reserved capacity plan.
Here is the total cost for three years using the reserved plan:
$1,635.20 Average per month ($58,867.20 charged upfront)
You save $700.00 by choosing the 3-year reserved capacity over the pay-as-you-go price.
CI/CD pipelines
With Azure DevOps, the Basic Plan is included for Visual Studio Enterprise, Professional, Test Professional, and MSDN
Platforms subscribers. There's no charge for stakeholders to add or edit work items and bugs, or to view dashboards, backlogs,
and Kanban boards.
A Basic plan license for five users is free.
Additional services
For Microsoft-hosted pipelines, the Free tier includes one parallel CI/CD job with 1,800 minutes (30 hours) per
month. Alternatively, you can select the Paid tier and get one parallel CI/CD job ($40.00); in this tier, each parallel
CI/CD job includes unlimited minutes.

For this stage of cost estimation, self-hosted pipelines aren't required because the workload's build process doesn't use
custom software that isn't included in the Microsoft-hosted option.

Azure Artifacts is a service where you can create package feeds to publish and consume Maven, npm, NuGet,
Python, and universal packages. Azure Artifacts is billed on a consumption basis and is free up to 2 GB of
storage. For this scenario, we estimate 56 GB of artifacts ($56.00).
Azure DevOps offers a cloud-based solution for load testing your apps. Load tests are measured and billed in
virtual user minutes (VUMs). For this scenario, we estimate 200,000 VUMs ($72.00).
Estimated Total: $168.00 per month
Managed services

Look for areas in the architecture where it may be natural to incorporate platform-as-a-service (PaaS) options.
These include caching, queues, and data storage. PaaS reduces time and cost of managing servers, storage,
networking, and other application infrastructure.
With PaaS, the infrastructure cost is included in the pricing model of the service. For example, you can provision a
lower SKU virtual machine as a jumpbox. There are additional costs for storage and managing a separate server.
You also need to configure a public IP on the virtual machine, which is not recommended. A managed service such
as Azure Bastion takes into consideration all those costs and offers better security.
Azure provides a wide range of PaaS resources. Here are some examples of when you might consider PaaS options:

TASK                                         USE

Host a web server                            Azure App Service instead of setting up IIS servers.

Indexing and querying heterogeneous data     Azure Cognitive Search instead of Elasticsearch.

Host a database server                       Azure offers many SQL and no-SQL options, such as Azure SQL Database and Azure Cosmos DB.

Secure access to virtual machines            Azure Bastion instead of virtual machines as jump boxes.

Network security                             Azure Firewall instead of virtual network appliances.

For more information, see Use platform as a service (PaaS) options.


Reference architecture
To see an implementation that provides better security and lowers cost through PaaS services, see Network DMZ
between Azure and an on-premises datacenter.
Consumption and fixed cost models

The common pricing options for Azure services are:


Consumption-based price - You are charged for only what you use. This model is also known as the Pay-As-
You-Go rate.
Fixed price - You provision resources and are charged for those instances whether or not they are used.
A common way to estimate cost is to consider the workload at peak throughput. Under consistently high
utilization, consumption-based pricing can be less efficient for estimating baseline costs when compared to the
equivalent provisioned pricing. PaaS and serverless technologies can help you understand the economic cutoff
point for consumption-based pricing.
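One way to find that cutoff point is to compare the consumption-based charge at different utilization levels against the fixed, provisioned price. The sketch below uses hypothetical rates only, to show the shape of the comparison.

```python
# Hypothetical break-even between consumption-based and fixed (provisioned) pricing.
# The rates below are placeholders, not Azure list prices.
fixed_price_per_month = 300.00        # provisioned instance, charged whether used or not
consumption_rate_per_hour = 0.60      # pay-as-you-go rate while the workload runs
hours_in_month = 730

for utilization in (0.10, 0.25, 0.50, 0.75, 1.00):
    consumption_cost = utilization * hours_in_month * consumption_rate_per_hour
    cheaper = "consumption" if consumption_cost < fixed_price_per_month else "fixed"
    print(f"{utilization:>4.0%} utilization: consumption ${consumption_cost:7.2f} "
          f"vs fixed ${fixed_price_per_month:.2f} -> {cheaper} is cheaper")
```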
Observe the difference between cost models based on fixed, static provisioning of services and the more variable costs
based on autoscaling and serverless technologies.

Start with a fixed minimum level of performance and then use architectural patterns (such as Queue-Based Load
Leveling) and autoscaling of services. With this approach, the peaks can be smoothed out into a more consistent
flow of compute and data. This approach should temporarily extend your burst performance when the service is
under sustained load. If cost is an important factor but you need to maintain service availability under burst
workloads, use the Throttling pattern to maintain quality of service under load.
Compare and contrast the options and understand how to provision workloads that can potentially switch between
the two models. The model will be a tradeoff between scalability and predictability. Ideally in the architecture, blend
the two aspects.
Provisioning cloud resources to optimize cost

Deploying the cloud resources of a workload is known as provisioning.


Use the Azure Pricing calculator to estimate the cost of your SKU choices. This section describes some
considerations. This isn't meant to be an exhaustive list, but a subset of options.
AI + Machine Learning
Big data analytics
Compute
Networking
Data stores
AI + Machine Learning cost estimates

The main cost driver for machine learning workloads is the compute cost. Those resources are needed to run the
training model and host the deployment. For information about choosing a compute target, see What are compute
targets in Azure Machine Learning?.
The compute cost depends on the cluster size, node type, and number of nodes. Billing applies while the cluster
nodes are starting, running, or shutting down.
With services such as Azure Machine Learning, you have the option of creating fixed-size clusters or using autoscaling.

If the amount of compute is not known, start with a zero-node cluster. The cluster will scale up when it
detects jobs in the queue. A zero-node cluster is not charged.

Fixed-size clusters are appropriate for jobs that run at a constant rate, where the amount of compute is known and
measured beforehand. The time taken to spin a cluster up or down incurs additional cost.

If you don’t need retraining frequently, turn off the cluster when not in use.

To lower the cost for experimental or development workloads, choose Spot VMs. They aren't recommended for
production workloads because they might be evicted by Azure at any time. For more information, see Use Spot
VMs in Azure.
For more information about the services that make up a machine learning workload, see What are the machine
learning products at Microsoft?.
This article provides cost considerations for some technology choices. This is not meant to be an exhaustive list, but
a subset of options.

Azure Machine Learning


There are two editions, Enterprise and Basic. Training a model doesn't incur the machine learning service surcharge.
You're charged based on these factors:
The cost is driven by compute choices, such as the virtual machine sizes and the region in which they are
available. If you can commit to one or three years, choosing reserved instances can lower cost. For more
information, see Reserved instances.
As part of provisioning Machine Learning resources, additional resources are deployed, such as Azure
Container Registry, Azure Blob Storage, and Key Vault. You're charged per the pricing of those
individual services.
If you deploy models to a Kubernetes Service cluster, Machine Learning adds a surcharge on top of the
Kubernetes Service compute cost. This cost can be lowered through autoscaling.
For more information, see these articles:
Machine Learning pricing calculator
Azure Machine Learning
Reference architecture
Training of Python scikit-learn and deep learning models on Azure
Distributed training of deep learning models on Azure
Batch scoring of Python machine learning models on Azure
Batch scoring of deep learning models on Azure
Real-time scoring of Python scikit-learn and deep learning models on Azure
Machine learning operationalization (MLOps) for Python models using Azure Machine Learning
Batch scoring of R machine learning models on Azure
Real-time scoring of R machine learning models on Azure
Batch scoring of Spark machine learning models on Azure Databricks
Enterprise-grade conversational bot
Build a real-time recommendation API on Azure

Azure Cognitive Services


The billing depends on the type of service. The charges are based on the number of transactions for each type of
operation specific to a service. A certain number of transactions are free. If you need additional transactions, choose
from the Standard instances. For more information, see:
Cognitive services pricing calculator
Cognitive services pricing
Reference architecture
Build an enterprise-grade conversational bot

Azure Bot Service


The Azure Bot Service is a managed service purpose-built for enterprise-grade bot development. Billing is based
on the number of messages. A certain number of messages are free. If you need to create custom channels, choose
Premium channels, which can drive up the cost of the workload.
For a Web App Bot, an Azure App Service is provisioned to host the bot, along with an instance of Application Insights.
You're charged per the pricing of those individual services.
Reference architecture
Enterprise-grade conversational bot
Big data analytics cost estimates

Most big data workloads are designed to do:


Batch processing of big data sources at rest.
Stream processing of data in motion.
Those workloads have different needs. Batch processing is done with long-running batch jobs. For stream
processing, the data ingestion component should be able to capture, and in some cases buffer and store, real-time
messages. Both workloads also need to store a large volume of data and then filter, aggregate, and
prepare that data for analysis.
For information about choosing technologies for each workload, see these articles:
Batch processing
Technology choices for batch processing
Capability matrix
Stream processing
What are your options when choosing a technology for real-time processing?
Capability matrix
This article provides cost considerations for some of those choices. This is not meant to be an exhaustive list, but a
subset of options.

Azure Synapse Analytics


The analytics resources are measured in Data Warehouse Units (DWUs), which track CPU, memory, and IO. The DWU
count also indicates the required level of performance. If you need higher performance, add more DWU blocks.
You can provision the resources in one of two service levels.
Compute Optimized Gen1 tracks usage in DWUs and is offered in a pay-as-you-go model.
Compute Optimized Gen2 tracks compute DWUs (cDWUs), which allow you to scale the compute nodes.
This level is intended for intensive workloads with higher query performance and compute scalability. You can
choose the pay-as-you-go model or save 37% to 65% by using reserved instances if you can commit to one or
three years. For more information, see Reserved instances.

Start with smaller DWUs and measure performance for resource intensive operations, such as heavy
data loading or transformation. This will help you determine the number of units you need to increase or
decrease. Measure usage during the peak business hours so you can assess the number of concurrent queries
and accordingly add units to increase the parallelism. Conversely, measure off-peak usage so that you can
pause compute when needed.

In Azure Synapse Analytics, you can import or export data from an external data store, such as Azure Blob Storage
or Azure Data Lake Store. Storage and analytics resources aren't included in the Synapse price. There's also additional
bandwidth cost for moving data into and out of the data warehouse.
For more information, see these articles:
Azure Synapse Pricing
Manage compute in Azure Synapse Analytics data warehouse
Reference architecture
Automated enterprise BI with Azure Synapse Analytics and Azure Data Factory
Enterprise BI in Azure with Azure Synapse Analytics

Azure Databricks
Azure Databricks offers two SKUs, Standard and Premium, each with these options, listed in order from least to
most expensive.
Data Engineering Light is for data engineers to build and execute jobs in automated Spark clusters.
Data Engineering includes autoscaling and has features for machine learning flows.
Data Analytics includes the preceding set of features and is intended for data scientists to explore, visualize,
manipulate, and share data and insights interactively.
Choose a SKU depending on your workload. If you need features like audit logs, which are available in Premium, the
overall cost can increase. If you need autoscaling of clusters to handle larger workloads or interactive Databricks
dashboards, choose an option higher than Data Engineering Light.
Here are factors that impact Databricks billing:
Azure location where the resource is provisioned.
The virtual machine instance tier and number of hours the instances were running.
Databricks units (DBUs), a unit of processing capability per hour, billed on per-second usage.

The example is based on the current price and is subject to change. The calculation shown is for information
purposes only.

Suppose you run a Premium cluster for 100 hours in East US 2 with 10 DS13v2 instances.

ITEM                                     EXAMPLE ESTIMATE

Cost for 10 DS13v2 instances             100 hours x 10 instances x $0.741/hour = $741.00

DBU cost for Data Analytics workload     100 hours x 10 instances x 2 DBU per node x $0.55/DBU = $1,100

Total                                    $1,841

For more information, see Azure Databricks Pricing.
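The example above can be reproduced with a short calculation; the rates are the ones quoted in the table and are subject to change.

```python
# Reproduce the Premium-cluster example above: 100 hours, 10 DS13v2 instances,
# 2 DBU per node, using the per-hour rates quoted in the table.
hours = 100
instances = 10
vm_rate_per_hour = 0.741      # $ per DS13v2 instance hour
dbu_per_node = 2
dbu_rate = 0.55               # $ per DBU

vm_cost = hours * instances * vm_rate_per_hour
dbu_cost = hours * instances * dbu_per_node * dbu_rate
print(f"VM cost: ${vm_cost:,.2f}, DBU cost: ${dbu_cost:,.2f}, "
      f"total: ${vm_cost + dbu_cost:,.2f}")   # -> $741.00, $1,100.00, total $1,841.00
```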


If you can commit to one or three years, opt for reserved instances, which can save 38% - 59%. For more
information, see Reserved instances.

Turn off the Spark cluster when not in use to prevent unnecessary charges.

Reference architecture
Stream processing with Azure Databricks
Build a Real-time Recommendation API on Azure
Batch scoring of Spark models on Azure Databricks
Azure Stream Analytics
Stream Analytics uses streaming units (SUs) to measure the amount of compute, memory, and throughput required
to process data. When provisioning a stream processing job, you're expected to specify an initial number of SUs.
Higher streaming units mean higher cost because more resources are used.
Stream processing with low latency requires a significant amount of memory. This resource is tracked by the SU%
utilization metric. Higher utilization indicates that the workload requires more compute resources. You can set an
alert on the 80% SU utilization metric to prevent resource exhaustion.

To evaluate the number of units you need, process an amount of data that is realistic for your production
level workload, observe the SU% Utilization metric, and accordingly adjust the SU value.
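As a rough illustration, the sketch below adjusts the SU count based on the observed SU% utilization. The 80% threshold comes from the alert guidance above; the scale-down threshold is an assumption for the example.

```python
# Sketch: adjust Stream Analytics streaming units (SUs) from the observed SU% metric.
# The 80% threshold comes from the alert guidance above; the 30% scale-down
# threshold is an assumption for this example.
def suggest_streaming_units(current_sus: int, su_percent_utilization: float) -> int:
    if su_percent_utilization >= 80:      # near exhaustion: add capacity
        return current_sus * 2
    if su_percent_utilization <= 30:      # sustained headroom: consider scaling down
        return max(current_sus // 2, 1)
    return current_sus                    # within the comfortable band

print(suggest_streaming_units(6, 85))   # -> 12
print(suggest_streaming_units(6, 20))   # -> 3
```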

You can create stream processing jobs in Azure Stream Analytics and deploy them to devices running Azure IoT
Edge through Azure IoT Hub. The number of devices impacts overall cost. Billing starts when a job is deployed to
devices, regardless of the job status (running, failed, stopped).
SUs are based on the partition configuration for the inputs and the query that's defined within the job. For more
information, see Calculate the max streaming units for a job and Understand and adjust Streaming Units.
For pricing details, see Azure Stream Analytics pricing.
Reference architecture
Batch scoring of Python machine learning models on Azure
Azure IoT reference architecture

Azure Analysis Services


Big data solutions need to store data that can be used for reporting. Azure Analysis Services supports the creation
of tabular models to meet this need.
Here are the tiers:
Developer is recommended for evaluation, development, and test scenarios.
Basic is recommended for small production environments.
Standard is recommended for mission-critical workloads.
Analysis Services uses Query Processing Units (QPUs) to determine the processing power. A QPU is an abstracted
measure of compute and data processing resources that impacts performance. A higher QPU count results in higher
performance.
Each tier offers one or more instances. The main cost drivers are the QPUs and memory allocated for the tier
instance. Start with a smaller instance, monitor the QPU usage, and scale up or down by selecting a higher or lower
instance within the tier. Also, monitor usage during off-peak hours. You can pause the server when not in use. No
charges apply when you pause your instance. For more information, see these articles:
The right tier when you need it
Monitor server metrics
Azure Analysis Services pricing
Reference architecture
Enterprise business intelligence - Azure Reference Architectures
Automated enterprise BI - Azure Architecture Center

Azure Data Factory V2


Azure Data Factory is a big data orchestrator. The service transfers data to and from diverse types of data stores. It
transforms the data by using other compute services. It creates workflows that automate data movement and
transformation. You are only charged for consumption. Consumption is measured by these factors:
Pipeline activities that take actions on the data. Those actions include copying the data from various
sources, transforming it, and controlling the flow. For more information, see data movement activities, data
transformation activities, and control activities.
You are charged for the total number of activities, in thousands.
Executions, measured in data integration units (DIUs). Each unit tracks CPU, memory, and network resource
allocation. This measurement applies to the Azure Integration Runtime.
You are also charged for execution activities, such as copying data, lookups, and external activities. Each activity is
individually priced. You are also charged for pipelines with no associated triggers or runs within the month. All
activities are prorated by the minute and rounded up.
Reference architecture
Automated enterprise BI - Azure Architecture Center
Enterprise business intelligence - Azure Reference Architectures
Build an enterprise-grade conversational bot - Azure Architecture Center
Compute cost estimates

Compute refers to the hosting model for the computing resources that your application runs on. Whether your
hosting model is Infrastructure as a Service (IaaS), Platform as a Service (PaaS), or Function as a Service (FaaS),
each resource requires evaluation to understand the tradeoffs that can be made and their impact on your cost. To
learn more about hosting models, read Understand the hosting models.
Infrastructure-as-a-Service (IaaS) lets you provision individual virtual machines (VMs) along with the
associated networking and storage components. Then you deploy whatever software and applications you
want onto those VMs. This model is the closest to a traditional on-premises environment, except that
Microsoft manages the infrastructure. You still manage the individual VMs.
Platform-as-a-Service (PaaS) provides a managed hosting environment, where you can deploy your
application without needing to manage VMs or networking resources. Azure App Service is a PaaS service.
Functions-as-a-Service (FaaS) goes even further in removing the need to worry about the hosting
environment. In a FaaS model, you simply deploy your code and the service automatically runs it. Azure
Functions is a FaaS service.
What are the cost implications to consider for choosing a hosting model?

If your application can be broken down into short pieces of code, a FaaS hosting model might be the best choice.
You’re charged only for the time it takes to execute your code. For example, Azure Functions is a FaaS service that
processes events with serverless code. Azure Functions allows you to run short pieces of code (called functions)
without worrying about application infrastructure. Use one of the three Azure Functions pricing plans to fit your
need. To learn more about the pricing plans, see How much does Functions cost?
If you want to deploy a larger or more complex application, PaaS may be the better choice. With PaaS, your
application is always running, as opposed to FaaS, where your code is executed only when needed. Since more
resources are used with PaaS, the price increases.
If you are migrating your infrastructure from on-premises to Azure, IaaS can greatly reduce and optimize
infrastructure costs and the salaries of IT staff who are no longer needed to manage the infrastructure. However, because
IaaS uses more resources than PaaS and FaaS, your cost could be the highest of the three models.
What are the main cost drivers for Azure services?

You will be charged differently for each service depending on your region, licensing plan (e.g., Azure Hybrid Benefit
for Windows Server), number and type of instances you need, operating system, lifespan, and other parameters
required by the service. Assess the need for each compute service by using the flowchart in Choose a candidate
service. Consider the tradeoffs that will impact your cost by creating different estimates using the Pricing
Calculator. If your application consists of multiple workloads, we recommend that you evaluate each workload
separately. See Consider limits and costs to perform a more detailed evaluation on service limits, cost, SLAs, and
regional availability.
Are there payment options for Virtual Machines (VMs) to help meet my budget?

The best choice is driven by the business requirements of your workload. If you have higher SLA requirements, it
will increase overall costs. You will likely need more VMs to ensure uptime and connectivity. Other factors that will
impact cost are region, operating system, and the number of instances you choose. Your cost also depends on the
workload life span. See Virtual machines and Use Spot VMs in Azure for more details.
Pay as you go lets you pay for compute capacity by the second, with no long-term commitment or upfront
payments. This option allows you to increase or decrease compute capacity on demand. It is appropriate for
applications with short-term, spiky, or unpredictable workloads that cannot be interrupted. You can also
start or stop usage at any time, resulting in paying only for what you use.
Reserved Virtual Machine Instances let you purchase a VM for one or three years in a specified region
in advance. This option is appropriate for applications with steady-state usage. You may save money compared
to pay-as-you-go pricing.
Spot pricing lets you purchase unused compute capacity at major discounts. If your workload can tolerate
interruptions, and its execution time is flexible, then using spot pricing for VMs can significantly reduce the
cost of running your workload in Azure.
Dev-Test pricing offers discounted rates on Azure to support your ongoing development and testing. Dev-
Test allows you to quickly create consistent development and test environments through a scalable, on-
demand infrastructure. This will allow you to spin up what you need, when you need it, and explore
scenarios before going into production. To learn more about Azure Dev-Test reduced rates, see Azure
Dev/Test Pricing.
For details on available sizes and options for the Azure VMs you can use to run your apps and workloads, see Sizes
for virtual machines in Azure. For details on specific Azure VM types, see Virtual Machine series.
Do I pay extra to run large-scale parallel and high-performance computing (HPC) batch jobs?

Use Azure Batch to run large-scale parallel and HPC batch jobs in Azure. You can save on VM cost because multiple
apps can run on one VM. Configure your workload with either the Low-priority tier (the cheapest option) or the
Standard tier (provides better CPU performance). There is no cost for the Azure Batch service. Charges accrue for
the underlying resources that run your batch workloads.

Use PaaS as an alternative to buying VMs


When you use the IaaS model, you do have final control over the VMs. It may appear to be a cheaper option at first,
but when you add operational and maintenance costs, the cost increases. When you use the PaaS model, these
extra costs are included in the pricing. In some cases, this means that PaaS services can be a cheaper than
managing VMs on your own. For help finding areas in the architecture where it may be natural to incorporate PaaS
options, see Managed services.
How can I cut costs by hosting my web apps in PaaS?

If you host your web apps in PaaS, you'll need to choose an App Service plan to run your apps. The plan defines
the set of compute resources on which your app runs. If you have more than one app, they run on the
same resources. This is where you see the most significant cost savings, as you don't incur cost for VMs.
If your apps are event-driven with a short-lived process using a microservices architecture style, we recommend
using Azure Functions. Your cost is determined by execution time and memory for a single function execution. For
pricing details, see Azure Functions pricing.
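The sketch below shows the general shape of that consumption-based calculation: a charge per execution plus a charge for memory multiplied by execution time (GB-seconds). The rates and free grants are placeholders for illustration; see Azure Functions pricing for current figures.

```python
# Rough shape of the Azure Functions consumption-plan bill: executions plus
# GB-seconds (memory x duration). Rates and free grants are placeholders only;
# see the Azure Functions pricing page for current figures.
def monthly_functions_cost(executions: int, avg_duration_s: float, avg_memory_gb: float,
                           rate_per_million_executions: float = 0.20,
                           rate_per_gb_second: float = 0.000016,
                           free_gb_seconds: float = 400_000,
                           free_executions: int = 1_000_000) -> float:
    gb_seconds = executions * avg_duration_s * avg_memory_gb
    exec_cost = max(executions - free_executions, 0) / 1_000_000 * rate_per_million_executions
    gbs_cost = max(gb_seconds - free_gb_seconds, 0) * rate_per_gb_second
    return exec_cost + gbs_cost

# 3 million executions, 0.5 s average duration, 256 MB average memory
print(f"${monthly_functions_cost(3_000_000, 0.5, 0.25):.2f}")
```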
Is it more cost-effective to deploy development testing (dev-test) on a PaaS or IaaS hosting model?

If your dev-test is built on Azure managed services such as Azure DevOps, Azure SQL Database, Azure Cache for
Redis, and Application Insights, the cheapest solution might be the PaaS hosting model. You won't incur the
cost and maintenance of hardware. If your dev-test instead uses Azure DevOps, Azure DevTest Labs, VMs, and
Application Insights, you need to add the cost of the VMs, which can greatly increase your cost. For details on
evaluating, see Azure Dev/Test Pricing.
A special case of PaaS - Containers
How can I get the best cost savings for a containerized workload that requires full orchestration?

Your business requirements may necessitate that you store container images so that you have fast, scalable
retrieval, and network-close deployment of container workloads. Although there are choices as to how you will run
them, we recommend that you use AKS to set up instances with a minimum of three (3) nodes. AKS reduces the
complexity and operational overhead of managing Kubernetes by offloading much of that responsibility to Azure.
There is no charge for AKS cluster management, and any additional costs are minimal. The containers themselves have
no impact on cost; you pay only for the underlying node VMs, billed per second, in the machine sizes you choose.
Can I save money if my containerized workload does not need full orchestration?

Your business requirements may not necessitate full orchestration. If this is the case and you are using App Service
containers, we recommend that you use one of the App Service plans. Choose the appropriate plan based on your
environment and workload.
There is no charge to use SNI-based SSL. Standard and Premium service plans include the right to use one IP SSL
at no additional charge. Free and shared service plans do not support SSL. You can purchase the right to use
additional SSL connections for a fee. In all cases the SSL certificate itself must be purchased separately. To learn
more about each plan, see App services plans.
Where's the savings if my workload is event-driven with a short-lived process?

In this example, Service Fabric may be a better choice than AKS. The biggest difference between the two is that AKS
works only with Docker applications using Kubernetes, while Service Fabric works with microservices and supports
many different runtime strategies. Service Fabric can deploy Docker and Windows Server containers. Like AKS,
Service Fabric is built for microservice and event-driven architectures. AKS is strictly a container orchestrator and
handles deployments, whereas Service Fabric also offers a development framework that allows building
stateful/stateless applications.

Predict cost estimates using the Pricing Calculator


Use the Pricing Calculator to create your total cost estimate. After you run your initial scenario, you may find that
your plan is beyond the scope of your budget. You can adjust the overall cost and create various cost scenarios to
make sure your needs are met before you commit to purchasing.

NOTE
The costs in this example are based on the current price and are subject to change. The calculation is for information
purposes only. It shows the Collapsed view of the cost in this estimate.
TIP
You can start building your cost estimate at any time and re-visit it later. The changes will be saved until you modify or delete
your estimate.

Next steps
Provisioning cloud resources to optimize cost
Virtual Machines documentation
VM payment options
Pricing Calculator
Data store cost estimates

Most cloud workloads adopt the polyglot persistence approach. Instead of using one data store service, a mix of
technologies is used. You can achieve optimal cost benefit from using this approach.
Each Azure data store has a different billing model. To establish a total cost estimate, first, identify the business
transactions and their requirements. Then, break each transaction into operations. Lastly, identify a data store
appropriate for the type of data. Do this for each workload separately.
Let's take an example of an e-commerce application. It needs to store data for transactions such as orders,
payments, and billing. The data structure is predetermined and not expected to change frequently. Data integrity
and consistency are crucial. There's a need to store product catalogs, social media posts, and product reviews. In
some cases, the data is unstructured, and is likely to change over time. Media files must be stored and data must
be stored for auditing purposes.
Learn about data stores in Understand data store models.

Guidelines: Identify the business transactions and their requirements


The following list of questions addresses the requirements that can have the greatest impact on your cost estimate. For
example, your monthly bill may be within your budget now, but if you scale up or add storage space later, your cost
may increase well over your budget.
Will your data need to be migrated to on-premises, external data centers, or other cloud hosting environments?
What type of data are you intending to store?
How large are the entities you need to store?
What is the overall amount of storage capacity you need?
What kind of schemas will you apply to your data (e.g., fixed schema, schema-on-write, or schema-on-read)?
What are your data performance requirements (e.g., acceptable response times for querying and aggregation of
data once ingested)?
What level of fault-tolerance do you need to provide for data consumers?
What kind of data replication capabilities do you require?
Will the limits of a particular data store support your requirements for scale, number of connections, and
throughput?
How many instances will need to run to support your uptime and throughput requirements? (Consider
operations costs in this calculation.)
Can you partition your data to store it more cost effectively (e.g., can you move large objects out of an
expensive relational database into an object store)?
There are other requirements that may not have as great an impact on your cost. For example, the price in the US East
region is only slightly lower than in Canada Central. See Criteria for choosing a data store for additional business
requirements.
Use the Pricing Calculator to determine different cost scenarios.
What are the network requirements that may impact cost?

Do you need to restrict or otherwise manage access to your data from other network resources?
Does data need to be accessible only from inside the Azure environment?
Does the data need to be accessible from specific IP addresses or subnets?
Does the data need to be accessible from applications or services hosted on-premises or in other external data
centers?

Guidelines: Break each transaction into operations


For example, one business transaction can be processed by these distinct operations:
10-15 database calls
three append operations on blob, and
two list operations on two separate file shares
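A sketch of how those per-operation counts can be rolled up into a per-transaction and monthly estimate; the unit prices and monthly volume are hypothetical placeholders.

```python
# Sketch: cost of one business transaction broken into operations, then scaled
# to a monthly volume. Unit prices are hypothetical placeholders, not list prices.
operations_per_transaction = {
    "database call": 12,        # midpoint of the 10-15 calls in the example above
    "blob append": 3,
    "file share list": 2,
}
price_per_operation = {         # $ per single operation (hypothetical)
    "database call": 0.0000020,
    "blob append": 0.0000065,
    "file share list": 0.0000060,
}

cost_per_transaction = sum(count * price_per_operation[op]
                           for op, count in operations_per_transaction.items())
monthly_transactions = 5_000_000
print(f"Per transaction: ${cost_per_transaction:.7f}")
print(f"Per month ({monthly_transactions:,} transactions): "
      f"${cost_per_transaction * monthly_transactions:.2f}")
```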
How does the data type in each operation affect cost?

Consider Azure Blob Storage Block Blobs instead of storing binary image data in Azure SQL Database. Blob
storage is cheaper than Azure SQL Database.
If your design requires SQL, store a lookup table in SQL Database and retrieve the document when needed to
serve it to the user in your application middle tier. SQL Database is highly targeted for high-speed data lookups
and set-based operations.
The hot access tier of Azure block blob storage is cheaper than the equivalent size of the Premium SSD
volume that hosts the database.
Read this decision chart to make your choices.

Guidelines: Identify a data store appropriate for the type of data


NOTE
An inappropriate data store or one that is mis-configured can have a huge cost impact on your design.

Relational database management systems (RDBMS) cost


RDBMS is the recommended choice when you need strong consistency guarantees. An RDBMS typically supports a
schema-on-write model, where the data structure is defined ahead of time, and all read or write operations must
use the schema.
How can I save money if my data is on-premises and already on SQL Server?

If the on-premises data is already on a SQL Server instance, it might be a natural choice. The on-premises license with
Software Assurance can be used to bring down the cost if the workload is eligible for Azure Hybrid Benefit. This
option applies to Azure SQL Database (PaaS) and SQL Server on Azure Virtual Machines (IaaS).
For open-source databases such as MySQL, MariaDB, or PostgreSQL, Azure provides managed services that are
easy to provision.
What are some design considerations that will affect cost?

If the SLAs don't allow for downtime, can a read-only replica in a different region enable business continuity?
If the database in the other region must be read/write, how will the data be replicated?
Does the data need to be synchronous, or could consistency allow for asynchronous replication?
How important is it for updates made in one node to appear in other nodes, before further changes can be
made?
Azure Storage has several options to make sure data is copied and available when needed. Locally redundant
storage (LRS) synchronously replicates data within the primary region. If the entire primary datacenter is unavailable, the
replicated data is lost. At the more expensive end, geo-zone-redundant storage (GZRS, in preview) replicates data across
availability zones within the primary region and asynchronously copies it to a paired secondary region. Databases that
offer geo-redundant storage, such as SQL Database, are more expensive. Most other OSS RDBMS databases use
LRS storage, which contributes to the lower price range.
For more information, see automated backups.
Cosmos DB offers five consistency levels: strong, bounded staleness, session, consistent prefix, and eventual
consistency. Each level provides availability and performance tradeoffs and is backed by comprehensive SLAs. The
consistency level itself does not affect cost.
How can I minimize compute cost?

Higher throughput and IOPS require higher compute, memory, I/O, and storage limits. These limits are expressed
in a vCore model. With a higher vCore count, you buy more resources and consequently the cost is higher. Azure
SQL Database has more vCore options and allows you to scale in smaller increments. Azure Database for MySQL,
PostgreSQL, and MariaDB have fewer vCore options, and scaling up to a higher vCore count can cost more. MySQL provides in-
memory tables in the Memory Optimized tier, which can also increase the cost.
All options offer consumption and provisioned pricing models. With pre-provisioned instances, you save more if
you can commit to one or three years.
How is primary and backup storage cost calculated?

With Azure SQL Database, the initial 32 GB of storage is included in the price. For the other listed options, you need
to buy storage separately, which might increase the cost depending on your storage needs.
For most databases, there is no charge for the price of backup storage that is equal in size to primary storage. If
you need more backup storage, you will incur an additional cost.

Key/value and document databases cost


Azure Cosmos DB is recommended for key/value stores and document databases. Azure Cache for Redis is also
recommended for key/value stores.
For Cosmos DB, here are some considerations that will affect cost:
Can storage and item size be adjusted?
Can index policies be reduced or customized to bring down the extra cost for writes and eliminate the
requirement for additional throughput capacity?
If data is no longer needed, can it be deleted from the Azure Cosmos account? As an alternative, you can
migrate the old data to another data store such as Azure blob storage or Azure data warehouse.
For Azure Cache for Redis, there are no upfront costs or termination fees; you pay only for what you use, and billing
is per hour.

Graph databases cost


Cost considerations for graph data stores include the storage required by data and indexes used across all the
regions. For example, graphs need to scale beyond the capacity of a single server and make it easy to look up the
index when processing a query.
The Azure service that supports this is Gremlin API in Azure Cosmos DB. Cost is limited to the Azure Cosmos DB
usage. You pay for the operations you perform against the database and for the storage consumed by your data.
Charges are determined by the number of provisioned containers, the number of hours the containers were online,
and the provisioned throughput for each container.
If the on-premises data is already on an SQL server on Azure Virtual Machines (IaaS), the license with Software
Assurance can be used to bring down the cost if the workload is eligible for Azure Hybrid Benefit.

Data analytics cost


Cost considerations for data analytics stores include data storage, multiple servers to maximize scalability, and the
ability to access large data as external tables.
Azure has many services that support data analytics stores: Azure Synapse Analytics, Azure Data Lake, Azure Data
Explorer, Azure Analysis Services, HDInsight, and Azure Databricks. As an example of use, historical data is typically
stored in data stores such as blob storage or Azure Data Lake Storage Gen2, which are then accessed by Azure
Synapse, Databricks, or HDInsights as external tables.
When using Azure Synapse, you will only pay for the capabilities you opt in to use. During public preview, there
will not be a cost for provisioning an Azure Synapse workspace. Enabling a managed VNET and Customer
Managed Keys may incur a workspace fee after public preview. Pricing of workspaces with additional capabilities
will be announced at a future date.

Column family database cost


The main cost consideration for a column family database is that it needs to be massively scalable.
The Azure services that support column family databases are Azure Cosmos DB Cassandra API and HBase in
HDInsight.
For Azure Cosmos DB Cassandra API, you pay the cost of Cosmos DB, which includes database operations and
consumed storage. Cassandra API is open-source.
For HBase in HDInsight, you pay the cost of HDInsight, which includes instance size and number. HBase is open-
source.

Search engine database cost


Cost is incurred for a search engine database when applications need to search for information held in external data
stores. The database also needs to index massive volumes of data and provide near real-time access to these indexes.
Azure Cognitive Search is a search service that uses AI capabilities to identify and explore relevant content at scale.

Time series database cost


The main cost consideration for a time series database is the need to collect large amounts of data in real time
from a large number of sources. Although the records written to a time series database are generally small, there
are often a large number of records, and the total data size can grow rapidly, which drives up cost.
Azure Time Series Insights may be the best option to minimize cost.

Object storage cost


Cost considerations include storing large binary objects such as images, text files, video and audio streams, large
application data objects and documents, and virtual machine disk images.
Azure services that support object storage are Azure Blob Storage and Azure Data Lake Storage Gen2.

Shared files cost


The main cost consideration is having the ability to access files across a network. For example, given appropriate
security and concurrent access control mechanisms, sharing data in this way can enable distributed services to
provide highly scalable data access for performing basic, low-level operations such as simple read and write
requests.
The Azure service that supports shared files is Azure Files.
Azure messaging cost estimates

The messaging services in this article have no up-front cost or termination fees, and you pay only for what you use.
In some cases, it's advantageous to combine two messaging services to increase the efficiency of your messaging
system. See Crossover scenarios for examples.
Cost is based on the number of operations or throughput units used depending on the message service. Using the
wrong messaging service for the intent can lead to higher costs. Before choosing a service, first, determine the
intent and requirements of the messages. Then, consider the tradeoffs between cost and operations/throughput
units. For tradeoff examples, see Technology choices for a message broker.
Use the Azure Pricing calculator for help creating various cost scenarios.

Service Bus cost


Connect on-premises and cloud-based applications and services to implement highly secure messaging workflows
using Service Bus. Cost is based on messaging operations and number of connections. The Basic tier is the
cheapest. If you want more operations and features, choose the Standard or Premium tier. For example, Service Bus
Premium runs in dedicated resources to provide higher throughput and more consistent performance.
For pricing details, see Service Bus pricing.

Event Grid cost


Manage routing of all events from any source to any destination to simplify event-based app development using
Event Grid. Event Grid can route a massive number of events per second per region. Cost is based on number of
operations performed. Examples of some operations are event ingress, subscription delivery attempts,
management calls, and filtering by subject suffix.
For pricing details, see Event Grid pricing.

Event Hubs
Stream millions of events per second from any source to build dynamic data pipelines and immediately respond to
business challenges using Event Hubs. Cost is based on throughput units. A key difference between Event Grid and
Event Hubs is in the way event data is made available to subscribers. For more information, see Pull model.
For questions and answers on pricing, see Pricing.
For pricing, see Event Hubs pricing.
Networking cost estimates

Traffic routing and load balancing


Most workloads use a load balancing service to route and distribute traffic so that no single resource is
overloaded and response time is minimized while throughput is maximized.
Do you need to distribute traffic within a region or across regions?

Azure Front Door and Traffic Manager can distribute traffic to backends, clouds, or hybrid on-premises services
that reside in multiple regions. Otherwise, for regional traffic that moves within virtual networks or zonal and
zone-redundant service endpoints within a region, choose Application Gateway or Azure Load Balancer.
What is the type of traffic?

Load balancers such as Azure Application Gateway and Azure Front Door are designed to route and distribute
traffic to web applications or HTTP(S) endpoints. Those services support Layer 7 features such as SSL offload, web
application firewall, path-based load balancing, and session affinity. For non-web traffic, choose Traffic Manager or
Azure Load Balancer. Traffic Manager uses DNS to route traffic. Load Balancer supports Layer 4 features for all UDP
and TCP protocols.
Here’s the matrix for choosing load balancers by considering both dimensions.

SERVICE                  GLOBAL/REGIONAL       RECOMMENDED TRAFFIC

Azure Front Door Global HTTP(S)

Traffic Manager Global non-HTTP(S)

Application Gateway Regional HTTP(S)

Azure Load Balancer Regional non-HTTP(S)

For more information, see Choose a load balancing service.


Example cost analysis
Consider a web application that receives traffic from users across regions over the internet. To minimize the
request response time, the load balancer can delegate the responsibility from the web server to a different
machine. The application is expected to consume 10 TB of data. Routing rules are required to route incoming
requests to various paths. Also, we want to use Web Application Firewall (WAF) to secure the application through
policies.
By using the Azure Pricing Calculator, we can estimate the cost for these two services.

The example is based on the current price and is subject to change. The calculation shown is for information
purposes only.

Azure Front Door (West US), Zone 1.


Application Gateway (West US)
Consider a similar example where the type of traffic is changed. Instead, the application is a UDP streaming service
that is deployed across regions, and traffic goes over the internet. We can use a combination of Traffic Manager and
Azure Load Balancer. Traffic Manager is a simple routing service that uses DNS to direct clients to a specific service endpoint.
Here are the cost estimates:
Azure Traffic Manager

ITEM                                       EXAMPLE ESTIMATE

DNS queries                                100 million x $0.54 per million = $54.00

Basic health checks                        20 endpoints x $0.36/month = $7.20

Fast interval health checks                10 endpoints x $1.00/month = $10.00

External endpoints hosted outside Azure    10 endpoints x $0.54/month = $5.40

Real User Measurements                     100 million user measurements x $2.00 per million = $200.00

Traffic view                               100 million data points processed x $2.00 per million = $200.00

Total                                      $486.60

Azure Load Balancer

ITEM              EXAMPLE ESTIMATE

Rules             First 5 rules: $0.025/rule/hour x 730 hours = $18.25

                  Additional rules: 5 rules x $0.010/rule/hour x 730 hours = $36.50

Data processed    10 TB x $0.005/GB = $51.20

Total             $105.95

Peering
Do you need to connect virtual networks?

Virtual network peering connects Azure virtual networks to other virtual networks in the same Azure region or in different regions. Peering is often used in hub-and-spoke architectures.
An important consideration is the additional cost incurred on both the ingress and egress traffic that traverses the peering connection.

Keep the services of a workload that communicate most heavily within the same virtual network, zone, and/or region unless otherwise required. Prefer using virtual networks as shared resources for multiple workloads over a single virtual network per workload. This approach localizes traffic to a single virtual network and avoids the additional peering charges.

Example cost analysis


The example is based on the current price and is subject to change. The calculation shown is for information
purposes only.

Peering within the same region is cheaper than peering between regions or global regions. For instance, suppose you consume 50 TB per month by connecting two virtual networks in Central US. Using the current price, here is the incurred cost.

ITEM              EXAMPLE ESTIMATE

Ingress traffic   50 TB x $0.0100/GB = $512.00

Egress traffic    50 TB x $0.0100/GB = $512.00

Total             $1,024.00

Let’s compare that cost for cross-region peering between Central US and East US.

ITEM              EXAMPLE ESTIMATE

Ingress traffic   50 TB x $0.0350/GB = $1,792.00

Egress traffic    50 TB x $0.0350/GB = $1,792.00

Total             $3,584.00

Hybrid connectivity
There are two main options for connecting an on-premises datacenter to Azure datacenters:
Azure VPN Gateway can be used to connect a virtual network to an on-premises network through a VPN
appliance or to Azure Stack through a site-to-site VPN tunnel.
Azure ExpressRoute creates private connections between Azure datacenters and infrastructure that's on-premises or in a colocation environment.
What is the required throughput for cross-premises connectivity?

VPN gateway is recommended for development/test cases or small-scale production workloads where throughput
is less than 100 Mbps. Use ExpressRoute for enterprise and mission-critical workloads that access most Azure
services. You can choose bandwidth from 50 Mbps to 10 Gbps.
Another consideration is security. Unlike VPN Gateway traffic, ExpressRoute connections don't go over the public
internet. VPN Gateway traffic is secured by industry standard IPsec.
For both services, inbound transfers are free and outbound transfers are billed per the billing zone.
For more information, see Choose a solution for connecting an on-premises network to Azure.
This blog post provides a comparison of the two services.
Example cost analysis
This example compares the pricing details for VPN Gateway and ExpressRoute.

The example is based on the current price and is subject to change. The calculation shown is for information
purposes only.

Azure VPN Gateway


Suppose you choose the VpnGw2AZ tier, which supports availability zones, with 1-Gbps bandwidth. The workload needs 15 site-to-site tunnels and 1 TB of outbound transfer. The gateway is provisioned and available for 720 hours.

ITEM                        EXAMPLE ESTIMATE

VPN gateway hours           720 hours x $0.564 = $406.08

Site-to-Site (S2S) tunnels  The first 10 tunnels are free. For the additional tunnels, the cost is
                            5 tunnels x 720 hours x $0.015 per hour per tunnel = $54.00

Outbound transfer           1 TB (first 5 GB free) x $0.087/GB = $88.65

Total cost per month        $548.73

ExpressRoute
Let's choose the Metered Data plan in billing Zone 1.

ITEM                   EXAMPLE ESTIMATE

Circuit bandwidth      A 1-Gbps circuit has a fixed rate of $436.00 at the Standard price.

Outbound transfer      1 TB x $0.025/GB = $25.60

Total cost per month   $461.60

The main cost driver is outbound data transfer. ExpressRoute is more cost-effective than VPN Gateway when
consuming large amounts of outbound data. If you consume more than 200 TB per month, consider ExpressRoute
with the Unlimited Data plan where you're charged a flat rate.
For pricing details, see:
Azure VPN Gateway pricing
Azure ExpressRoute pricing
Networking resources provisioning

For design considerations, see Networking resource choices.

Azure Front Door


Azure Front Door billing is affected by outbound data transfers, inbound data transfers, and routing rules. The
pricing chart doesn't include the cost of accessing data from the backend services and transferring to Front Door.
Those costs are billed based on data transfer charges, described in Bandwidth Pricing Details.
Another consideration is Web Application Firewall (WAF) settings. Adding policies will drive up the cost.
For more information, see Azure Front Door Pricing.
Reference architecture
Highly available multi-region web application uses Front Door to route incoming requests to the primary region. If the application running in that region becomes unavailable, Front Door fails over to the secondary region.

Azure Application Gateway


There are two main pricing models:
Fixed price
You're charged for the time that the gateway is provisioned and available and the amount of data processed
by the gateway. For more information, see Application Gateway pricing.
Consumption price
This model applies to v2 SKUs that offer additional features such as autoscaling, Azure Kubernetes Service
Ingress Controller, zone redundancy, and others. You're charged based on the consumed capacity units. The
capacity units measure the compute resources, persistent connections, and throughput. Consumption price
is charged in addition to the fixed price.
For more information, see:
Application Gateway v2 pricing
Application Gateway pricing
Reference architecture
Microservices architecture on Azure Kubernetes Service (AKS) uses Application Gateway as the ingress
controller.
Securely managed web applications uses Application Gateway as a web traffic load balancer operating at
Layer 7 that manages traffic to the web application. Web Application Firewall (WAF) is enabled to enhance
security.

Azure ExpressRoute
There are two main pricing models:
Metered Data plan
There are two pricing tiers: Standard and Premium , which is priced higher. The tier pricing is based on the
circuit bandwidth.
If you don't need to access the services globally, choose Standard . With this tier, you can connect to regions
within the same zone at no additional cost. Outbound cross-zonal traffic incurs more cost.
Unlimited Data plan
All inbound and outbound data transfer is included in the flat rate. There are two pricing tiers: Standard and
Premium , which is priced higher.
Calculate your utilization and choose a billing plan. The Unlimited Data plan is recommended if you exceed
about 68% of utilization.
For more information, see Azure ExpressRoute pricing.
Reference architecture
Connect an on-premises network to Azure using ExpressRoute connects an Azure virtual network and an on-premises network by using ExpressRoute with VPN gateway failover.

Azure Firewall
Azure Firewall usage is charged at a fixed rate per deployment hour. There's an additional cost for the amount of data transferred.
There's no additional cost for a firewall deployed in an availability zone. However, there are additional costs for inbound and outbound data transfers associated with availability zones.
Compared to network virtual appliances (NVAs), Azure Firewall can save you up to 30-50%. For more information, see Azure Firewall vs NVA.
Reference architecture
Hub-spoke network topology in Azure
Deploy highly available NVAs

Azure Load Balancer


The service distributes inbound traffic according to the configured rules.
There are two tiers: Basic and Standard .
The Basic tier is free.
For the Standard tier, you are charged only for the number of configured load-balancing and outbound rules.
Inbound NAT rules are free. There's no hourly charge for the load balancer when no rules are configured.
See Azure Load Balancer Pricing for more information.
Reference architecture
Connect an on-premises network to Azure using ExpressRoute: Multiple subnets are connected through
Azure load balancers.
SAP S/4HANA in Linux on Azure: Distribute traffic to virtual machines in the application-tier subnet.
Extend an on-premises network using VPN: internal load balancer. Network traffic from the VPN gateway is
routed to the cloud application through an internal load balancer. The load balancer is located in the front-
end subnet of the application.
Azure VPN Gateway
When provisioning a VPN Gateway resource, choose between two gateway types:
VPN gateway - To send encrypted traffic across the public internet. Site-to-Site, Point-to-Site, and VNet-to-VNet connections all use a VPN gateway.
ExpressRoute gateway - To send network traffic on a private connection. This is used when configuring Azure ExpressRoute.
For a VPN gateway, select Route-based or Policy-based based on your VPN device and the kind of VPN connection you want to create. A route-based gateway allows point-to-site, inter-virtual-network, or multiple site-to-site connections. A policy-based gateway allows only one site-to-site tunnel and doesn't support point-to-site connections. As a result, route-based VPN is more expensive.
Next, you need to choose the SKU for a route-based VPN gateway. For development/test workloads, use Basic. For production workloads, use an appropriate Generation1 or Generation2 SKU. Each SKU comes in a range of sizes, and pricing depends on the type of VPN gateway because each type offers different levels of bandwidth and different site-to-site and point-to-site tunnel options. Some of those types also offer availability zones, which are more expensive. If you need higher bandwidth, consider Azure ExpressRoute.
VPN gateway can be the cost driver in a workload because charges are based on the amount of time that the
gateway is provisioned and available.
All inbound traffic is free; outbound traffic is charged according to the bandwidth pricing of the VPN type, which also varies by billing zone.
For more information, see
Hybrid connectivity
VPN Gateway Pricing.
Traffic across zones
Bandwidth Pricing Details.
Reference architecture
Extend an on-premises network using VPN connects the virtual network to the on-premises network through a
VPN device.

Traffic Manager
Traffic Manager uses DNS to route and load balance traffic to service endpoints in different Azure regions, so an important use case is disaster recovery. In a workload, you can use Traffic Manager to route incoming requests to the primary region. If that region becomes unavailable, Traffic Manager fails over to the secondary region. There are other features that can make the application highly responsive and available, but those features cost money:
Determine the best web app to handle requests based on geographic location.
Configure caching to reduce the response time.
Traffic Manager isn't charged for bandwidth consumption. Billing is based on the number of DNS queries received,
with a discount for services receiving more than 1 billion monthly queries. You're also charged for each monitored
endpoint.
Reference architecture
Multi-region N-tier application uses Traffic Manager to route incoming requests to the primary region. If that
region becomes unavailable, Traffic Manager fails over to the secondary region. For more information, see the
section Traffic Manager configuration.
DNS query charges
Traffic Manager uses DNS to direct clients to a specific service endpoint.
Only DNS queries that reach Traffic Manager are charged, in million-query units. For 100 million DNS queries a month, the charge is $54.00 a month based on the current Traffic Manager pricing.
Not all DNS queries reach Traffic Manager. Recursive DNS servers run by enterprises and ISPs first attempt to
resolve the query by using cached DNS responses. Those servers query Traffic Manager at a regular interval to get
updated DNS entries. That interval value or TTL is configurable in seconds. TTL can impact cost. Longer TTL
increases the amount of caching and reduces DNS query charges. Conversely, shorter TTL results in more queries.
However, there is a tradeoff: increased caching also affects how often the endpoint status is refreshed. For example, user failover times after an endpoint failure become longer.
Health monitoring charges
When Traffic Manager receives a DNS request, it chooses an available endpoint based on configured state and
health of the endpoint. To do this, Traffic Manager continually monitors the health of each service endpoint.
You're charged for the number of monitored endpoints. You can add endpoints for services hosted in Azure and then add on endpoints for services hosted on-premises or with a different hosting provider. The external endpoints are more expensive, but health checks can make applications highly available and resilient to endpoint failures, including Azure region failures.
Real User Measurement charges
Real User Measurements evaluates network latency from the client applications to Azure regions. That influences
Traffic Manager to select the best Azure region in which the application is hosted. The number of measurements
sent to Traffic Manager is billed.
Traffic view charges
By using Traffic View, you can get insight into the traffic patterns where you have endpoints. The charges are based
on the number of data points used to create the insights presented.

Virtual Network
Azure Virtual Network is free. You can create up to 50 virtual networks across all regions within a subscription.
Here are a few considerations:
Inbound and outbound data transfers are charged per billing zone. Traffic that moves across regions and billing zones is more expensive. For more information, see:
Traffic across zones
Bandwidth Pricing Details.
Virtual network peering incurs additional cost. Peering within the same region is cheaper than peering between regions or global regions. Inbound and outbound traffic is charged at both ends of the peered networks. For more information, see Peering.
Managed services (PaaS) don't always need a virtual network. The cost of networking is included in the
service cost.
Web application cost estimates

Web applications (apps) have no up-front cost or termination fees. Some charge only for what you use, and others charge per second or per hour. In addition, all web apps run in App Service plans. Together, these costs determine the total cost of a web app.
Use the Azure Pricing calculator to help create various cost scenarios.

App Service plans


The App Service plans include the set of compute resources needed for the web app to run. Except for Free tier, an
App Service plan carries a charge on the compute resources it uses. For a description of App Service plans, see
Azure App Service plan overview.
You can potentially save money by choosing one or more App Service plans. To help find the solution that meets
your business requirements, see Should I put an app in a new plan or an existing plan?
The pricing tier of an App Service plan determines what App Service features you get and how much you pay for
the plan. The higher the tier, the higher the cost. For details on the pricing tiers, see App Service Pricing.
Are there exceptions that impact App Service plan cost?

You don't get charged for using the App Service features that are available to you (e.g., configuring custom
domains, TLS/SSL certificates, deployment slots, backups, etc.). For a list of exceptions, see How much does my App
Service plan cost?

App Service cost


App Service plans are billed on a per second basis. If your solution includes several App Service apps, consider
deploying them to separate App Service plans. This approach enables you to scale them independently because
they run on separate instances, which saves unnecessary cost.
Azure App Service supports two types of SSL connections: Server Name Indication (SNI) SSL Connections and IP
Address SSL Connections. SNI-based SSL works on modern browsers while IP-based SSL works on all browsers. If
your business requirements allow, use SNI-based SSL instead of IP-based SSL. There is no charge for SNI-based.
IP-based SSL incurs a cost per connection.
See the SSL Connections section in App Service pricing for details.

API Management cost


If you want to publish APIs hosted on Azure, on-premises, and in other clouds more securely, reliably, and at scale,
use API Management. The pricing tier you choose depends on the features needed based on the business requirements. For example, depending on your scalability requirements, the number of units required to scale out ranges from 1 to 10. As the number of units increases, the cost of the pricing tier increases.
Self-hosted gateways are available in the Developer and Premium pricing tiers. There is an additional cost for this
service using the Premium tier. There is no cost for self-hosted gateways if using the Developer tier.
For pricing details, see API Management Pricing.
Content Delivery Network (CDN) cost
If you want to offer users optimal online content delivery, use a CDN in your workload. The amount of data transferred has the most significant impact on cost, so choose based on your business requirements. Data purchased in increments of TB costs significantly more than data purchased in increments of GB.
The provider you choose also can impact cost. Choose Microsoft as the provider to get the lowest cost.
Some data, such as shopping carts, search results, and other dynamic content, is not cacheable. CDN offers
Acceleration Data Transfers, also called Dynamic Site Acceleration (DSA), to accelerate this type of web content. The
price for DSA is the same across Standard and Premium profiles.
For pricing details, see Content Delivery Network pricing.

Azure Cognitive Search cost


To set up and scale a search experience quickly and cost-effectively, use Azure Cognitive Search. When you create
an Azure Cognitive Search service, a resource is created at a pricing tier (or SKU) that's fixed for the lifetime of the
service. Billing depends on the type of service. See Search API for a list of services.
The charges are based on the number of transactions for each type of operation specific to a service. A certain
number of transactions are free. If you need additional transactions, choose from the Standard instances.
For pricing details, see Azure Cognitive Search pricing.

Azure SignalR cost


Use SignalR for any scenario that requires pushing data from server to client in real time. For example, Azure
SignalR provides secure and simplified communication between client and app one-to-one (e.g., chat window) or
one-to-many (instant broadcasting, IoT dashboards, or notification to social network). To learn more about Azure
SignalR, see What is Azure SignalR Service?
The Free tier is not recommended for a production environment. With the Standard tier, you pay only for what you
use. A Standard tier is recommended as an enterprise solution because it offers a large number of concurrent
connections and messages.
For pricing details, see Azure SignalR Service pricing.

Notification Hubs cost


To broadcast push notifications to millions of users at once, or tailor notifications to individual users, use
Notification Hubs. Pricing is based on number of pushes. The Free tier is a good starting point for exploring push
capabilities but is not recommended for production apps. If you require more pushes and features such as
scheduled pushes or multi-tenancy, you can buy additional pushes per million for a fee.
See Push notifications with Azure Notification Hubs: Frequently asked questions for more information. For pricing
details, see Notification Hubs pricing.
Checklist - Monitor cost

Use this checklist to monitor the cost of the workload.


Gather cost data from diverse sources to create reports. Start with tools like Azure Advisor and
Azure Cost Management. Build custom reports relevant for the business by using Consumption APIs.
Cost reports
Review costs in cost analysis
Use resource tag policies to build reports. Tags can be used to identify the owners of systems or
applications and create custom reports.
Follow a consistent tagging standard
Video: How to review tag policies with Azure Cost Management
Use RBAC built-in roles for cost. Only give access to users who are intended to view and analyze cost
reports. The roles are defined per scope. For example, use the Cost Management Reader role to enable
users to view costs for their resources in subscriptions or resource groups.
Provide the right level of cost access
Azure RBAC scopes
Respond to alerts and have a response plan according to the constraints. Respond to alerts quickly
and identify possible causes and any required action.
Budget and alerts
Use cost alerts to monitor usage and spending
Adopt both proactive and reactive approaches for cost reviews. Conduct cost reviews at a regular
cadence to determine the cost trend. Also review reports that are created because of alerts.
Conduct cost reviews
Participate in central governance cost reviews
Analyze the cost at all scopes by using Cost analysis. Identify services that are driving the cost through
different dimensions, such as location, usage meters, and so on. Review whether certain optimizations are
bringing results. For example, analyze costs associated with reserved instances and Spot VMs against
business goals.
Quickstart: Explore and analyze costs with cost analysis
Detect anomalies and identify changes in business or applications that might have contributed to changes in cost. Focus on these factors:
Traffic pattern as the application scales.
Budget for the usage meters on resources.
Performance bottlenecks.
CPU utilization and network throughput.
Storage footprint for blobs, backups, archiving.
Use Visualization tools to analyze cost information.
Create visuals and reports with the Azure Cost Management connector in Power BI Desktop
Cost Management App
Set budgets and alerts

Azure Cost Management has an alert feature. Alerts are generated when consumption reaches a threshold.
Consider the metrics that each resource in the workload emits. For each metric, build alerts on baseline thresholds. This way, the admins can be alerted when the workload is using the services at capacity. The admins can then tune the resources to target SKUs based on the current load.
You can also set alerts on allowed budgets at the resource group or management groups scopes. Both cloud
services performance and budget requirements can be balanced through alerts on metrics and budgets.
Over time, the workload can be optimized to autoheal itself when alerts are triggered. For information about using
alerts, see Use cost alerts to monitor usage and spending.
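Budgets and their alert thresholds can also be created programmatically. The sketch below calls the Microsoft.Consumption budgets REST API with a bearer token from azure-identity; the subscription ID, amount, contact email, api-version, and payload shape are assumptions for illustration and should be checked against the current API reference before use.

```python
# Minimal sketch: create a monthly budget with an 80% alert threshold.
# Assumes the Microsoft.Consumption/budgets REST API; verify the api-version
# and payload shape against the current Azure REST reference.
import requests
from azure.identity import DefaultAzureCredential

subscription_id = "<subscription-id>"          # placeholder
token = DefaultAzureCredential().get_token("https://management.azure.com/.default")

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/providers/Microsoft.Consumption/budgets/monthly-workload-budget"
    f"?api-version=2019-10-01"
)
body = {
    "properties": {
        "category": "Cost",
        "amount": 1000,                        # monthly budget in the billing currency
        "timeGrain": "Monthly",
        "timePeriod": {"startDate": "2021-01-01T00:00:00Z"},
        "notifications": {
            "alert-at-80-percent": {
                "enabled": True,
                "operator": "GreaterThan",
                "threshold": 80,               # percent of the budget
                "contactEmails": ["ops@contoso.com"],
            }
        },
    }
}
resp = requests.put(url, json=body, headers={"Authorization": f"Bearer {token.token}"})
resp.raise_for_status()
print(resp.json())
```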

Respond to alerts
When you receive an alert, check the current consumption data. Budget alerts aren't triggered in real time. There
may be a delay between the alert and the current actual cost. Look for a significant difference between the cost value when the alert happened and the current cost. Next, conduct a cost review to discuss the cost trend, possible
causes, and any required action. For information about stakeholders in a cost review, see Cost reviews.
Determine short and long-term actions justified by business value. Can a temporary increase in the alert threshold
be a feasible fix? Does the budget need to be increased longer-term? Any increase in budget must be approved.
If the alert was caused because of unnecessary or expensive resources, you can implement additional Azure Policy
controls. You can also add budget automation to trigger resource scaling or shutdowns.

Revise budgets
After you identify and analyze your spending patterns, you can set budget limits for applications or business units.
You'll want to assign access to view or manage each budget to the appropriate groups. Setting several alert
thresholds for each budget can help track your burn down rate.
Generate cost reports

To monitor the cost of the workload, use Azure cost tools or custom reports. The reports can be scoped to business
units, applications, IT infrastructure shared services, and so on. Make sure that the information is consistently
shared with the stakeholders.

Azure cost tools


Azure provides cost tools that can help track cloud spend and make recommendations.
Azure Advisor
Azure Cost Management
Cost analysis
Cost analysis is a tool in Azure Cost Management that allows you to view aggregated costs over a period. This
view can help you understand your spending trends.
View costs at different scopes, such as for a resource group or specific resource tags. Cost Analysis provides built-in
charts and custom views. You can also download the cost data in CSV format to analyze with other tools.
For more information, see: Quickstart: Explore and analyze costs with cost analysis.

NOTE
There are many ways of purchasing Azure services. Not all are supported by Azure Cost Management. For example, detailed
billing information for services purchased through a Cloud Solution Provider (CSP) must be obtained directly from the CSP.
For more information about the supported cost data, see Understand cost management data.

Advisor recommendations
Azure Advisor recommendations for cost can highlight the over-provisioned services and ways to lower cost. For
example, the virtual machines that should be resized to a lower SKU, unprovisioned ExpressRoute circuits, and idle
virtual network gateways.
For more information, see Advisor cost management recommendations.

Consumption APIs
Granular and custom reports can help track cost over time. Azure provides a set of Consumption APIs to generate
such reports. These APIs allow you to query and create reports over various cost data. The data includes usage data for Azure services and third-party services through Marketplace, balances, budgets, and recommendations on reserved instances, among others. You can configure Azure role-based access control policies so that only a certain set of users or applications can access the data.
For example, you want to determine the cost of all resources used in your workload for a given period. One way of
getting this data is by querying usage meters and the rate of those meters. You also need to know the billing period
of the usage. By combining these APIs, you can estimate the consumption cost.
Billing account API: To get your billing account to manage your invoices, payments, and track costs.
Billing Periods API: To get billing periods that have consumption data.
Usage Detail API: To get the breakdown of consumed quantities and estimated charges.
Marketplace Store Charge API: To get usage-based marketplace charges for third-party services.
Price Sheet API: To get the applicable rate for each meter.
The result of the APIs can be imported into analysis tools.
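As a sketch of how these APIs can be combined, the following example pulls usage detail records for a subscription with the Usage Details API and sums the cost per meter. The endpoint path, api-version, and property names are assumptions based on the public Microsoft.Consumption API and can differ by subscription or offer type; verify them against the current reference.

```python
# Minimal sketch: summarize estimated charges per meter from the Usage Details API.
# The api-version and property names are assumptions; check the current
# Consumption API reference for your subscription/offer type.
from collections import defaultdict
import requests
from azure.identity import DefaultAzureCredential

subscription_id = "<subscription-id>"          # placeholder
token = DefaultAzureCredential().get_token("https://management.azure.com/.default")
headers = {"Authorization": f"Bearer {token.token}"}

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/providers/Microsoft.Consumption/usageDetails?api-version=2019-10-01"
)

cost_by_meter = defaultdict(float)
while url:
    page = requests.get(url, headers=headers)
    page.raise_for_status()
    data = page.json()
    for record in data.get("value", []):
        props = record.get("properties", {})
        meter = props.get("meterDetails", {}).get("meterName", "unknown")
        cost_by_meter[meter] += float(props.get("cost", 0) or 0)
    url = data.get("nextLink")                 # follow paging until exhausted

for meter, cost in sorted(cost_by_meter.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{meter:<40} {cost:>12.2f}")
```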

NOTE
Consumption APIs are supported for Enterprise Enrollments and Web Direct Subscriptions (with exceptions). Check
Consumption APIs for updates to determine support for other types of Azure subscriptions.

For more information about common cost scenarios, see Billing automation scenarios.

Custom scripts
Use Azure APIs to schedule custom scripts that identify orphaned or empty resources, such as unattached managed disks, load balancers, application gateways, or Azure SQL servers with no databases. These resources
incur a flat monthly charge while unused. Other resources may be stale, for example VM diagnostics data in blob or
table storage. To determine if the item should be deleted, check its last use and modification timestamps.
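A simple sketch of such a script is shown below. It uses the azure-mgmt-compute SDK to list managed disks that aren't attached to any VM; the subscription ID is a placeholder, attribute names can vary by SDK version, and the decision to delete is left to you.

```python
# Minimal sketch: find managed disks that aren't attached to any VM.
# These disks keep accruing a flat monthly charge while unused.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

subscription_id = "<subscription-id>"          # placeholder
compute = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

orphaned = [
    disk
    for disk in compute.disks.list()
    if disk.managed_by is None                 # not attached to a VM
]

for disk in orphaned:
    print(f"{disk.name:<40} {disk.location:<12} {disk.sku.name if disk.sku else '-'}")
# Review the list (and last-use/modification timestamps) before deleting anything.
```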

Analyze and visualize


Start with the usage details in the invoice. Review that information against relevant business data and events. If
there are anomalies, evaluate the significant changes in business or applications that might have contributed to those changes.
Power BI Desktop has a connector that allows you to connect billing data from Azure Cost Management. You can
create custom dashboards and reports, ask questions of your data, and publish and share your work.

NOTE
Sharing requires Power BI Premium licenses.

For more information, see Create visuals and reports with the Azure Cost Management connector in Power BI
Desktop.
You can also use the Cost Management App, which uses the Azure Cost Management Template app for Power BI. You
can import and analyze usage data and incurred cost within Power BI.
Conduct cost reviews

The goal of cost monitoring is to review cloud spend with the intent of establishing cost controls and preventing
any misuse. The organization should adopt both proactive and reactive review approaches for monitoring cost.
Effective cost reviews must be conducted by accountable stakeholders at a regular cadence. The reviews must be
complemented with reactive cost reviews, for example when a budget limit causes an alert.
Who should be included in a cost review?

The financial stakeholders must understand cloud billing so that they can use financial metrics to make effective decisions and derive business benefits.
Also include members of the technical team. Application owners, systems administrators who monitor and back up cloud systems, and business unit representatives must be aware of the decisions. They can provide insights
because they understand cloud cost metering and cloud architecture. One way of identifying owners of systems or
applications is through resource tags.
What should be the cadence of cost reviews?

Cost reviews can be conducted as part of the regular business reviews. It's recommended that such reviews are scheduled:
During the billing period. This review is to create an awareness of the estimated pending billing. These reports
can be based on Azure Advisor and Azure Cost Management – Cost analysis.
After the billing period. This review shows the actual cost with activity for that month. Use Balance APIs to
generate monthly reports. The APIs can query data that gets information on balances, new purchases, Azure
Marketplace service charges, adjustments, and overage charges.
Because of a budget alert or Azure Advisor recommendations.
Web Direct (pay-as-you-go) and Cloud Solution Provider (CSP) billing occurs monthly. While Enterprise Agreement
(EA) billing occurs annually, costs should still be reviewed monthly.
Checklist - Optimize cost

Continue to monitor and optimize the workload by using the right resources and sizes. Use this checklist to
optimize a workload.
Review the underutilized resources. Evaluate CPU utilization and network throughput over time to
check if the resources are used adequately. Azure Advisor identifies underutilized virtual machines. You can
choose to decommission, resize, or shut down the machine to meet the cost requirements.
Resize virtual machines
Shut down the underutilized instances
Continuously take action on the cost reviews. Treat cost optimization as a process, rather than a point-
in-time activity. Use tooling in Azure that provides recommendations on usage or cost optimization. Review
the cost management recommendations and take action. Make sure that all stakeholders are in agreement
about the implementation and timing of the change.
Recommended tab in the Azure portal
Recommendations in the Cost Management Power BI app
Recommendations in Azure Advisor
Recommendations using Reservation REST APIs
Use reserved instances for long-running workloads. Reserve prepaid capacity for a period, generally one or three years. With reserved instances, there's a significant discount compared with pay-as-you-go pricing.
Reserved instances
Use discounted prices. These methods of buying Azure resources can lower costs.
Azure Hybrid Use Benefit
Azure Reservations
There are also payment plans offered at a lower cost:
Microsoft Azure Enterprise Agreement
Enterprise Dev Test Subscription
Cloud Service Provider (Partner Program)
Have a scale-in and scale-out policy. In a cost-optimized architecture, costs scale linearly with demand. A growing customer base shouldn't require a disproportionate investment in infrastructure. Conversely, if demand drops, scale down unused resources. Autoscale Azure resources when possible.
Autoscale instances
Reevaluate design choices. Analyze the cost reports and forecast the capacity needs. You might need to change some design choices.
Choose the right storage tier. Consider using the hot, cool, or archive tiers for storage account data.
Storage accounts can provide automated tiering and lifecycle management. For more information,
see Review your storage options
Choose the right data store. Instead of using one data store service, use a mix of data stores depending on the type of data you need to store for each workload. For more information, see
Choose the right data store.
Choose Spot VMs for low-priority workloads. Spot VMs are ideal for workloads that can be
interrupted, such as highly parallel batch processing jobs.
Spot VMs
Optimize data transfer. Only deploy to multiple regions if your service levels require it for either
availability or geo-distribution. Data going out of Azure datacenters can add cost because pricing is
based on Billing Zones.
Traffic across billing zones and regions
Reduce load on servers. Use Azure Content Delivery Network (CDN) and caching services to reduce load on front-end servers. Caching is suitable for servers that continually render dynamic content that doesn't change frequently.
Use managed services. Measure the cost of maintaining infrastructure and replace it with Azure
PaaS or SaaS services.
Managed services
Autoscale instances

In Azure, it's easier to grow a service with little to no downtime than to scale it down, which usually requires deprovisioning or downtime. In general, opt for scaling out instead of scaling up.
For certain applications, capacity requirements may swing over time. Autoscaling policies allow for less error-prone operations and cost savings through robust automation.

Virtual machine instances


For autoscaling, consider the choice of instance size. The size can significantly change the cost of your workload.

Choose smaller instances where the workload is highly variable, and scale out rather than up to get the desired level of performance. This choice makes your cost calculations and estimates more granular.

Stateless applications
Many Azure services can be used to improve an application's ability to scale dynamically, even where the application wasn't originally designed to do so.
For example, many ASP.NET stateful web applications can be made stateless. Then they can be autoscaled, which results in a cost benefit. You can store state by using Azure Cache for Redis, or Azure Cosmos DB as a back-end session state store through a session state provider.
Reserved instances

Azure Reservations are offered on many services as a way to lower cost. You reserve a prepaid capacity for a
period, generally one or three years. With reserved instances, there’s a significant discount when compared with
pay-as-you-go pricing. You can pay up front or monthly; the price is the same either way.

Usage pattern
Before opting for reserved instances, analyze the usage data with pay-as-you-go prices over a period of time.
Do the services in the workload run intermittently or follow long-term patterns?

Azure provides several options that can help analyze usage and make recommendations by comparing
reservations prices with pay-as-you-go price. An easy way is to view the Recommended tab in the Azure portal.
Azure Advisor also provides recommendations that are applicable to an entire subscription. If you need a
programmatic way, use the Reservation Recommendations REST APIs.
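For the programmatic route, the sketch below calls the reservation recommendations endpoint at subscription scope; the path, api-version, and property names are assumptions based on the public Microsoft.Consumption API and should be verified against the current reference.

```python
# Minimal sketch: list VM reservation recommendations for a subscription.
# The endpoint path, api-version, and property names are assumptions; verify
# them against the current Reservation Recommendations API reference.
import requests
from azure.identity import DefaultAzureCredential

subscription_id = "<subscription-id>"          # placeholder
token = DefaultAzureCredential().get_token("https://management.azure.com/.default")

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/providers/Microsoft.Consumption/reservationRecommendations"
    f"?api-version=2019-10-01"
)
resp = requests.get(url, headers={"Authorization": f"Bearer {token.token}"})
resp.raise_for_status()

for item in resp.json().get("value", []):
    props = item.get("properties", {})
    print(
        item.get("sku"),
        props.get("recommendedQuantity"),
        props.get("netSavings"),
    )
```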
Reserved instances can lower cost for long running workloads. For intermittent workloads, prepaid capacity might
go unused and it doesn't carry over to the next billing period. Usage exceeding the reserved quantity is charged
using more expensive pay-as-you-go rates. But there are some flexible options. You can exchange or refund a
reservation within limits. For more information, see Self-service exchanges and refunds for Azure Reservations.

Scope
Reservations can be applied to a specific scope: subscription, resource group, or a single resource. Suppose you set the scope to a resource group; the reservation savings then apply to the applicable resources provisioned in that group. For more information, see Scope reservations.

Services, subscription, and offer types


Many services are eligible for reservations. This range covers virtual machines and a wide variety of managed
services. For information about the services that are eligible for reservations, see Charges covered by reservation.
Certain subscriptions and offer types support reservations. For a list of such subscriptions, see Discounted
subscription and offer types.
Virtual machines

Virtual machines are deployed in fixed-size blocks. These VMs must be adequately sized to meet capacity demand and reduce waste.
For example, consider a VM in an SAP on Azure project that was initially sized to match the existing hardware server, at a cost of around €1K per month, even though the real utilization of the VM was no more than 25%. Simply choosing the right VM size in the cloud can achieve roughly a 75% saving (resize saving), and applying snoozing can yield an additional 14% of savings:

Cost comparisons are easier when you're well equipped, and Microsoft provides a set of services and tools that help you understand and plan costs. These include the TCO Calculator, the Azure Pricing Calculator, Azure Cost Management (Cloudyn), Azure Migrate, the Cosmos DB Sizing Calculator, and the Azure Site Recovery Deployment Planner.
Here are some strategies that you can use to lower cost for virtual machines.

Resize virtual machines


You can lower cost by managing the size and the number of VMs.

Determine the load by analyzing the CPU utilization to make sure that the instance is adequately utilized.

Ideally, with the right size, the current load should fit in a lower SKU of the same tier. Another way is to lower the number of instances while keeping the load below a reasonable utilization level. Azure Advisor recommends keeping utilization below 80% for workloads that aren't user facing and below 40% for user-facing workloads. It also provides current and target SKU information.
You can identify underutilized machines by adjusting the CPU utilization rule on each subscription.
Resizing a virtual machine does require the machine to be shut down and restarted. There might be a period of
time when it will be unavailable. Make sure you carefully time this action for minimal business impact.
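If you want to automate that kind of right-sizing, a hedged sketch is shown below. It reads a week of average CPU for one VM from Azure Monitor and, if utilization stays low, resizes the VM to a smaller SKU. The 40% threshold, the lookback window, and the target size are assumptions for the example, and SDK method names can vary between versions; remember that the resize restarts the VM.

```python
# Minimal sketch: resize a consistently underutilized VM to a smaller SKU.
# The threshold, lookback window, and target size are example assumptions.
# Resizing restarts the VM, so schedule it for a low-impact window.
import datetime
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.monitor import MonitorManagementClient

subscription_id = "<subscription-id>"          # placeholders
resource_group, vm_name = "<rg>", "<vm-name>"
target_size = "Standard_D2s_v3"                # example smaller SKU

credential = DefaultAzureCredential()
compute = ComputeManagementClient(credential, subscription_id)
monitor = MonitorManagementClient(credential, subscription_id)

vm_id = (
    f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
    f"/providers/Microsoft.Compute/virtualMachines/{vm_name}"
)
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(days=7)

metrics = monitor.metrics.list(
    vm_id,
    timespan=f"{start.isoformat()}/{end.isoformat()}",
    interval="PT1H",
    metricnames="Percentage CPU",
    aggregation="Average",
)
samples = [
    point.average
    for metric in metrics.value
    for series in metric.timeseries
    for point in series.data
    if point.average is not None
]
avg_cpu = sum(samples) / len(samples) if samples else 0.0

if avg_cpu < 40:                               # example threshold for this sketch
    poller = compute.virtual_machines.begin_update(
        resource_group, vm_name, {"hardware_profile": {"vm_size": target_size}}
    )
    poller.result()
    print(f"Resized {vm_name} (avg CPU {avg_cpu:.1f}%) to {target_size}")
```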

Shut down the underutilized instances


Use the Start/stop VMs during off-hours feature of virtual machines to minimize waste. There are many configuration options to schedule start and stop times. The feature is suitable as a low-cost automation option. For information, see Start/stop VMs during off-hours solution in Azure Automation.
Azure Advisor evaluates virtual machines based on CPU and network utilization over a time period. Then it recommends actions, such as shutting down or resizing instances, along with the cost savings for each action.
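If the Azure Automation solution doesn't fit, a scheduled script can achieve a similar effect. The sketch below deallocates VMs that carry a hypothetical auto-shutdown tag so compute charges stop; it would be run on a schedule, for example from a cron job or an Automation runbook, and the tag convention is an assumption for this example.

```python
# Minimal sketch: deallocate VMs tagged for off-hours shutdown.
# Deallocating (not just stopping) releases the compute so billing stops;
# the 'auto-shutdown' tag is a hypothetical convention for this example.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

subscription_id = "<subscription-id>"          # placeholder
compute = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

for vm in compute.virtual_machines.list_all():
    if (vm.tags or {}).get("auto-shutdown") != "true":
        continue
    resource_group = vm.id.split("/")[4]       # extract the resource group from the resource ID
    print(f"Deallocating {vm.name} in {resource_group}")
    compute.virtual_machines.begin_deallocate(resource_group, vm.name).result()
```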

Spot VMs
Some workloads don't have a business requirement to complete a job within a period.
Can the workload be interrupted?

Spot VMs are ideal for workloads that can be interrupted, such as highly parallel batch processing jobs. These VMs
take advantage of the surplus capacity in Azure at a lower cost. They're also well suited for experimental,
development, and testing of large-scale solutions.
For more information, see Use Spot VMs in Azure.

Reserved VMs
Virtual machines are eligible for Azure Reservations. You can prepay for VM instances if you can commit to one or
three years. Reserved instances are appropriate for workloads that have a long-term usage pattern.
The discount only applies to compute and not the other meters used to measure usage for VMs. The discount can
be extended to other services that emit VM usage, such as Virtual machine scale sets and Container services, to
name a few. For more information, see Software costs not included with Azure Reserved VM Instances and Services
that get VM reservation discounts.
With reserved instances, you need to determine the VM size to buy. Analyze usage data by using the Reservations Consumption APIs and follow the recommendations of the Azure portal and Azure Advisor to determine the size.
Reservations also apply to dedicated hosts. The discount is applied to all running hosts that match the reservation
scope and attributes. An important consideration is the SKU for the host. When selecting a SKU, choose the VM
series and type eligible to be a dedicated host. For more information, see Azure Dedicated Hosts pricing.
For information about discounts on virtual machines, see How the Azure reservation discount is applied to virtual
machines.
Caching data

Caching is a strategy where you store a copy of the data in front of the main data store. The cache store is typically
located closer to the consuming client than the main store. Advantages of caching include faster response times
and the ability to serve data quickly. In doing so, you can save on the overall cost. Be sure to assess the built-in
caching features of Azure services used in your architecture. Azure also offers caching services, such as Azure
Cache for Redis or Azure CDN.
For information about what type of data is suitable for caching, see Caching.

Lower costs associated with reliability and latency


Caching can be a cheaper way of storing data, providing reliability, and reducing network latency.
Depending on the type of the data, determine if you need the complex capabilities of the backend data store,
such as data consistency. If the data is fully static, you can choose to store it only in a caching store. If the data
doesn't change frequently, consider placing a copy of the data in a caching store and refresh it from time to
time. For example, an application stores images in blob storage. Every time it requests an image, the business
logic generates a thumbnail from the main image and returns it to the caller. If the main image doesn't change
too often, then return the previously generated thumbnails stored in a cache. This way you can save on the resources required to process the image and lower the request response time.
If the backend is unavailable, the cache continues to handle requests by using the copy until the backend fails
over to the backup data store.
Caching can also be an effective way of reducing network latency. Instead of reaching the central server, the
response is sent by using the cache. That way, the client can receive the response almost instantaneously. If you
need higher network performance and the ability to support more client connections, choose a higher tier of the
caching service. However, higher tiers will incur more costs.
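The thumbnail scenario above is a classic cache-aside pattern. Here's a minimal sketch using Azure Cache for Redis through the standard redis client; the host, key naming, 24-hour TTL, and the render_thumbnail placeholder are assumptions for illustration.

```python
# Minimal cache-aside sketch against Azure Cache for Redis.
# Host, key naming, the 24-hour TTL, and render_thumbnail() are example assumptions.
import redis

cache = redis.Redis(
    host="<cache-name>.redis.cache.windows.net",   # placeholder host
    port=6380,
    password="<access-key>",
    ssl=True,
)

def render_thumbnail(image_id: str) -> bytes:
    """Placeholder for the expensive work (fetch blob, resize, encode)."""
    return f"thumbnail-bytes-for-{image_id}".encode()

def get_thumbnail(image_id: str) -> bytes:
    key = f"thumb:{image_id}"
    cached = cache.get(key)
    if cached is not None:                          # cache hit: skip the expensive render
        return cached
    thumbnail = render_thumbnail(image_id)
    cache.setex(key, 24 * 3600, thumbnail)          # refresh "from time to time" via TTL
    return thumbnail
```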

Caching can be expensive


Incorrect use of caching can result in severe business outcomes and higher costs.
Adding a cache will lead to multiple data sources in your architecture. There are added costs to keeping
them in sync. You may need to fill the cache before putting it in production. Filling the cache on the application's first access can introduce latency, and seeding the cache in advance can affect the application's start time. If you don't refresh the cache, your customers can get stale data.
Invalidate the cache at the right time when there's latest information in the source system. Use strategies to
age out the cache when appropriate.
To make sure the caching layer is working optimally, add instrumentation. That feature will add complexity
and implementation cost.
Caching services such as Azure Cache for Redis are billed based on the tier you choose. Pricing is primarily determined by the cache size and network performance, which are interdependent; a smaller cache increases latency. Before choosing a tier, establish a baseline estimate, for example by load testing with the expected number of users and cache size.
Tradeoffs for costs

As you design the workload, consider tradeoffs between cost optimization and other aspects of the design, such as
security, scalability, resilience, and operability.
What is most important for the business: lowest cost, no downtime, high throughput?

An optimal design doesn’t equate to a low-cost design. There might be risky choices made in favor of a cheaper
solution.

Cost versus reliability


Cost has a direct correlation with reliability.
Does the cost of high availability components exceed the acceptable downtime?

Overall Service Level Agreement (SLA), Recovery Time Objective (RTO), and Recovery Point Objective (RPO) requirements may lead to expensive design choices. If your SLA targets are high and your RTO and RPO times are short, then higher investment in high availability and disaster recovery options is inevitable.
For example, to support high availability, you choose to host the application across regions. This choice is costlier than a single region because of replication costs and the need to provision extra nodes. Data transfer between regions will also add cost.
If the cost of high availability exceeds the cost of downtime, you can save by using Azure platform-managed
replication and recover data from the backup storage.
For resiliency, availability, and reliability considerations, see the Reliability pillar.

Cost versus performance efficiency


Boosting performance will lead to higher cost.
Many factors impact performance.
Fixed or consumption-based provisioning. Avoid estimating the cost of a workload at consistently high utilization. Consumption-based pricing will be more expensive than the equivalent provisioned pricing. Smooth out the peaks to get a consistent flow of compute and data. Ideally, use manual scaling and autoscaling to find the right balance. Scaling up is more expensive than scaling out.
Azure regions. Cost scales directly with the number of regions. The savings from locating resources in cheaper regions shouldn't be negated by the cost of network ingress and egress or by degraded application performance because of increased latency.
Caching. Every render cycle of a payload consumes both compute and memory. You can use caching to reduce the load on servers and save on storage and bandwidth costs, and the savings can be dramatic, especially for static content services.
While caching can reduce cost, there are some performance tradeoffs. For example, Azure Traffic Manager pricing
is based on the number of DNS queries that reach the service. You can reduce that number through caching and
configure how often the cache is refreshed. Relying on the cache that isn't frequently updated will cause longer
user failover times if an endpoint is unavailable.
Batch or real-time processing. Using dedicated resources for batch processing of long-running jobs will increase the cost. You can lower cost by provisioning Spot VMs, but be prepared for the job to be interrupted every time Azure evicts the VM.
For performance considerations, see the Performance Efficiency pillar.

Cost versus security


Increasing security of the workload will increase cost.
As a rule, don't compromise on security. For certain workloads, you can't avoid security costs. For example, for
specific security and compliance requirements, deploying to differentiated regions will be more expensive.
Premium security features can also increase the cost. There are areas where you can reduce cost by using native security features. For example, avoid implementing custom RBAC roles if you can use built-in roles.
For security considerations, see the Security Pillar.

Cost versus operational excellence


Investing in systems monitoring and automation might increase the cost initially but over time will reduce cost.
IT operations processes like user or application access provisioning, incident response, and disaster recovery
should be integrated with the workload.
Maintaining your own infrastructure is more expensive. With PaaS or SaaS services, the infrastructure, platform management services, and additional operational efficiencies are included in the service pricing.
For operational considerations, see the Operational Excellence pillar.
Overview of the operational excellence pillar

This pillar covers the operations processes that keep an application running in production. Deployments must be
reliable and predictable. They should be automated to reduce the chance of human error. They should be a fast
and routine process, so they don't slow down the release of new features or bug fixes. Equally important, you
must be able to quickly roll back or roll forward if an update has problems.
To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, see the
Microsoft Azure Well-Architected Review.
These are the disciplines we group in the operational excellence pillar:

OPERATIONAL EXCELLENCE DISCIPLINES     DESCRIPTION

Application design                     Provides guidance on how to design, build, and orchestrate workloads with DevOps principles in mind

Monitoring                             Something that enterprises have been doing for years, enriched with some specifics for applications running in the cloud

Application performance management     The monitoring and management of performance and availability of software applications through DevOps

Code deployment                        How you deploy your application code is going to be one of the key factors that will determine your application stability

Infrastructure provisioning            Frequently known as "Automation" or "Infrastructure as code", this discipline refers to best practices for deploying the platform where your application will run

Testing                                Testing is fundamental to be prepared for the unexpected and to catch mistakes before they impact users
Application design for DevOps

Limits
Azure Resource Manager (ARM) enforces limits and quotas on how many resources of each type you can provision
per Azure Subscription, and even per Azure Region. Some limits are a hard maximum, while others are a soft limit
that can be increased upon request through a support case at no charge. When working with Virtual Machines,
App Service, Storage Accounts, Databases, and other resources in Azure, you can easily hit these limits, so it's important to know that they exist and how to work around them.
Azure subscription and service limits, quotas, and constraints

This sample shows how to query limits and quotas for commonly used networking resources, virtual machines, SQL databases, and storage accounts; a minimal sketch of the same idea follows.
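The hedged sketch below reads current usage against quota for compute and network resources in one region using the Azure SDK for Python; the subscription ID and region are placeholders, and property names can vary slightly by SDK version.

```python
# Minimal sketch: report current usage vs. quota for compute and network
# resources in one region. Property names may vary slightly by SDK version.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.network import NetworkManagementClient

subscription_id = "<subscription-id>"          # placeholders
location = "westus"

credential = DefaultAzureCredential()
compute = ComputeManagementClient(credential, subscription_id)
network = NetworkManagementClient(credential, subscription_id)

print("Compute quotas:")
for usage in compute.usage.list(location):
    print(f"  {usage.name.localized_value:<40} {usage.current_value}/{usage.limit}")

print("Network quotas:")
for usage in network.usages.list(location):
    print(f"  {usage.name.localized_value:<40} {usage.current_value}/{usage.limit}")
```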

Tagging and resource naming


Organizing cloud-based assets in ways that aid operational management and support accounting requirements is
a common challenge that faces large cloud adoption efforts. By applying well-defined naming and metadata
tagging conventions to cloud-hosted resources, IT staff can quickly find and manage resources. Well-defined
names and tags also help to align cloud usage costs with business teams by using chargeback and showback
accounting mechanisms.
Naming rules and restrictions for Azure resources guidance provides general recommendations and platform
limitations.
Recommended naming and tagging conventions
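As a complement to the linked guidance, the sketch below audits resources in a subscription for a required set of tags so that chargeback and showback reports stay accurate; the required tag names are hypothetical examples of a tagging convention.

```python
# Minimal sketch: find resources that are missing required tags.
# The required tag names are hypothetical examples of a tagging convention.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

subscription_id = "<subscription-id>"          # placeholder
REQUIRED_TAGS = {"owner", "cost-center", "environment"}

resources = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

for resource in resources.resources.list():
    tags = set((resource.tags or {}).keys())
    missing = REQUIRED_TAGS - tags
    if missing:
        print(f"{resource.type:<50} {resource.name:<30} missing: {', '.join(sorted(missing))}")
```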

Workload isolation
The term workload is typically defined as an arbitrary unit of functionality, such as an application or service. In the
cloud, a workload encompasses not only the code artifacts but also the cloud resources. This enables you to define a workload in terms of code artifacts and the necessary cloud resources, which further enables you to isolate workloads. You can isolate workloads by the way resources are organized, by network topology, or by other attributes. A basic workload is typically defined as a single web application or a virtual network (VNet) with a virtual machine (VM).
The goal of workload isolation is to associate a workload's specific resources to a team, so that the team can
independently manage all aspects of those resources. This isolation enables DevOps to perform continuous
integration and continuous delivery (CI/CD).
Deploy a basic workload in Azure

Orchestration system
Because of their small size and application orientation, containers are well suited for agile delivery environments
and microservice-based architectures. The task of automating and managing a large number of containers and
how they interact is known as orchestration. While there are multiple container orchestrators available, it's critical
to select the right container orchestrator based on your workload requirements, the team's skill set, and the timeline to ramp up and support the production workload. One of the most popular container orchestration systems is Kubernetes, and Azure provides managed Kubernetes clusters through Azure Kubernetes Service (AKS).

Containerization
Containers make it easy for you to continuously build and deploy applications. By orchestrating the deployment of
those containers using Azure Kubernetes Service, you can achieve replicable, manageable clusters of containers.
By setting up a continuous build to produce your container images and orchestration, Azure DevOps increases the
speed and reliability of your deployment.
CI/CD for Containers
Development considerations when enabling DevOps

Production-like environment for dev/test


It's paramount to maintain a production-like environment, especially when you run into issues in production and want to release hotfixes or updates quickly. If you follow some of the best practices mentioned in the application design section, such as Infrastructure as Code (IaC) and a CI/CD process, you can easily and swiftly spin up or tear down your environment as needed.
With Resource Manager templates or Terraform, you can use Azure DevOps Services to provision a production-like environment in minutes to mimic real-life, peak-usage scenarios. This allows you to save cost and provision a load testing environment only when needed.

Application instrumentation for insight


Insights provide a customized monitoring experience for particular applications and services. Application Insights, a
feature of Azure Monitor, is an extensible Application Performance Management (APM) service for web developers
on multiple platforms. Use it to monitor your live web application. It will automatically detect performance
anomalies.
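For example, a Python service can send its logs to Application Insights with only a few lines. The sketch below uses the OpenCensus Azure exporter; the connection string is a placeholder, and the custom dimensions are example fields.

```python
# Minimal sketch: send application logs to Application Insights (Azure Monitor)
# using the OpenCensus Azure exporter. The connection string is a placeholder.
import logging
from opencensus.ext.azure.log_exporter import AzureLogHandler

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logger.addHandler(
    AzureLogHandler(connection_string="InstrumentationKey=<your-ikey>")
)

# These records show up as traces in Application Insights and can be correlated
# with the performance anomalies that the service detects automatically.
logger.info("order submitted", extra={"custom_dimensions": {"order_id": "12345"}})
logger.warning("payment retry", extra={"custom_dimensions": {"attempt": 2}})
```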

Technical debt
Technical debt includes anything the team must do to deploy production quality code and keep it running in
production. Examples are bugs, performance issues, operational issues, not having unit tests (necessary for
refactoring code), accessibility, and others. SonarQube is a set of static analyzers that can be used to identify areas
of improvement in your code. It allows you to analyze the technical debt in your project and keep track of it in the
future. For more info, see Using SonarQube with Azure DevOps.

Continuous deployment / continuous integration


Use feature toggles and canary releases. Utilize App Service deployment slots to safely deploy applications.
Continuous Integration (CI) is a development practice that requires developers to integrate code into a shared
repository several times a day. Each check-in is then verified by an automated build, allowing teams to detect
problems early. The key details to note are that you need to run code integration multiple times a day, every day,
and you need to run the automated verification of the integration. What's the motivation for this? Well, in the
development process, the earlier we surface errors, the better. And one source of frequently occurring errors is the
code integration step.
Employ Azure DevOps Services continuous integration to build, test, and deploy applications quickly.

Feature flags
Using feature flags is a technique that will help you integrate code into a shared repository at least once a day and
ship it, even if you haven't finished the feature yet. You'll be able to deploy at any time but defer the decision to
release for another day.
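A feature flag can be as simple as a configuration lookup that gates the new code path, as in the hedged sketch below; richer setups typically move the flag store into a central service such as Azure App Configuration. The flag name and environment-variable convention are assumptions for this example.

```python
# Minimal feature-flag sketch: ship unfinished code dark behind a toggle.
# Reading flags from environment variables is the simplest possible store;
# a real system would typically use a central flag service instead.
import os

def is_enabled(flag_name: str) -> bool:
    """A flag is on when FEATURE_<NAME>=true is set for the deployment."""
    return os.getenv(f"FEATURE_{flag_name.upper()}", "false").lower() == "true"

def checkout(cart: dict) -> str:
    if is_enabled("new_checkout"):
        return "processed by the new (still-in-progress) checkout flow"
    return "processed by the existing checkout flow"

print(checkout({"items": 3}))   # behavior flips per environment, not per deployment
```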
Deployment considerations for DevOps

As you provision and update Azure resources, application code, and configuration settings, a repeatable and
predictable process will help you avoid errors and downtime. We recommend automated processes for
deployment that you can run on demand and rerun if something fails. After your deployment processes are
running smoothly, process documentation can keep them that way.

Automation
To activate resources on demand, deploy solutions rapidly, minimize human error, and produce consistent and
repeatable results, be sure to automate deployments and updates.
Automate as many processes as possible
The most reliable deployment processes are automated and idempotent — that is, repeatable to produce the same
results.
To automate provisioning of Azure resources, you can use Terraform, Ansible, Chef, Puppet, Azure
PowerShell, Azure CLI, or Azure Resource Manager templates.
To configure VMs, you can use cloud-init (for Linux VMs) or Azure Automation State Configuration (DSC).
To automate application deployment, you can use Azure DevOps Services, Jenkins, or other CI/CD solutions.
As a best practice, create a repository of categorized automation scripts for quick access, documented with
explanations of parameters and examples of script use. Keep this documentation in sync with your Azure
deployments, and designate a primary person to manage the repository.
Automation scripts can also activate resources on demand for disaster recovery.
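To illustrate the idempotency principle described above, the following Python sketch shows the "ensure desired state" pattern in miniature. The in-memory dictionary stands in for a cloud API, and the resource name and fields are illustrative assumptions, not real Azure SDK calls:

```python
# A minimal sketch of the idempotent "ensure" pattern: create the resource if it is
# missing, converge it if it has drifted, and do nothing if it is already correct.
# The dict below stands in for a cloud API; names and fields are illustrative only.
from typing import Dict

existing_resources: Dict[str, dict] = {}


def ensure_resource(name: str, desired: dict) -> dict:
    """Running this any number of times produces the same end state."""
    current = existing_resources.get(name)
    if current == desired:
        return current                       # already in the desired state: no-op
    existing_resources[name] = dict(desired)  # create or converge
    return existing_resources[name]


if __name__ == "__main__":
    spec = {"size": "Standard_B2s", "location": "westeurope"}
    ensure_resource("build-agent-01", spec)
    ensure_resource("build-agent-01", spec)  # second run changes nothing
    print(existing_resources)
```

Whether you express this pattern in Terraform, ARM templates, or scripts, the key property is the same: running the automation repeatedly converges on, and then preserves, the desired state.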
Use code to provision and configure infrastructure
This practice, called infrastructure as code, may use a declarative approach or an imperative approach (or a
combination of both).
Resource Manager templates are an example of a declarative approach.
PowerShell scripts are an example of an imperative approach.
Practice immutable infrastructure
In other words, don't modify infrastructure after it's deployed to production. After improvised changes have been
applied, you might not know exactly what has changed, so it can be difficult to troubleshoot the system.
Automate and test deployment and maintenance tasks
Distributed applications consist of multiple parts that must work together. Deployment should take advantage of
proven mechanisms, such as scripts, that can update and validate configuration and automate the deployment
process. Test all processes fully to ensure that errors don't cause additional downtime.
Implement deployment security measures
All deployment tools must incorporate security restrictions to protect the deployed application. Define and enforce
deployment policies carefully, and minimize the need for human intervention.

Release process
One of the challenges with automating deployment is the cut-over itself, taking software from the final stage of
testing to live production. You usually need to do this quickly in order to minimize downtime. The blue-green
deployment approach does this by ensuring you have two production environments, as identical as possible.

Document release process


Without detailed release process documentation, an operator might deploy a bad update or might improperly
configure settings for your application. Clearly define and document your release process, and ensure that it's
available to the entire operations team.

Stage your workloads


Deploying to successive stages, and running tests and validations at each stage before moving on to the next, helps ensure a friction-free production deployment.
With good use of staging and production environments, you can push updates to the production environment in a
highly controlled way and minimize disruption from unanticipated deployment issues.
Blue-green deployment involves deploying an update into a production environment that's separate from the
live application. After you validate the deployment, switch the traffic routing to the updated version. One way to
do this is to use the staging slots available in Azure App Service to stage a deployment before moving it to
production.
Canary releases are similar to blue-green deployments. Instead of switching all traffic to the updated application, you route only a small portion of the traffic to the new deployment. If there's a problem, revert to the old deployment. If not, gradually route more traffic to the new version. If you're using Azure App Service, you can use the Testing in production feature to manage a canary release. A conceptual sketch of this traffic-splitting idea follows this list.
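Conceptually, a canary release is a weighted traffic split that you widen as confidence grows. The Python sketch below illustrates the idea with placeholder endpoints; on Azure App Service the platform performs this routing for you through slot traffic percentages, so treat this purely as an illustration:

```python
# Conceptual sketch of canary traffic splitting: route a small, adjustable share of
# requests to the new deployment. The endpoints and percentage are placeholders.
import random

CANARY_PERCENT = 10  # start small, increase as confidence grows

PRODUCTION_URL = "https://app-prod.example.com"   # hypothetical endpoints
CANARY_URL = "https://app-canary.example.com"


def pick_backend() -> str:
    """Send roughly CANARY_PERCENT of requests to the canary deployment."""
    return CANARY_URL if random.uniform(0, 100) < CANARY_PERCENT else PRODUCTION_URL


if __name__ == "__main__":
    sample = [pick_backend() for _ in range(1000)]
    print(f"canary share: {sample.count(CANARY_URL) / 10:.1f}%")
```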

Logging and auditing


To capture as much version-specific information as possible, implement a robust logging strategy. If you use
staged deployment techniques, more than one version of your application will be running in production. If a
problem occurs, determine which version is causing it.

Rollback plan
Use App Service deployment slots to fall back on a last-known good version.
The most important step is to implement an architecture that supports the need to roll back. For instance, componentized, service-based architectures lend themselves well to this. Persistent message queues and asynchronous services allow you to bring components down for rollback without affecting the main user base. Work towards something like the blue-green release pattern, so that your application can stay available while you are working on one half of the system.
If a deployment fails, your application could become unavailable. To minimize downtime, design a rollback process
to go back to a last-known good version. Include a strategy to roll back changes to databases and any other
services your app depends on.
If you're using Azure App Service, you can set up a last-known good site slot and use it to roll back from a web or
API app deployment.

High availability considerations


An application that depends on a single instance of a service creates a single point of failure. To improve resiliency
and scalability, provision multiple instances.
For Azure App Service, select an App Service plan that offers multiple instances.
For Azure Cloud Services, configure each of your roles to use multiple instances.
For Azure Virtual Machines, ensure that your architecture includes more than one VM and that each VM is
included in an availability set.
Consider deploying across multiple regions
We recommend deploying all but the least critical applications and application services across multiple regions. If
your application is deployed to a single region, in the rare event that the entire region becomes unavailable, the
application will also be unavailable. If you choose to deploy to a single region, consider preparing to redeploy to a
secondary region as a response to an unexpected failure.
Redeploy to a secondary region
If you run applications and databases in a single, primary region with no replication, your recovery strategy might
be to redeploy to another region. This solution is affordable but most appropriate for non-critical applications that
can tolerate longer recovery times. If you choose this strategy, automate the redeployment process as much as
possible and include redeployment scenarios in your disaster response testing.
To automate your redeployment process, consider using Azure Site Recovery.
Monitoring for DevOps

What you cannot see, you cannot measure. What you cannot measure, you cannot improve. This classic
management axiom is true in the cloud as well. Traditional application and infrastructure monitoring is based on
whether the application is running or not, or what response time it is giving. However, cloud-based monitoring
offer many more opportunities that you should be leveraging in order to give your users the best experience.

Application Monitoring
Application Insights is the Azure service that not only lets you verify that your application is running correctly, but also makes application troubleshooting easier and can collect custom business telemetry that tells you whether your application is being used as intended.
Make sure you leverage all the rich information that Application Insights can provide about your application. This
list is not exhaustive, but here you can find some of the visibility that Application Insights can give you:
Application Insights offers you a default dashboard with an educated guess of the most important metrics you
will be interested in. You can then modify it and customize it to your own needs.
By instrumenting your application correctly, you get performance statistics from both a client and a server perspective.
The Application Map shows you application dependencies on other services, such as backend APIs or databases, allowing you to determine visually where performance problems lie.
Smart Detection warns you when anomalies in performance or utilization patterns occur.
Usage Analysis gives you telemetry on which features of your application are used most frequently, and whether all of your application's functionality is being used. This feature is especially useful after changes to the application functionality, to verify whether those changes were successful.
Release annotations are visual indicators in your Application Insights charts of new builds and other events, so that you can visually correlate changes in application performance to code releases and quickly pinpoint performance problems.
Cross-component transaction diagnostics let you follow failed transactions to find the point in the architecture where the fault originated.
Snapshot Debugger automatically collects a snapshot of a live application when an exception occurs, so you can analyze it at a later stage.
To use Application Insights you have two options: you can use codeless monitoring, where onboarding your app to Application Insights does not require any code change, or code-based monitoring, where you instrument your code to send telemetry to Application Insights using the Software Development Kit for your programming language of choice.
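As one example of code-based monitoring, the Azure Monitor OpenTelemetry distro for Python can send traces to Application Insights. The snippet below is a minimal sketch and an assumption rather than the only SDK option; the connection string, span name, and attribute are placeholders, and other languages follow the same pattern with their own SDKs:

```python
# One possible code-based instrumentation path: the azure-monitor-opentelemetry
# distro exports OpenTelemetry traces to Application Insights.
# The connection string below is a placeholder.
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

configure_azure_monitor(
    connection_string="InstrumentationKey=00000000-0000-0000-0000-000000000000"
)

tracer = trace.get_tracer(__name__)


def handle_order(order_id: str) -> None:
    # Each span appears in Application Insights as a request/dependency record,
    # providing the server-side performance view described above.
    with tracer.start_as_current_span("handle_order") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic ...


if __name__ == "__main__":
    handle_order("12345")
```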
You can certainly use other Application Performance Management tools to monitor your application on Azure, such
as NewRelic or AppDynamics, but Application Insights will give you the most seamless and integrated experience.

Platform Monitoring
Application Insights is actually one of the components of Azure Monitor, which gives you rich metrics and logs to
verify the state of your complete Azure landscape. No matter whether your application is running on Virtual
Machines, App Services, or Kubernetes, Azure Monitor will help you to follow the state of your infrastructure, and
to react promptly if there are any issues.
Make sure to monitor not only the compute elements supporting your application code, but your data platform as well: databases, storage accounts, and data lakes should be closely monitored, since poor performance in the data tier of an application can have serious consequences.
Container Insights
Should your application run on Azure Kubernetes Service, Azure Monitor allows you to easily monitor the state of
your cluster, nodes, and pods. Easy to configure for AKS clusters, Container Insights delivers quick, visual, and
actionable information: from the CPU and memory pressure of your nodes to the logs of individual Kubernetes
pods.
Additionally, for operators that prefer using the open-source Kubernetes monitoring tool Prometheus but still like
the ease of use of Azure Monitor Container Insights, both solutions can integrate with each other.

The Sidecar Pattern adds a separate container with responsibilities that are required by the main container. A
common use case is for running logging utilities and monitoring agents.
Network monitoring
Regardless of the form factor or programming language your application is based on, the network connecting your code to your users can make or break the experience that your application provides. As a consequence, monitoring and troubleshooting the network can be decisive for an operations team. The component of Azure Monitor that covers the network is Network Watcher, a collection of network monitoring and troubleshooting tools. Some of these tools are:
Traffic Analytics gives you an overview of the traffic in your virtual networks, as well as the percentage coming from malicious IP addresses, leveraging Microsoft Threat Intelligence databases. This tool also shows you the systems in your virtual networks that generate the most traffic, so you can visually identify bottlenecks before they degenerate into problems.
Network Performance Manager can generate synthetic traffic to measure the performance of network
connections over multiple links, giving you a perspective on the evolution of WAN and Internet connections
over time, as well as offering valuable monitoring information about Microsoft ExpressRoute circuits.
VPN diagnostics can help troubleshoot site-to-site VPN connections that connect your applications to users on-premises.
Connection Monitor allows you to measure the network availability between sets of endpoints.
Other information sources
Your application components are not the only things producing data; there are many other signals you need to track to operate a cloud environment effectively:
Activity Log: an audit trail that lets you see every change that has gone through the Azure APIs. It can be critical for understanding sudden performance changes or problems that might be due to a misconfiguration of the Azure platform.
Azure Service Health: sometimes outages are provoked not by configuration changes, but by glitches in the Azure platform itself. You can find information about any Azure-related problem impacting your application in the Azure Service Health logs.
Azure Advisor: here you can find recommendations about how to optimize your Azure platform to reduce costs, improve your security posture, or increase the availability of your environment.
Azure Security Center: not a focus of this pillar, but included for completeness. Azure Security Center can help you understand whether your Azure resources are configured according to security best practices.

Monitoring best practices


Event correlation
One critical advantage of Azure Monitor is that it is the monitoring tool for the whole Azure platform. As the
previous sections have shown, Azure Monitor holds metrics and logs relevant to your application code, the
platform where it is running, the data components, as well as the network connecting the application to its users.
This enables operators to compare metrics of different application components to each other, and find out
dependencies that might have been hidden otherwise.
Dashboards in Azure offer a great way of exposing the rich information contained in Azure Monitor to other users.
Make sure to create shared dashboards in order to expose relevant information to the different groups involved in
operating your application, including Developers and Operators. If more complex visualizations are required, Azure
Monitor data can be exported to Power BI for advanced data analysis.
Notifications
Whether it is for application, network, or platform monitoring, you should not expect operators to constantly watch dashboards. Instead, use alerts to send proactive notifications to the relevant individuals who will act on them. Action groups in Azure Monitor can be used to notify multiple recipients, to trigger automated actions, or even to automatically open tickets in IT Service Management tools such as ServiceNow.
Automation around alerts is critical because of the highly collaborative nature of DevOps and the speed needed for effective incident management. A report from DevOps.com found that 80% of IT teams are alerted to critical incidents via email. Email is an effective form of communication, but it shouldn't be the most common notification method for a critical issue. Instead, if you can define actions to be executed when certain alerts are received (such as scaling up or down), your system becomes self-healing.
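As a sketch of what alert-driven automation can look like, the handler below inspects an alert payload posted by an action group webhook and decides whether to page someone or trigger a remediation. The field paths follow our reading of the Azure Monitor common alert schema, and the remediation itself is a placeholder, so treat the details as assumptions:

```python
# Sketch of alert-driven automation: an action group posts the alert to a webhook,
# and this handler decides what to do instead of relying on e-mail alone.
# Field names follow the common alert schema as we understand it; the actions are placeholders.
import json


def handle_alert(payload: str) -> str:
    alert = json.loads(payload)
    essentials = alert.get("data", {}).get("essentials", {})
    rule = essentials.get("alertRule", "unknown")
    severity = essentials.get("severity", "Sev4")
    condition = essentials.get("monitorCondition", "Fired")

    if condition == "Resolved":
        return f"{rule}: resolved, nothing to do"
    if severity in ("Sev0", "Sev1"):
        return f"{rule}: page on-call and trigger automated scale-out"  # self-healing path
    return f"{rule}: logged for the next review"


if __name__ == "__main__":
    sample = json.dumps({"data": {"essentials": {
        "alertRule": "cpu-high", "severity": "Sev1", "monitorCondition": "Fired"}}})
    print(handle_alert(sample))
```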
Other monitoring tasks
Beyond Azure Monitor, you will want to keep an eye on certain events to make sure that your application is
running smoothly:
Review Azure subscription limits for your resources, and make sure you are not approaching them.
Understand Azure support plans. Refer to the Azure support FAQs and familiarize your team with Azure support.
Make sure that you monitor the expiration dates of digital certificates (a small standalone check is sketched after this list), or even better, configure automatic certificate renewal with Azure Key Vault.
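A certificate-expiry check, for instance, can be a few lines of scheduled code. The sketch below uses only the Python standard library; the hostname and warning threshold are illustrative:

```python
# Small sketch of a certificate-expiry check: connect to an endpoint, read the
# certificate, and warn when expiry is near. Hostname and threshold are placeholders.
import socket
import ssl
from datetime import datetime, timezone

HOST, PORT, WARN_DAYS = "www.example.com", 443, 30


def days_until_expiry(host: str, port: int) -> int:
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    return (not_after.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)).days


if __name__ == "__main__":
    remaining = days_until_expiry(HOST, PORT)
    status = "WARN" if remaining < WARN_DAYS else "OK"
    print(f"{status}: certificate for {HOST} expires in {remaining} days")
```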

Summary
You can use any monitoring platform to manage your Azure resources. Microsoft's first-party offering is Azure Monitor, a comprehensive solution for metrics and logs from the infrastructure to the application code, including the ability to trigger alerts and automated actions, as well as data visualization.
Performance considerations for your deployment
infrastructure

Build status shows if your product is in a deployable state, so builds are the heartbeat of your continuous delivery
system. It's important to have a build process up and running from the first day of your product development.
Since builds provide such crucial information about the status of your product, you should always strive for fast builds.
It's hard to fix a build problem if the build takes a long time, and the team can start to suffer from broken-window syndrome: eventually nobody cares that the build is broken, because it's always broken and it takes a lot of effort to fix.

Build times
Here are a few ways you can achieve faster builds.
Selecting right-sized VMs: Speeding up your builds starts with selecting the right size of VM. Fast machines can make the difference between hours and minutes. If your pipelines are in Azure Pipelines, you have a convenient option to run your jobs using a Microsoft-hosted agent. With Microsoft-hosted agents, maintenance and upgrades are taken care of for you. For more info, see Microsoft-hosted agents.
Build server location: When you're building your code, a lot of bits move across the wire. Inputs to the builds, such as source code and NuGet packages, are fetched from a source control repository and an artifact repository. At the end, the output from the build process needs to be copied, not only the compiled artifacts, but also test reports, code coverage results, and debug symbols. It is therefore important that these copy actions are fast. If you are using your own build server, ensure that it is located near the sources and the target location, which can reduce the duration of your build considerably.
Scaling out build servers: A single build server may be sufficient for a small product, but as the size and scope of the product and the number of teams working on it increase, a single server may not be enough. Scale your infrastructure horizontally over multiple machines when you reach that limit. For more info, see how you can leverage Azure DevOps Agent Pools.
Optimizing the build:
Add parallel build execution to speed up the build process. For more info, see Azure DevOps parallel jobs.
Enable parallel execution of test suites, which is often a huge time saver, especially when executing integration and UI tests. For more info, see Run tests in parallel using Azure Pipeline.
Use the notion of a multiplier, where you can scale out your builds over multiple build agents. For more info, see Organizing Azure Pipeline into Jobs.
Move part of the test feedback loop to the release pipeline. This improves the build speed, and hence the speed of the build feedback loop.
Publish the build artifacts to a package management solution, such as a NuGet feed, at the end of the build.

Human intervention
It's important to select different builds for different purposes.
CI builds: The purpose of this build is to ensure the code compiles and the unit tests run. It gets triggered on each commit, or on a set of commits over a period of time. It serves as the heartbeat of the project and provides immediate quality feedback to the team. For more info, see CI triggers or Batching CI builds.
Nightly build: The purpose of this build is not only to compile the code but also to ensure that the necessary integration and regression tests are run. This build can take more time, because it includes extra steps that gather additional information about the product, for example metrics about the state of the software using SonarQube. It may also contain a set of regression and integration tests, and it may deploy the solution to a temporary machine to verify that the solution continues to work. For more info, see scheduling builds using cron syntax.
Release build: Besides compiling and running tests, this build additionally compiles the API documentation, produces compliance reports, signs the code, and performs other steps that are not required every time the code is built. Finally, this build provides the golden copy that is pushed to the release pipeline to deploy into the production environment. Generally, a release build is triggered manually instead of by a CI trigger.
Testing your Application and Azure Environment

Testing is one of the fundamental components of DevOps and agile development in general. If automation gives DevOps the required speed and agility to deploy software quickly, only through extensive testing will those deployments achieve the reliability that customers demand.
A main tenet of a DevOps practice to achieve system reliability is the "Shift Left" principle. If developing and deploying an application is a process depicted as a series of steps going from left to right, testing should not be performed only at the very end of the process (at the right). It should be shifted as much toward the beginning (to the left) as possible, because errors are cheaper to repair when caught early and can be expensive or impossible to fix later in the application life cycle.
Another aspect to consider is that testing should occur on both application code as well as infrastructure code and
they should be subject to the same quality controls. As described in Infrastructure as Code, the environment where
applications are running should be version-controlled and deployed through the same mechanisms as application
code, and hence can be tested and validated using DevOps testing paradigms too.
You can use your favorite testing tool to run your tests, including Azure Pipelines for automated testing and Azure
Testing Plans for manual testing.
There are multiple stages at which tests can be performed in the life cycle of code, and each of them has particularities that are important to understand. In this guide, you can find a summary of the different tests that you should consider while developing and deploying applications.

Automated Testing
Automating tests is the best way to make sure that they are executed. Depending on how frequently tests are
performed, they are typically limited in duration and scope, as the different types of automated tests will show:
Unit Testing
Unit tests are typically run on each new version of code committed into version control. Unit tests should be extensive (ideally covering close to 100% of the code) and quick (typically under 30 seconds, although this number is not a rule set in stone). Unit testing can verify things like the syntactic correctness of application code, Resource Manager templates, or Terraform configurations, that the code follows best practices, or that it produces the expected results when provided certain inputs.
Unit tests should be applied both to application code and infrastructure code.
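As an example of a unit test applied to infrastructure code, the pytest sketch below statically checks an ARM template. The file path and the expected resource type are assumptions, but the shape of the test is the point: fast, local, and run on every commit.

```python
# Minimal sketch of a unit test for infrastructure code: verify that an ARM template
# (hypothetical path) is valid JSON and declares the resources the environment needs.
import json
from pathlib import Path

TEMPLATE_PATH = Path("infrastructure/azuredeploy.json")  # hypothetical location


def load_template() -> dict:
    return json.loads(TEMPLATE_PATH.read_text())


def test_template_is_valid_json():
    template = load_template()
    assert "$schema" in template and "resources" in template


def test_template_declares_a_web_app():
    resource_types = {r["type"] for r in load_template()["resources"]}
    assert "Microsoft.Web/sites" in resource_types
```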
Smoke Testing
Smoke tests are more exhaustive than unit tests, but still not as much as integration tests. They normally run in less than 15 minutes. While they do not yet verify the interoperability of the different components with each other, smoke tests verify that each component can be correctly built and offers the expected functionality and performance.
Smoke tests usually involve building the application code and, if infrastructure code is involved, possibly testing the deployment in a test environment.
Integration Testing
After making sure that the different application components operate correctly individually, integration testing has the goal of determining whether they can interact with each other as they should. Integration tests usually take longer than smoke tests, and as a consequence they are sometimes executed less frequently. For example, running integration tests every night still offers a good compromise, detecting interoperability issues between application components no later than one day after they were introduced.

Application Manual Testing


Manual testing is much more expensive than automated testing, and as a consequence it is run much less
frequently. However, manual testing is fundamental for the correct functioning of the DevOps feedback loop, to
correct errors before they become too expensive to repair, or cause customer dissatisfaction.
Acceptance Testing
There are many different ways of confirming that the application is doing what it should.
Blue/Green deployments: when deploying a new application version, you can deploy it in parallel to the existing one. This way you can start redirecting clients to the new version, and if everything goes well, you decommission the old version. If there is any problem with the new deployment, you can always redirect users back to the old one.
Canary releases: you can expose new functionality of your application (ideally using feature flags) to a select group of users. If those users are satisfied with the new functionality, you can extend it to the rest of the user community. In this case we are talking about releasing functionality, and not necessarily about deploying a new version of the application.
A/B testing: A/B testing is similar to canary release testing, but while canary releases focus on mitigating risk, A/B testing focuses on evaluating the effectiveness of two alternative ways of achieving the same goal. For example, if you have two versions of the layout of a certain area of your application, you could send half of your users to one and the other half to the other, and use metrics to see which layout works better for the application's goals. A minimal sketch of deterministic user assignment follows this list.
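The sketch below shows deterministic user assignment for an A/B test: each user is hashed into variant A or B so the experience is stable across sessions and metrics can be compared per variant. The experiment name and the 50/50 split are illustrative assumptions:

```python
# Deterministic A/B assignment: hash user + experiment so a user always sees the
# same variant. Variant names and split are placeholders.
import hashlib


def assign_variant(user_id: str, experiment: str = "checkout-layout") -> str:
    """Stable assignment: the same user always lands in the same bucket."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


if __name__ == "__main__":
    for user in ("alice", "bob", "carol"):
        print(user, "->", assign_variant(user))
```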
An important aspect to consider is how to measure the effectiveness of new features in the application. One way to do that is through Application Insights User Behavior Analytics, with which you can determine how people are using your application. This way you can decide whether a new feature has improved your application without side effects such as decreased usability.
Certain services in Azure offer functionality that can help with this kind of testing, such as the slot functionality in Azure App Service, which allows two different versions of the same application to run at the same time and redirects part of the users to one or the other.
Stress tests
As other sections of this framework have explained, designing your application code and infrastructure for
scalability is of paramount importance. Testing that you can increase the application load and that both the code
and the infrastructure will react to it is critical, so that your environment will adapt to changing load conditions.
During these stress tests, it is critical to monitor all the components of the system to identify whether there are any bottlenecks. Every component of the system that is not able to scale out can turn into a scale limitation, such as active/passive network components or databases. It is therefore important to know their limits so that you can mitigate their impact on the application's scale. This exercise might drive you to exchange some of those components for more scalable counterparts.
It is equally important to verify that after the stress test has concluded, the infrastructure scales back down to its normal condition in order to keep costs under control.
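A deliberately small load-generation sketch follows. Real stress tests use dedicated tooling, but even a script like this, pointed at a test endpoint, lets you watch how each component behaves in Azure Monitor while the load ramps up. The URL, request count, and concurrency are assumptions:

```python
# Tiny load-generation sketch: drive concurrent requests at a test endpoint and
# report latency while you watch each component for bottlenecks. Placeholders only.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET_URL = "https://app-test.example.com/health"  # hypothetical test endpoint
REQUESTS, CONCURRENCY = 200, 20


def timed_request(_: int) -> float:
    start = time.perf_counter()
    with urllib.request.urlopen(TARGET_URL, timeout=10) as response:
        response.read()
    return time.perf_counter() - start


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        latencies = sorted(pool.map(timed_request, range(REQUESTS)))
    p95 = latencies[int(len(latencies) * 0.95) - 1]
    print(f"median {latencies[len(latencies) // 2]:.3f}s, p95 {p95:.3f}s")
```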
Business Continuity Drills
Certain infrastructure test scenarios can be considered under the category of acceptance testing, such as Business
Continuity drills. In particular, Disaster Recovery scenarios are difficult to test on-premises, but the public cloud makes this kind of testing easier. Tools such as Azure Site Recovery make it possible to start an isolated copy of the primary location in a secondary environment, so that you can verify that the applications have come up as they should.
If there is any problem, the Disaster Recovery procedure can be optimized, and afterwards the infrastructure in the secondary environment can be deleted.
Exploratory Testing
Experts explore the application in its entirety trying to find faults or suboptimal implementations of functionality.
These experts could be developers, UX specialists, product owners, actual users, and other profiles. Test plans are
typically not used, since testing is left to the ability of the individual tester.
Fault injection
The same concept can be applied to the infrastructure. If the application should be resilient to infrastructure
failures, introducing faults in the underlying infrastructure and observing how the application behaves is
fundamental for increasing the trust in your redundancy mechanisms. Shutting down infrastructure components ungracefully, degrading the performance of certain elements such as network equipment, or purposely introducing faults into the environment are ways of verifying that the application is going to react as expected when these situations occur in real life.
Most companies use a controlled way of injecting faults into the system, although, if confident in the application's resiliency, automated frameworks can be used. A whole discipline has developed around fault injection, called chaos engineering.
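In application code, the idea can be illustrated with a toy fault-injection wrapper like the one below. Real chaos experiments target the infrastructure in a controlled way, so treat this only as a sketch of the concept, with the failure rate and dependency call as placeholders:

```python
# Toy fault-injection helper: fail a configurable share of calls to a dependency
# and observe how the caller reacts. Purely illustrative; not a chaos framework.
import functools
import random


def inject_faults(failure_rate: float = 0.2):
    """Decorator that raises an error for roughly `failure_rate` of the calls."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise ConnectionError(f"injected fault calling {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(failure_rate=0.3)
def call_backend() -> str:
    return "ok"


if __name__ == "__main__":
    results = []
    for _ in range(10):
        try:
            results.append(call_backend())
        except ConnectionError as error:
            results.append(str(error))  # the caller's retry/fallback logic would go here
    print(results)
```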

Summary
In order to deploy software quickly and reliably, testing is a fundamental component of the development and deployment life cycle. Not only should application code be tested; infrastructure automation and resiliency should equally be put to the test, to make sure that the application is going to perform as expected in every situation.
DevOps Checklist

DevOps is the integration of development, quality assurance, and IT operations into a unified culture and set of
processes for delivering software. Use this checklist as a starting point to assess your DevOps culture and process.

Culture
Ensure business alignment across organizations and teams. Conflicts over resources, purpose, goals, and
priorities within an organization can be a risk to successful operations. Ensure that the business, development, and
operations teams are all aligned.
Ensure the entire team understands the software lifecycle. Your team needs to understand the overall
lifecycle of the application, and which part of the lifecycle the application is currently in. This helps all team
members know what they should be doing now, and what they should be planning and preparing for in the future.
Reduce cycle time. Aim to minimize the time it takes to move from ideas to usable developed software. Limit the
size and scope of individual releases to keep the test burden low. Automate the build, test, configuration, and
deployment processes whenever possible. Clear any obstacles to communication among developers, and between
developers and operations.
Review and improve processes. Your processes and procedures, both automated and manual, are never final.
Set up regular reviews of current workflows, procedures, and documentation, with a goal of continual
improvement.
Do proactive planning. Proactively plan for failure. Have processes in place to quickly identify issues when they
occur, escalate to the correct team members to fix, and confirm resolution.
Learn from failures. Failures are inevitable, but it's important to learn from failures to avoid repeating them. If an
operational failure occurs, triage the issue, document the cause and solution, and share any lessons that were
learned. Whenever possible, update your build processes to automatically detect that kind of failure in the future.
Optimize for speed and collect data. Every planned improvement is a hypothesis. Work in the smallest
increments possible. Treat new ideas as experiments. Instrument the experiments so that you can collect production
data to assess their effectiveness. Be prepared to fail fast if the hypothesis is wrong.
Allow time for learning. Both failures and successes provide good opportunities for learning. Before moving on
to new projects, allow enough time to gather the important lessons, and make sure those lessons are absorbed by
your team. Also give the team the time to build skills, experiment, and learn about new tools and techniques.
Document operations. Document all tools, processes, and automated tasks with the same level of quality as your
product code. Document the current design and architecture of any systems you support, along with recovery
processes and other maintenance procedures. Focus on the steps you actually perform, not theoretically optimal
processes. Regularly review and update the documentation. For code, make sure that meaningful comments are
included, especially in public APIs, and use tools to automatically generate code documentation whenever possible.
Share knowledge. Documentation is only useful if people know that it exists and can find it. Ensure the
documentation is organized and easily discoverable. Be creative: Use brown bags (informal presentations), videos,
or newsletters to share knowledge.

Development
Provide developers with production-like environments. If development and test environments don't match
the production environment, it is hard to test and diagnose problems. Therefore, keep development and test
environments as close to the production environment as possible. Make sure that test data is consistent with the
data used in production, even if it's sample data and not real production data (for privacy or compliance reasons).
Plan to generate and anonymize sample test data.
Ensure that all authorized team members can provision infrastructure and deploy the application.
Setting up production-like resources and deploying the application should not involve complicated manual tasks or
detailed technical knowledge of the system. Anyone with the right permissions should be able to create or deploy
production-like resources without going to the operations team.

This recommendation doesn't imply that anyone can push live updates to the production deployment. It's about
reducing friction for the development and QA teams to create production-like environments.

Instrument the application for insight. To understand the health of your application, you need to know how it's
performing and whether it's experiencing any errors or problems. Always include instrumentation as a design
requirement, and build the instrumentation into the application from the start. Instrumentation must include event
logging for root cause analysis, but also telemetry and metrics to monitor the overall health and usage of the
application.
Track your technical debt. In many projects, release schedules can get prioritized over code quality to one degree
or another. Always keep track when this occurs. Document any shortcuts or other suboptimal implementations, and
schedule time in the future to revisit these issues.
Consider pushing updates directly to production. To reduce the overall release cycle time, consider pushing
properly tested code commits directly to production. Use feature toggles to control which features are enabled. This
allows you to move from development to release quickly, using the toggles to enable or disable features. Toggles
are also useful when performing tests such as canary releases, where a particular feature is deployed to a subset of
the production environment.

Testing
Automate testing. Manually testing software is tedious and susceptible to error. Automate common testing tasks
and integrate the tests into your build processes. Automated testing ensures consistent test coverage and
reproducibility. Integrated UI tests should also be performed by an automated tool. Azure offers development and
test resources that can help you configure and execute testing. For more information, see Development and test.
Test for failures. If a system can't connect to a service, how does it respond? Can it recover once the service is
available again? Make fault injection testing a standard part of review on test and staging environments. When your
test process and practices are mature, consider running these tests in production.
Test in production. The release process doesn't end with deployment to production. Have tests in place to ensure
that deployed code works as expected. For deployments that are infrequently updated, schedule production testing
as a regular part of maintenance.
Automate performance testing to identify performance issues early. The impact of a serious performance
issue can be as severe as a bug in the code. While automated functional tests can prevent application bugs, they
might not detect performance problems. Define acceptable performance goals for metrics like latency, load times,
and resource usage. Include automated performance tests in your release pipeline, to make sure the application
meets those goals.
Perform capacity testing. An application might work fine under test conditions, and then have problems in
production due to scale or resource limitations. Always define the maximum expected capacity and usage limits.
Test to make sure the application can handle those limits, but also test what happens when those limits are
exceeded. Capacity testing should be performed at regular intervals.
After the initial release, you should run performance and capacity tests whenever updates are made to production
code. Use historical data to fine-tune tests and to determine what types of tests need to be performed.
Perform automated security penetration testing. Ensuring your application is secure is as important as
testing any other functionality. Make automated penetration testing a standard part of the build and deployment
process. Schedule regular security tests and vulnerability scanning on deployed applications, monitoring for open
ports, endpoints, and attacks. Automated testing does not remove the need for in-depth security reviews at regular
intervals.
Perform automated business continuity testing. Develop tests for large-scale business continuity, including
backup recovery and failover. Set up automated processes to perform these tests regularly.

Release
Automate deployments. Automate deploying the application to test, staging, and production environments.
Automation enables faster and more reliable deployments, and ensures consistent deployments to any supported
environment. It removes the risk of human error caused by manual deployments. It also makes it easy to schedule
releases for convenient times, to minimize any effects of potential downtime. Have systems in place to detect any
problems during rollout, and have an automated way to roll forward fixes or roll back changes.
Use continuous integration. Continuous integration (CI) is the practice of merging all developer code into a
central codebase on a regular schedule, and then automatically performing standard build and test processes. CI
ensures that an entire team can work on a codebase at the same time without having conflicts. It also ensures that
code defects are found as early as possible. Preferably, the CI process should run every time that code is committed
or checked in. At the very least, it should run once per day.

Consider adopting a trunk based development model. In this model, developers commit to a single branch (the
trunk). There is a requirement that commits never break the build. This model facilitates CI, because all feature
work is done in the trunk, and any merge conflicts are resolved when the commit happens.

Consider using continuous delivery. Continuous delivery (CD) is the practice of ensuring that code is always
ready to deploy, by automatically building, testing, and deploying code to production-like environments. Adding
continuous delivery to create a full CI/CD pipeline will help you detect code defects as soon as possible, and ensures
that properly tested updates can be released in a very short time.

Continuous deployment is an additional process that automatically takes any updates that have passed through
the CI/CD pipeline and deploys them into production. Continuous deployment requires robust automatic
testing and advanced process planning, and may not be appropriate for all teams.

Make small incremental changes. Large code changes have a greater potential to introduce bugs. Whenever
possible, keep changes small. This limits the potential effects of each change, and makes it easier to understand and
debug any issues.
Control exposure to changes. Make sure you're in control of when updates are visible to your end users.
Consider using feature toggles to control when features are enabled for end users.
Implement release management strategies to reduce deployment risk. Deploying an application update to
production always entails some risk. To minimize this risk, use strategies such as canary releases or blue-green
deployments to deploy updates to a subset of users. Confirm the update works as expected, and then roll the
update out to the rest of the system.
Document all changes. Minor updates and configuration changes can be a source of confusion and versioning
conflict. Always keep a clear record of any changes, no matter how small. Log everything that changes, including
patches applied, policy changes, and configuration changes. (Don't include sensitive data in these logs. For example,
log that a credential was updated, and who made the change, but don't record the updated credentials.) The record
of the changes should be visible to the entire team.
Consider making infrastructure immutable. Immutable infrastructure is the principle that you shouldn't
modify infrastructure after it's deployed to production. Otherwise, you can get into a state where ad hoc changes
have been applied, making it hard to know exactly what changed. Immutable infrastructure works by replacing
entire servers as part of any new deployment. This allows the code and the hosting environment to be tested and
deployed as a block. Once deployed, infrastructure components aren't modified until the next build and deploy
cycle.

Monitoring
Make systems observable. The operations team should always have clear visibility into the health and status of a
system or service. Set up external health endpoints to monitor status, and ensure that applications are coded to
instrument the operations metrics. Use a common and consistent schema that helps you correlate events across
systems. Azure Diagnostics and Application Insights are the standard method of tracking the health and status of
Azure resources. Microsoft Operations Management Suite also provides centralized monitoring and management
for cloud or hybrid solutions.
Aggregate and correlate logs and metrics. A properly instrumented telemetry system will provide a large
amount of raw performance data and event logs. Make sure that telemetry and log data is processed and correlated
in a short period of time, so that operations staff always have an up-to-date picture of system health. Organize and
display data in ways that give a cohesive view of any issues, so that whenever possible it's clear when events are
related to one another.

Consult your corporate retention policy for requirements on how data is processed and how long it should be
stored.

Implement automated alerts and notifications. Set up monitoring tools like Azure Monitor to detect patterns
or conditions that indicate potential or current issues, and send alerts to the team members who can address the
issues. Tune the alerts to avoid false positives.
Monitor assets and resources for expirations. Some resources and assets, such as certificates, expire after a
given amount of time. Make sure to track which assets expire, when they expire, and what services or features
depend on them. Use automated processes to monitor these assets. Notify the operations team before an asset
expires, and escalate if expiration threatens to disrupt the application.

Management
Automate operations tasks. Manually handling repetitive operations processes is error-prone. Automate these
tasks whenever possible to ensure consistent execution and quality. Code that implements the automation should
be versioned in source control. As with any other code, automation tools must be tested.
Take an infrastructure-as-code approach to provisioning. Minimize the amount of manual configuration
needed to provision resources. Instead, use scripts and Azure Resource Manager templates. Keep the scripts and
templates in source control, like any other code you maintain.
Consider using containers. Containers provide a standard package-based interface for deploying applications.
Using containers, an application is deployed using self-contained packages that include any software, dependencies,
and files needed to run the application, which greatly simplifies the deployment process.
Containers also create an abstraction layer between the application and the underlying operating system, which
provides consistency across environments. This abstraction can also isolate a container from other processes or
applications running on a host.
Implement resiliency and self-healing. Resiliency is the ability of an application to recover from failures.
Strategies for resiliency include retrying transient failures, and failing over to a secondary instance or even another
region. For more information, see Designing reliable Azure applications. Instrument your applications so that issues
are reported immediately and you can manage outages or other system failures.
Have an operations manual. An operations manual or runbook documents the procedures and management
information needed for operations staff to maintain a system. Also document any operations scenarios and
mitigation plans that might come into play during a failure or other disruption to your service. Create this
documentation during the development process, and keep it up to date afterwards. This is a living document, and
should be reviewed, tested, and improved regularly.
Shared documentation is critical. Encourage team members to contribute and share knowledge. The entire team
should have access to documents. Make it easy for anyone on the team to help keep documents updated.
Document on-call procedures. Make sure on-call duties, schedules, and procedures are documented and shared
to all team members. Keep this information up-to-date at all times.
Document escalation procedures for third-party dependencies. If your application depends on external
third-party services that you don't directly control, you must have a plan to deal with outages. Create
documentation for your planned mitigation processes. Include support contacts and escalation paths.
Use configuration management. Configuration changes should be planned, visible to operations, and recorded.
This could take the form of a configuration management database, or a configuration-as-code approach.
Configuration should be audited regularly to ensure that what's expected is actually in place.
Get an Azure support plan and understand the process. Azure offers a number of support plans. Determine
the right plan for your needs, and make sure the entire team knows how to use it. Team members should
understand the details of the plan, how the support process works, and how to open a support ticket with Azure. If
you are anticipating a high-scale event, Azure support can assist you with increasing your service limits. For more
information, see the Azure Support FAQs.
Follow least-privilege principles when granting access to resources. Carefully manage access to resources.
Access should be denied by default, unless a user is explicitly given access to a resource. Only grant a user access to
what they need to complete their tasks. Track user permissions and perform regular security audits.
Use role-based access control. Assigning user accounts and access to resources should not be a manual
process. Use role-based access control (RBAC) to grant access based on Azure Active Directory identities and groups.
Use a bug tracking system to track issues. Without a good way to track issues, it's easy to miss items,
duplicate work, or introduce additional problems. Don't rely on informal person-to-person communication to track
the status of bugs. Use a bug tracking tool to record details about problems, assign resources to address them, and
provide an audit trail of progress and status.
Manage all resources in a change management system. All aspects of your DevOps process should be
included in a management and versioning system, so that changes can be easily tracked and audited. This includes
code, infrastructure, configuration, documentation, and scripts. Treat all these types of resources as code throughout
the test/build/review process.
Use checklists. Create operations checklists to ensure processes are followed. It's common to miss something in a
large manual, and following a checklist can force attention to details that might otherwise be overlooked. Maintain
the checklists, and continually look for ways to automate tasks and streamline processes.
For more about DevOps, see What is DevOps? on the Visual Studio site.
Automation overview: Goals, best practices, and
types

Automation has revolutionized how businesses operate, and this trend continues to evolve. Businesses have moved to automating manual processes so that engineers can focus their attention on tasks that add business value. Automating business solutions allows you to:
Activate resources on demand.
Deploy solutions rapidly.
Minimize human error in configuring repetitive tasks.
Produce consistent and repeatable results.
To learn more, see Deployment considerations for automation.

Goals of automation
When automating technical processes and configurations, a common approach for some organizations is to
automate what they can and leave the more difficult processes for humans to perform manually.
A goal of automation is to make tools that do what humans can do, only better. For example, a human can perform
any given task once. But when the task requires repetitive runs, especially over long time periods, an automated
system is better equipped to do this, with more predictable, error-free results. Increasing speed is another goal of automation. When you pursue these goals, you can build systems that are faster, more consistent, and able to run repeatedly, even on a daily basis.

Automate toil to improve efficiency


Most automation involves a percentage of toil. Toil is the operational work that is related to a process that is
manual, repetitive, can be automated, and has minimal value. Toil is counterproductive, but in many organizations a small amount of it is unavoidable. It becomes an issue when too much toil slows progress. A
project's production velocity will decrease if engineers are continuously interrupted by manual tasks attributed to
toil, either planned or unplanned. Too much toil can impact job satisfaction. Engineers become dissatisfied when
they find themselves spending too much time on operational toil rather than on other projects.
Automation should be developed, and increased, so engineers can eliminate future toil. By reducing toil, engineers
can concentrate on innovating business solutions.
For more information, see Toil automation.

Automation best practices


Ensure consistency – The more manual processes are involved, the more prone they are to human error. This can lead to mistakes, oversights, reduced data quality, and reliability problems.
Centralize mistakes – Choose a platform that lets you fix bugs in one place so that they are fixed everywhere. This eliminates the need to rely on person-to-person communication, which increases the chance of error and the possibility of the bug being re-introduced.
Identify issues quickly – Complex issues may not always be identified in a timely manner. However, with good automation, detection of these issues should occur quickly.
Maximize employee productivity – Automation leads to more innovative solutions, additional automation, and in general provides more value to the business. This in turn raises morale and job satisfaction. Automation does not replace jobs with technology. For example, once a process is automated, training and maintenance can be greatly reduced or eliminated. This frees engineers to spend less time on manual processes and more time on automating business solutions.

Types of automation
Three types of automation are described in this article: infrastructure deployment, infrastructure configuration, and
operational tasks. These categories share the same goals and best practices mentioned previously. They differ in
areas where Azure provides solutions that help achieve optimal automation. Other types of automation such as
continuous deployment and continuous delivery are described further in the Operational Excellence pillar.
Infrastructure deployment
As businesses move to the cloud, they need to repeatedly deploy their solutions and know that their infrastructure
is in a reliable state. To meet these challenges, you can automate deployments using a practice referred to as
infrastructure as code. In code, you define the infrastructure that needs to be deployed.
Although there are many deployment technologies you can use with Azure, this article describes two of the more
popular technologies:
Azure Resource Manager (ARM) templates
Terraform
These technologies use a declarative approach. This lets you state what you intend to deploy without having to
write the sequence of programming commands to create it. You can deploy not only virtual machines, but also the
network infrastructure, storage systems, and any other resources you may need.
Infrastructure configuration
If you don't manage configuration carefully, your business could encounter disruptions such as systems outages
and security issues. Optimal configuration can enable you to quickly detect and correct configurations that could
interrupt or slow performance.
Configurations such as installing software on a virtual machine, adding data to a database, or starting pods in an
Azure Kubernetes Service cluster need to access Azure through an endpoint that is specific to your instance,
outside of the exposed REST APIs. This enables the configuration tools to use agents, networking, or other access
methods to provide resource-specific configuration options.
For example, when deploying to Azure, you may need to run post-deployment virtual machine configuration or run
other arbitrary code to bootstrap the deployed Azure resources. Another example is configuration tools that can be
used to configure and manage the ongoing configuration and state of Azure virtual machines.
Operational tasks
As the demand for speed in performing operational tasks increases over time, you are expected to deliver things
faster and faster. The old way of manually performing operational tasks won't scale to that type of demand now.
This is where automation can help. To meet on-demand delivery using an automation platform, you need to develop automation components (such as runbooks and configurations), efficiently integrate with the systems that are already in place, and operate and troubleshoot them.
Advantages of automating operational tasks include:
Optimize and extend existing processes (for example, using a PowerShell module or REST API).
Deliver flexible and reliable services.
Lower costs.
Improve predictability.
Two popular options for automating operational tasks are:
Azure Functions - Run code without managing the underlying infrastructure where the code runs (a minimal sketch follows this list).
Azure Automation - Automate operational tasks as code (runbooks and configurations) and execute them on demand.
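As a minimal sketch of the Azure Functions option, the timer-triggered function below runs a scheduled operational task without any servers to manage. It assumes the classic Python programming model, where a function.json file alongside the script defines the timer binding and its schedule; the cleanup logic itself is a placeholder and the function only runs inside the Functions host:

```python
# Timer-triggered Azure Function sketch (classic Python model). The binding name
# "nightlytimer" and its schedule are assumed to be defined in function.json.
import logging

import azure.functions as func


def main(nightlytimer: func.TimerRequest) -> None:
    if nightlytimer.past_due:
        logging.warning("Timer is running late")
    # Placeholder for the real task: prune old snapshots, rotate keys, resize agents, etc.
    logging.info("Nightly cleanup task executed")
```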
For more information, see Automation. To see a Microsoft Ignite video, see Automating Operational and
Management Tasks.
Next steps
Automate Repeatable Infrastructure
Repeatable Infrastructure

Historically, deploying a new service or application has involved manual work such as procuring and preparing hardware, configuring operating environments, and enabling monitoring solutions. Ideally, an organization would
have multiple environments in which to test deployments. These test environments should be similar enough to
production that deployment and run time issues are detected before deployment to production. This manual work
takes time, is error-prone, and can produce inconsistencies between the environments if not done well.
Cloud computing changes the way we procure infrastructure. No longer are we unboxing, racking, and cabling
physical infrastructure. We have internet accessible management portals and REST interfaces to help us. We can
now provision virtual machines, databases, and other cloud services on demand and globally. When we no longer
need cloud services, they can be easily deleted. However, cloud computing alone does not remove the effort and
risk in provisioning infrastructure. When using a cloud portal to build systems, many of the same manual
configuration tasks remain. Application servers require configuration, databases need networking, and firewalls
need firewalling.

Deploy infrastructure with code


To fully realize deployment optimization, reduce configuration effort, and automate full environments'
deployment, something more is required. One option is referred to as infrastructure as code.
Cloud computing changes so much about deploying and provisioning infrastructure. Not only can we procure
compute, data, and many other service types on demand, we also have APIs for doing so. Because of the API-driven
nature of cloud services, programmatically deploying and configuring them makes sense. The concept known
as infrastructure as code involves using a declarative framework to describe your desired service configuration.
Infrastructure as code solutions translate the declared configuration into the proper cloud provider API requests,
which, once deployed, result in usable cloud services. Benefits of using infrastructure as code include:
Deploy similarly configured infrastructure across multiple environments e.g., test and production.
Deploy all required components as a single unit (infrastructure, monitoring solutions, and configured alerts).
Version control infrastructure in a source control solution.
Use continuous integration solutions to manage and test infrastructure deployments.
You can use many declarative infrastructure deployment technologies with Azure; here we detail two of the most
common.

Automate deployments with ARM Templates


Azure Resource Manager (ARM) Templates provide an Azure native infrastructure as code solution. ARM
Templates are written in a language derived from JavaScript Object Notation (JSON), and they define the
infrastructure and configurations for Azure deployments. An ARM template is declarative: you state what you intend to
deploy, provide configuration values, and the Azure engine takes care of making the necessary Azure REST API PUT
requests. Additional benefits of using ARM templates for infrastructure deployments include:
Parallel resource deployment - the Azure deployment engine sequences resource deployments based on
defined dependencies. If dependencies do not exist between two resources, they are deployed at the same
time.
Modular deployments - ARM templates can be broken up into multiple template files for reusability and
modularization.
Day one resource support - ARM templates support all Azure resources and resource properties as they
are released.
Extensibility - Azure deployments can be extended using deployment scripts and other automation solutions.
Validation - Azure deployments are evaluated against a validation API to catch configuration mistakes.
Testing - the ARM template test toolkit provides a static code analysis framework for testing ARM templates.
Change preview - ARM template what-if allows you to see what will be changed before deploying an ARM
template.
Tooling - Language service extensions are available for both Visual Studio Code and Visual Studio to assist in
authoring ARM templates.
The following example demonstrates a simple ARM template that deploys a single Azure Storage account. In this
example, a single parameter is defined to take in a name for the storage account. Under the resources section, a
storage account is defined, the storageName parameter is used to provide a name, and the storage account details
are defined. See the included documentation for an in-depth explanation of the different sections and
configurations for ARM templates.

{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "storageName": {
      "type": "string",
      "defaultValue": "newstorageaccount"
    }
  },
  "resources": [
    {
      "name": "[parameters('storageName')]",
      "type": "Microsoft.Storage/storageAccounts",
      "apiVersion": "2019-06-01",
      "location": "[resourceGroup().location]",
      "kind": "StorageV2",
      "sku": {
        "name": "Premium_LRS",
        "tier": "Premium"
      }
    }
  ]
}

Learn more
Documentation: What are ARM templates
Learn module: Deploy consistent infrastructure with ARM Templates
Code Samples: ARM templates

Automate deployments with Terraform


Terraform is a declarative framework for deploying and configuring infrastructure that supports many private and
public clouds, Azure being one of them. It has the main advantage of offering a cloud-agnostic framework. While
Terraform configurations are specific to each cloud, the framework itself is the same for all of them. Terraform
configurations are written in a domain-specific language (DSL) called Hashicorp Configuration Language.
The following example demonstrates a simple Terraform configuration that deploys an Azure resource group and
a single Azure Storage account.
resource "azurerm_resource_group" "example" {
name = "newStorageAccount"
location = "eastus"
}

resource "azurerm_storage_account" "example" {


name = "storageaccountname"
resource_group_name = azurerm_resource_group.example.name
location = azurerm_resource_group.example.location
account_tier = "Standard"
account_replication_type = "GRS"
}

Note that the Terraform provider for Azure is an abstraction on top of the Azure APIs. This abstraction is beneficial
because it hides API complexity, but it comes at a cost: the Terraform provider for Azure does not always provide
full parity with the capabilities of the Azure APIs.
Learn more
Documentation: Using Terraform on Azure
Next steps
Automate infrastructure configuration
Configure infrastructure

When working with Azure, many services can be created and configured programmatically using automation or
infrastructure as code tooling. These tools access Azure through the exposed REST APIs or what we refer to as the
Azure control plane. For example, an Azure Network Security Group can be deployed, and security group rules
created using an Azure Resource Manager template. The Network Security Group and its configuration are
exposed through the Azure control plane and natively accessible.
Other configurations, such as installing software on a virtual machine, adding data to a database, or starting pods
in an Azure Kubernetes Service cluster cannot be accessed through the Azure control plane. These actions require a
different set of configuration tools. We consider these configurations to be on the Azure data plane side, that is, not
exposed through the Azure REST APIs. Data plane access enables tools to use agents, networking, or other access
methods to provide resource-specific configuration options.
For example, when deploying a set of virtual machines to Azure, you may also want to install and configure a web
server, stage content, and then make the content available on the internet. Furthermore, if the virtual machine
configuration changes and no longer aligns with the configuration definition, you may want a configuration
management system to remediate the configuration. Many options are available for these data plane
configurations. This document details several and provides links for in-depth information.

Bootstrap automation
When deploying to Azure, you may need to run post-deployment virtual machine configuration or run other
arbitrary code to bootstrap the deployed Azure resources. Several options are available for these bootstrapping
tasks and are detailed in the following sections of this document.
Azure VM extensions
Azure virtual machine extensions are small packages that run post-deployment configuration and automation on
Azure virtual machines. Several extensions are available for many different configuration tasks, such as running
scripts, configuring antimalware solutions, and configuring logging solutions. These extensions can be installed
and run on virtual machines using an ARM template, the Azure CLI, Azure PowerShell module, or the Azure portal.
Each Azure VM has a VM Agent installed, and this agent manages the lifecycle of the extension.
A typical VM extension use case would be to use a custom script extension to install software, run commands, and
perform configurations on a virtual machine or virtual machine scale set. The custom script extension uses the
Azure virtual machine agent to download and execute a script. The custom script extensions can be configured to
run as part of infrastructure as code deployments such that the VM is created, and then the script extension is run
on the VM. Extensions can also be run outside of an Azure deployment using the Azure CLI, PowerShell module, or
the Azure portal.
In the following example, the Azure CLI is used to deploy a custom script extension to an existing virtual machine,
which installs an Nginx web server.

az vm extension set \
  --resource-group myResourceGroup \
  --vm-name myVM --name customScript \
  --publisher Microsoft.Azure.Extensions \
  --settings '{"commandToExecute": "apt-get install -y nginx"}'

Learn more
Use the included code sample to deploy a virtual machine and configure a web server on that machine with the
custom script extension.
Documentation: Azure virtual machine extensions
Code Samples: Configure VM with script extension during Azure deployment
cloud-init
cloud-init is a widely used industry tool for configuring Linux virtual machines on first boot. Much like the Azure custom
script extension, cloud-init allows you to install packages and run commands on Linux virtual machines. cloud-init
can be used for tasks like software installation, system configuration, and content staging. Azure includes many
cloud-init-enabled Marketplace virtual machine images across many of the most well-known Linux distributions. For
a full list, see cloud-init support for virtual machines in Azure.
To use cloud-init, create a text file named cloud-init.txt and enter your cloud-init configuration. In this example, the
Nginx package is added to the cloud-init configuration.

#cloud-config
package_upgrade: true
packages:
  - nginx

Create a resource group for the virtual machine.

az group create --name myResourceGroupAutomate --location eastus

Create the virtual machine, specifying the --custom-data parameter with the cloud-init configuration file name.

az vm create \
  --resource-group myResourceGroupAutomate \
  --name myAutomatedVM \
  --image UbuntuLTS \
  --admin-username azureuser \
  --generate-ssh-keys \
  --custom-data cloud-init.txt

On boot, cloud-init will use the system's native package management tool to install Nginx.
Learn more
Documentation: cloud-init support for virtual machines in Azure
Azure deployment script resource
When performing Azure deployments, you may need to run arbitrary code for bootstrapping tasks such as managing
user accounts or Kubernetes pods, or querying data from a non-Azure system. Because none of these operations are
accessible through the Azure control plane, some other mechanism is required for performing this automation. To
run arbitrary code with an Azure deployment, check out the Microsoft.Resources/deploymentScripts Azure
resource.
The deployment script resource behaves similarly to any other Azure resource. It can:
Be used in an ARM template.
Declare ARM template dependencies on other resources.
Consume input and produce output.
Use a user-assigned managed identity for authentication.
When deployed, the deployment script runs PowerShell or Azure CLI commands and scripts. Script execution and
logging can be observed in the Azure portal or with the Azure CLI and PowerShell module. Many options can be
configured like environment variables for the execution environment, timeout options, and what to do with the
resource after a script failure.
The following example shows an ARM template snippet with the deployment script resource configured to run a
PowerShell script.

{
  "type": "Microsoft.Resources/deploymentScripts",
  "apiVersion": "2019-10-01-preview",
  "name": "runPowerShellScript",
  "location": "[resourceGroup().location]",
  "kind": "AzurePowerShell",
  "identity": {
    "type": "UserAssigned",
    "userAssignedIdentities": {
      "[parameters('identity')]": {}
    }
  },
  "properties": {
    "forceUpdateTag": "1",
    "azPowerShellVersion": "3.0",
    "arguments": "[concat('-sqlServer ', parameters('sqlServer'))]",
    "primaryScriptUri": "[variables('script')]",
    "timeout": "PT30M",
    "cleanupPreference": "OnSuccess",
    "retentionInterval": "P1D"
  }
}

Learn more
Documentation: Use deployment scripts in templates

Configuration Management
Configuration management tools can be used to configure and manage the ongoing configuration and state of
Azure virtual machines. Three popular options are Azure Automation State Configuration, Chef, and Puppet.
Azure Automation State Configuration
Azure Automation State Configuration is a configuration management solution built on top of PowerShell Desired
State Configuration (DSC). State configuration works with Azure virtual machines, on-premises machines, and
machines in a cloud other than Azure. Using state configuration, you can import PowerShell DSC resources and
assign them to many virtual machines from a central location. Once each endpoint has evaluated and/or applied the
desired state, state compliance is reported to Azure and can be seen on a built-in dashboard.
The following example uses PowerShell DSC to ensure that NGINX is installed on Linux systems.

configuration linuxpackage {

    Import-DSCResource -Module nx

    Node "localhost" {
        nxPackage nginx {
            Name   = "nginx"
            Ensure = "Present"
        }
    }
}

Once imported into Azure State Configuration and assigned to nodes, the state configuration dashboard provides
compliance results.
Learn more
Use the included code sample to deploy Azure Automation State Configuration and several Azure virtual machines.
The virtual machines are also onboarded to state configuration, and a configuration applied.
Documentation: Get started with Azure Automation State Configuration
Code Sample: Deploy DSC and VMs with an ARM template
Chef
Chef is an automation platform that helps define how your infrastructure is configured, deployed, and managed.
Additional components include Chef Habitat, for application lifecycle automation rather than infrastructure, and
Chef InSpec, which helps automate compliance with security and policy requirements. Chef Clients are installed
on target machines, with one or more central Chef Servers that store and manage the configurations.
Learn more
Documentation: An Overview of Chef
Puppet
Puppet is an enterprise-ready automation platform that handles the application delivery and deployment process.
Agents are installed on target machines to allow Puppet Master to run manifests that define the desired
configuration of the Azure infrastructure and virtual machines. Puppet can integrate with other solutions such as
Jenkins and GitHub for an improved DevOps workflow.
Learn more
Documentation: How Puppet works
Next steps
Automate operational tasks
Automate operational tasks

Operational tasks can include any action or activity you may perform while managing systems, system access, and
processes. Some examples include rebooting servers, creating accounts, and shipping logs to a data store. These
tasks may occur on a schedule, as a response to an event or monitoring alert, or ad hoc based on external factors.
Like many other activities related to managing computer systems, these tasks are often performed manually,
which takes time and is error-prone.
Many of these operational tasks can and should be automated. Using scripting technologies and related solutions,
you can shift effort from manually performing operational tasks toward building automation for these tasks. In
doing so, you gain several benefits, including:
Reduced time to perform an action.
Reduced risk in performing the action.
Automated response to events and alerts.
Increased human capacity for further innovation.
When working in Azure, you have many options for automating operational tasks. This document details some of
the more popular.

Azure Functions
Azure Functions allows you to run code without managing the underlying infrastructure on which the code runs.
Functions provide a cost-effective, scalable, and event-driven platform for building applications and running
operational tasks. Functions support running code written in C#, Java, JavaScript, Python, and PowerShell.
When creating a function app, you select a hosting plan. The hosting plan controls how the function app is scaled,
resource availability, and the availability of advanced features such as virtual network connectivity and startup time.
The hosting plan also influences the cost.
Functions hosting plans:
Consumption - Default hosting plan, pay only for Function execution time, configurable timeout period,
automatic scale.
Premium - Faster start, VNet connectivity, unlimited execution duration, premium instance sizes, more
predictable pricing.
App Service Plan - Functions run on dedicated virtual machines and can use custom images.
For full details on hosting plans, see Azure Functions scale and hosting.
Functions provide event-driven automation; each function has a trigger associated with it. These triggers are what
run the functions. Common triggers include:
HTTP / Webhook - Function is run when an HTTP request is received.
Queue - Function is run when a new message arrives in a message queue.
Blob storage - Function is run when a new blob is created in a storage container.
Timer - Function is run on a schedule.
Once an event has occurred that initiates a Function, you may also want to consume data from this event or from
another source. Once a Function has completed, you may want to publish or push data to an Azure service
such as Blob Storage. Input and output are achieved using input and output bindings. For more information
about triggers and bindings, see Azure Functions triggers and binding concepts.
Both PowerShell and Python are common languages for automating everyday operational tasks. Because Azure
Functions supports both of these languages, it is an excellent platform for hosting, running, and auditing
automated operational tasks. For example, let's assume that you would like to build a solution to facilitate self-
service account creation. An Azure PowerShell Function could be used to create the account in Azure Active
Directory. An HTTP trigger can be used to initiate the Function, and an input binding configured to pull the account
details from the HTTP request body. The only remaining piece would be a solution that consumes the account
details and creates the HTTP requests against the Function.
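The following sketch illustrates what such a PowerShell function might look like. It is only an outline under stated assumptions, not a documented implementation: the binding names, the request shape, and the use of New-AzADUser (from the Az.Resources module, which would need to be referenced in the function app's requirements.psd1) are assumptions made for illustration.

# run.ps1 for an HTTP-triggered PowerShell function (sketch only).
# Assumes function.json binds the HTTP trigger to $Request and declares an
# HTTP output binding named Response, and that Az.Resources is available.
using namespace System.Net

param($Request, $TriggerMetadata)

# Pull the account details from the HTTP request body.
$displayName       = $Request.Body.displayName
$userPrincipalName = $Request.Body.userPrincipalName

if ($displayName -and $userPrincipalName) {
    # Create the account in Azure Active Directory (New-AzADUser, Az.Resources module).
    $password = ConvertTo-SecureString -String ([guid]::NewGuid().ToString()) -AsPlainText -Force
    New-AzADUser -DisplayName $displayName `
                 -UserPrincipalName $userPrincipalName `
                 -MailNickname ($userPrincipalName.Split('@')[0]) `
                 -Password $password | Out-Null
    $status = [HttpStatusCode]::OK
    $body   = "Created account for $userPrincipalName"
}
else {
    $status = [HttpStatusCode]::BadRequest
    $body   = "Please pass displayName and userPrincipalName in the request body."
}

# Return the HTTP response through the output binding.
Push-OutputBinding -Name Response -Value ([HttpResponseContext]@{
    StatusCode = $status
    Body       = $body
})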
Learn more
Documentation: Azure Functions PowerShell developer guide
Documentation: Azure Functions Python developer guide

Azure Automation
PowerShell and Python are popular programming languages for automating operational tasks. Using these
languages, performing operations like restarting services, moving logs between data stores, and scaling
infrastructure to meet demand can be expressed in code and executed on demand. Alone, these languages do not
offer a platform for centralized management, version control, or execution history. The languages also lack a native
mechanism for responding to events like monitoring driven alerts. To provide these capabilities, an automation
platform is needed.
Azure Automation provides an Azure-hosted platform for hosting and running PowerShell and Python code across
Azure, non-Azure cloud, and on-premises environments. PowerShell and Python code is stored in an Azure
Automation Runbook, which has the following attributes:
Execute Runbooks on demand, on a schedule, or through a webhook.
Execution history and logging.
Integrated secrets store.
Source Control integration.
Azure Automation provides a portal experience for managing Azure Automation Runbooks. Use the included code
sample (ARM template) to deploy an Azure Automation account and automation runbook, and explore Azure
Automation for yourself.
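As a simple illustration, the following is a sketch of a PowerShell runbook that restarts every virtual machine in a given resource group. The resource group name and the use of a system-assigned managed identity on the Automation account are assumptions for the example.

# Sample Azure Automation runbook (sketch): restart all VMs in a resource group.
param(
    [Parameter(Mandatory = $true)]
    [string] $ResourceGroupName
)

# Authenticate using the Automation account's managed identity
# (assumes a system-assigned identity is enabled and has rights to restart VMs).
Connect-AzAccount -Identity | Out-Null

# Enumerate and restart each virtual machine in the resource group.
Get-AzVM -ResourceGroupName $ResourceGroupName | ForEach-Object {
    Write-Output "Restarting $($_.Name)..."
    Restart-AzVM -ResourceGroupName $ResourceGroupName -Name $_.Name | Out-Null
}

Write-Output "All virtual machines in '$ResourceGroupName' have been restarted."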

Learn more
Documentation: What is Azure Automation?

Scale operations
So far, this document has detailed options for scripting operational tasks; however, many Azure services come with
built-in automation, particularly in scale operations. As application demand increases (transactions, memory
consumption, and compute availability), you may need to scale the application hosting platform so that requests
continue to be served. As demand decreases, scaling back not only appropriately resizes your application but also
reduces operational cost.
In cloud computing, scale activities are classified into two buckets:
Scale-up - Increasing the capacity of existing resources (for example, moving to a larger VM size or a higher
pricing tier) to meet demand.
Scale-out - Adding more instances or infrastructure to share the load and meet demand.
Many Azure services can be scaled up by changing the pricing tier of that service. Generally, this operation would
need to be performed manually or using detection logic and custom automation.
Some Azure services support automatic scale-out, which is the focus of this section.
Azure Monitor autoscale
Azure Monitor autoscale can be used to autoscale Virtual Machine Scale Sets, Cloud Services, App Service Web
Apps, and the API Management service. To configure scale-out operations for these services in the Azure portal,
select the service, and then select Scale out under the resource settings. Select Custom autoscale to configure
autoscaling rules.
Automatic scale operations can also be configured using an Azure Resource Manager Template, the Azure
PowerShell module, and the Azure CLI.
When creating the autoscale rules, configure minimum and maximum instance counts. These settings prevent
inadvertent costly scale operations. Next, configure autoscale rules, at minimum one to add more instances, and
one to remove instances when no longer needed. Azure Monitor autoscale rules give you fine-grain control over
when a scale operation is initiated. See the Learn more section below for more information on configuring these
rules.
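As a rough sketch of what this looks like with the Azure PowerShell module, the example below defines a scale-out rule, a scale-in rule, and an autoscale setting for a virtual machine scale set. The resource IDs and names are placeholders, and the Az.Monitor cmdlets shown (New-AzAutoscaleRule, New-AzAutoscaleProfile, Add-AzAutoscaleSetting) reflect the module versions available around the time of writing; newer module versions expose different cmdlet shapes.

# Sketch: CPU-based autoscale rules for a virtual machine scale set (placeholder IDs).
$vmssId = "/subscriptions/<sub-id>/resourceGroups/myResourceGroup/providers/Microsoft.Compute/virtualMachineScaleSets/myScaleSet"

# Scale out by 1 instance when average CPU exceeds 70% over 5 minutes.
$ruleOut = New-AzAutoscaleRule -MetricName "Percentage CPU" -MetricResourceId $vmssId `
    -Operator GreaterThan -MetricStatistic Average -Threshold 70 `
    -TimeGrain 00:01:00 -TimeWindow 00:05:00 `
    -ScaleActionCooldown 00:05:00 -ScaleActionDirection Increase -ScaleActionValue 1

# Scale in by 1 instance when average CPU drops below 30% over 5 minutes.
$ruleIn = New-AzAutoscaleRule -MetricName "Percentage CPU" -MetricResourceId $vmssId `
    -Operator LessThan -MetricStatistic Average -Threshold 30 `
    -TimeGrain 00:01:00 -TimeWindow 00:05:00 `
    -ScaleActionCooldown 00:05:00 -ScaleActionDirection Decrease -ScaleActionValue 1

# Keep between 2 and 10 instances, defaulting to 2, to avoid costly runaway scaling.
$autoscaleProfile = New-AzAutoscaleProfile -Name "cpuProfile" -DefaultCapacity 2 `
    -MinimumCapacity 2 -MaximumCapacity 10 -Rule $ruleOut, $ruleIn

Add-AzAutoscaleSetting -Name "myAutoscaleSetting" -ResourceGroupName "myResourceGroup" `
    -Location "eastus" -TargetResourceId $vmssId -AutoscaleProfile $autoscaleProfile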

Learn more
Documentation: Azure Monitor autoscale overview
Azure Kubernetes Service
Azure Kubernetes Service (AKS) offers an Azure managed Kubernetes cluster. When considering scale operations
in Kubernetes, there are two components:
Pod scaling - Increase or decrease the number of load balanced pods to meet application demand.
Node scaling - Increase or decrease the number of cluster nodes to meet cluster demand.
Azure Kubernetes Service includes automation to facilitate both of these scale operation types.
Horizontal pod autoscaler
Horizontal pod autoscaler (HPA) monitors resource demand and automatically scales pod replicas. When
configuring horizontal pod autoscaling, you provide the minimum and maximum pod replicas that a cluster can
run and the metrics and thresholds that initiate the scale operation. To use horizontal pod autoscaling, each pod
must be configured with resource requests and limits, and a HorizontalPodAutoscaler Kubernetes object must
be created.
The following Kubernetes manifest demonstrates resource requests on a Kubernetes pod and also the definition of
a horizontal pod autoscaler object.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: azure-vote-back
spec:
  replicas: 1
  selector:
    matchLabels:
      app: azure-vote-back
  template:
    metadata:
      labels:
        app: azure-vote-back
    spec:
      nodeSelector:
        "beta.kubernetes.io/os": linux
      containers:
      - name: azure-vote-back
        image: redis
        # Resource requests and limits
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 250m
            memory: 256Mi
        ports:
        - containerPort: 6379
          name: redis
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: azure-vote-back-hpa
spec:
  maxReplicas: 10 # define max replica count
  minReplicas: 3  # define min replica count
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: azure-vote-back
  targetCPUUtilizationPercentage: 50 # target CPU utilization

Cluster autoscaler
Where horizontal pod autoscaling is a response to demand on a specific application or service running in a
Kubernetes cluster, cluster autoscaling responds to demand on the entire Kubernetes cluster itself. If a Kubernetes
cluster does not have enough compute resources or nodes to facilitate all requested pods' resource requests,
some of these pods will enter a non-scheduled or pending state. In response to this situation, additional nodes can
be automatically added to the cluster. Conversely, once compute resources have been freed up, the cluster nodes
can automatically be removed to match steady-state demand.
Cluster autoscaler can be configured when creating an AKS cluster. The following example demonstrates this
operation with the Azure CLI. This operation can also be completed with an Azure Resource Manager template.
az aks create \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --node-count 1 \
  --vm-set-type VirtualMachineScaleSets \
  --load-balancer-sku standard \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 3

Cluster autoscaler can also be configured on an existing cluster using the following Azure CLI command.

az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 3

See the included documentation for information on fine-grained control over cluster autoscale operations.
Learn more
Documentation: Horizontal pod autoscaling
Documentation: AKS cluster autoscaler
Overview of the performance efficiency pillar

Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an
efficient manner. Before the cloud became popular, when it came to planning how a system would handle
increases in load, many organizations intentionally provisioned workloads to be oversized to meet business
requirements. This might make sense in on-premises environments because it ensured capacity during peak
usage. Capacity reflects resource availability (CPU and memory). This was a major consideration for processes that
would be in place for a number of years.
Just as you needed to anticipate increases in load in on-premises environments, you need to anticipate increases
in cloud environments to meet business requirements. One difference is that you may no longer need to make
long-term predictions for anticipated changes to ensure that you will have enough capacity in the future. Another
difference is in the approach used to manage performance.

What is scalability and why is it important?


An important consideration in achieving performance efficiency is to consider how your application scales and to
implement PaaS offerings that have built-in scaling operations. Scalability is the ability of a system to handle
increased load. Services covered by Azure Autoscale can scale automatically to match demand and accommodate
workload. They scale out to ensure capacity during workload peaks and scale back in automatically when the peak
drops.
In the cloud, the ability to take advantage of scalability depends on your infrastructure and services. Some
platforms, such as Kubernetes, were built with scaling in mind. Virtual machines, on the other hand, may not scale
as easily although scale operations are possible. With virtual machines, you may want to plan ahead to avoid
scaling infrastructure in the future to meet demand. Another option is to select a different platform such as Azure
virtual machines scale sets.
With scalability, you only need to predict the current average and peak load for your workload. Payment plan
options allow you to manage this: depending on the service, you pay per minute or per hour for a designated time
period.

Principles
Follow these principles to guide you through improving performance efficiency:
Become Data-driven - Embrace a data-driven culture to deliver timely insights to everyone in your
organization across all your data. To harness this culture, get the best performance from your analytics
solution across all your data, ensure data has the security and privacy needed for your business
environment, and make sure you have tools that enable everyone in your organization to gain insights
from your data.
Avoid antipatterns - A performance antipattern is a common practice that is likely to cause scalability
problems when an application is under pressure. For example, you can have an application that behaves as
expected during performance testing. However, when it is released to production and starts to handle live
workloads, performance decreases. Scalability problems such as rejecting user requests, stalling, or
throwing exceptions may arise. To learn how to identify and fix these antipatterns, see Performance
antipatterns for cloud applications.
Perform load testing to set limits - Load testing helps ensure that your applications can scale and do
not go down during peak traffic. Load test each application to understand how it performs at various scales.
To learn about Azure service limits, see Managing limits.
Understand billing for metered resources - Your business requirements will determine the tradeoffs
between cost and level of performance efficiency. Azure doesn't directly bill based on the resource cost.
Charges for a resource, such as a virtual machine, are calculated by using one or more meters. Meters are
used to track a resource’s usage over time. These meters are then used to calculate the bill.
Monitor and optimize - Lack of monitoring new services and the health of current workloads are major
inhibitors in workload quality. The overall monitoring strategy should take into consideration not only
scalability, but resiliency (infrastructure, application, and dependent services) and application performance
as well. For purposes of scalability, looking at the metrics would allow you to provision resources
dynamically and scale with demand.

Next steps
Performance efficiency impacts the entire architecture spectrum. Bridge gaps in your knowledge of Azure by
reviewing the five pillars of the Microsoft Azure Well-Architected Framework.
To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, see the
Microsoft Azure Well-Architected Review.
Designing scalable Azure applications

Application design is critical to handling scale as load increases.

Choosing the right database


The choice of database and the overall design of the data tier can greatly affect an application's performance and
scalability. Database reads and writes involve a network call and storage I/O, both of which are expensive operations.
Choosing the right database service to store and retrieve data is therefore a critical decision and must be carefully
considered to ensure overall application scalability. Azure has many first party database services that will fit most
needs. In addition, there are a series of third-party options that can be considered from Azure Marketplace.
The first decision to consider is whether the application storage requirements fit a relational design (SQL) or a key-
value/document/graph design (NoSQL). Some applications may use both a relational database and a NoSQL
database for different storage needs.
If an RDBMS is considered optimal, Azure offers several PaaS options that fully manage hosting and operations of
the database. Azure SQL Database can host single databases or multiple databases (Azure SQL Database Managed
Instance). The suite of offerings span requirements crossing performance, scale, size, resiliency, disaster recovery,
and migration compatibility. Azure offers the following PaaS relational database services:
Azure SQL Database
Azure Database for MySQL
Azure Database for PostgreSQL
Azure Database for MariaDB
If the application requirements are best suited to a key/value format or document storage (for example, JSON, XML,
or YAML), a NoSQL solution should be considered first to optimize for performance, scale, and cost savings.
Azure provides two managed services that are optimized for NoSQL solutions: Azure Cosmos DB and Azure Cache for
Redis. For document and graph databases, Cosmos DB is a highly recommended first-party service offering
extreme scale and performance.

Dynamic service discovery for microservices applications


Clients can use either client-side discovery or server-side discovery to determine the location of a service instance
to send requests. Service discovery is a process of figuring out how to connect to a service, and if done correctly,
can lead to software that scales out as usage increases. Decomposing an application into microservices is a practice
that can directly lead to native scalability given a service's ability to be scaled and updated individually. This level of
decoupling is an architectural best practice. However, when many microservices are in play, managing, executing,
and operationalizing the services becomes a bottleneck, leading to the incorporation of an orchestration platform
like Kubernetes or Service Fabric. Both of these platforms provide built-in services for executing, scaling, and
operating a microservices architecture; and one of those key services is discovery and finding where a particular
service is running.
The client of the service should not need to know or care where the service executes or how many instances of the
service are executing. It is the job of the orchestration platform to connect the client to the service through a
decoupled discovery service, such as DNS, that major orchestration platforms all provide. Kubernetes, for example,
exposes each service through a cluster DNS name that resolves to the service's current instances. Dynamic service
discovery can make an application more scalable, because each service can be scaled independently, without the
client having knowledge of the individual instances.
Use the service registry pattern.

Connection pooling
Establishing connections to databases is typically an expensive operation that involves establishing an
authenticated network connection to the remote database server. Pooling and reusing cached connections for new
database operations is a common best practice for improving both application performance and scalability. In many
cases, a default pool size might consist of only a small handful of connections that performs quickly in basic test
scenarios, but becomes a bottleneck under scale when the pool is exhausted. Establishing a pool size that maps to
the number of concurrent transactions supported on each application instance is a best practice. Each database and
application platform will have slightly different requirements for the right way to set up and leverage the pool.
Review SQL Server Connection Pooling for a .NET code example using SQL Server and Azure Database. In all cases,
testing is paramount to ensure a connection pool is properly established and working as designed under load.
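As a minimal illustration of the idea (not the .NET sample referenced above), the ADO.NET connection string keywords Min Pool Size and Max Pool Size control the pool that sits behind each distinct connection string. The sketch below assumes Windows PowerShell with System.Data.SqlClient available; the server, database, and credential values are placeholders.

# Sketch: ADO.NET connection pooling controlled through the connection string.
# Server, database, and credentials are placeholders.
$connectionString = "Server=tcp:myserver.database.windows.net,1433;" +
                    "Database=mydb;User ID=sqladmin;Password=<password>;" +
                    "Min Pool Size=10;Max Pool Size=100;"   # size the pool for expected concurrency

1..5 | ForEach-Object {
    $connection = New-Object System.Data.SqlClient.SqlConnection($connectionString)
    $connection.Open()                     # connections with the same string share one pool
    $command = $connection.CreateCommand()
    $command.CommandText = "SELECT 1"
    [void] $command.ExecuteScalar()
    $connection.Close()                    # Close/Dispose returns the connection to the pool
}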

Data Compression
A well-known optimization best practice for scaling web applications is to use a compression strategy to compress
and bundle web pages or API responses. The idea is to shrink the results returned from a page or API to the browser
or client app. Compressing the results returned to clients optimizes network traffic and accelerates the web
application. .NET has built-in framework support for this technique with GZip compression. Review Response
compression in ASP.NET Core for an example of using response compression with ASP.NET Core. This can
dramatically impact scale because the network optimization means less socket processing for the web server,
reducing web server CPU utilization, shortening connections, and thus allowing the web server to handle more
concurrent requests.

Data Locking
When using a relational database behind an Azure hosted application, the app owners must ensure an appropriate
consistency and isolation level when making database connections to achieve maximum performance and scale
out. Consistency and isolation are two of the ACID database properties. Consistency ensures that a transaction
moves the database from one valid state to another and that an incomplete transaction can be rolled back to the
state before it started. Isolation ensures that an executing transaction remains isolated from other executing
transactions. The settings of these properties can dramatically affect the number of users that can concurrently use
a cloud web application in parallel, and therefore need to be carefully considered across the various exposed web
pages and APIs. Review the Azure SQL Database documentation for a detailed summary of these concepts as they
apply to Azure SQL Database.
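For example, Azure SQL Database enables read committed snapshot isolation by default, which reduces reader/writer blocking under concurrency. A quick way to inspect these database-level settings is sketched below; it assumes the SqlServer PowerShell module and uses placeholder server, database, and credential values.

# Sketch: inspect isolation-related settings on an Azure SQL database (placeholder values).
Import-Module SqlServer

$query = @"
SELECT name,
       is_read_committed_snapshot_on,
       snapshot_isolation_state_desc
FROM sys.databases
WHERE name = DB_NAME();
"@

Invoke-Sqlcmd -ServerInstance "myserver.database.windows.net" `
              -Database "mydb" `
              -Username "sqladmin" `
              -Password "<password>" `
              -Query $query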

Asynchronous Programming
When an application calls a remote service using a synchronous call, some of the resources on the client (for
example, the calling thread and network socket) are blocked and unavailable for use elsewhere until the remote
service invocation completes. This pattern is simple, but leads to obvious inefficiencies. Asynchronous programming is an alternative
approach that enables a remote service to be executed without waiting and blocking resources on the client. This is
a critical pattern for enabling cloud scalable software and is available in most modern programming languages and
platforms.
There are many ways to inject asynchronous programming into an application design. In the simplest form, remote
calls can be executed asynchronously using built-in language constructs like async/await in .NET C# (review a
language construct example). .NET also has built-in platform support for asynchronous programming with
task-based and event-based asynchronous patterns.
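For a rough, script-level analogue in PowerShell (the async/await constructs above are C#-specific), a slow remote call can be handed off to a background thread so the caller stays free to do other work. This is only a sketch of the idea; the URL below is a placeholder, and Start-ThreadJob ships with PowerShell 7 (it is available as a separate module for Windows PowerShell).

# Sketch: hand a slow remote call to a background thread instead of blocking the caller.
$uri = "https://example.com/api/report"    # placeholder endpoint

$job = Start-ThreadJob -ScriptBlock {
    # Runs on a separate thread; the caller is not blocked while this completes.
    Invoke-RestMethod -Uri $using:uri
}

# The caller continues with other work here...
Write-Output "Remote call in flight; doing other work."

# Collect the result only when it is actually needed.
$result = Receive-Job -Job $job -Wait -AutoRemoveJob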
Asynchronous patterns are critical for cloud scale, and Azure enables the implementation of cloud patterns for
async request/reply appropriate for long running worker type processing. These patterns simulate what a
programming language like C# offers natively for request/response, but optimized for moving back-end processing
out of a front end. Review the asynchronous Request-Reply pattern for additional information.
Microservices
Building software with microservices is a software engineering technique to decompose software into isolated, fine
grained, and loosely coupled modules that are independently deployed and versioned. Microservices refine and
build on the "component" patterns from the 1990s and early 2000s such as COM (Component Object Model) for
breaking up logic into interface exposed modules. With the internet taking center stage, component models were
combined with web protocols and transitioned into Service-Oriented Architecture, and ultimately Web Services.
Microservices is a type of web service that is optimized for high scale requirements. This is because microservices
are designed to be independent in both deployment and versioning, and therefore can be scaled independently.
So when combined with an orchestration platform designed to execute and manage microservices such as
Kubernetes or Service Fabric, individual services can be right sized, scaled up, scaled down, and dynamically
configured to match user demand. Additionally, microservices can be built and run using cloud serverless
platforms such as Azure Functions that can dynamically scale with a pay-per-use model.
To learn more about these topics, see Building microservices on Azure. For information about a reference
architecture that uses microservices, see Microservices on AKS.

Queuing and batching requests


A variant to asynchronous programming discussed above, queuing services have long been used as a scalable
mechanism to hand-off processing work to a service. Called the Load Leveling pattern, highly scalable queuing
services are natively supported in Azure. The queue is a storage buffer that sits between the caller and the
processing service, takes requests, stores and queues the request, and provides services around the reliable
delivery and management of the queued data.
Using a queue is often the best way to hand off work to a processor service. The processor service receives work by
listening on a queue and dequeuing messages. If items to be processed enter too quickly, the queuing service will
keep them in the queue until the processing service has available resources and asks for a new work item
(message). By leveraging the dynamic nature of Azure Functions, the processor service can easily autoscale on
demand as the queue builds up to meet the intake pressure. Developing processor logic with Azure Functions to
run task logic from a queue is a common, scalable, and cost effective way to using queuing between a client and a
processor.
Azure provides a few native first-party queueing services with Azure Storage Queues (simple queuing service
based on Azure Storage) and Azure Service Bus (message broker service supporting transactions and reduced
latency). Many other third-party options are also available through Marketplace. Review the queue-based load
leveling pattern to learn more.
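As an illustrative sketch (the binding name and queue name are assumptions), a queue-triggered PowerShell Azure Function that processes work items might look like the following; the function app's function.json would declare a queueTrigger binding named QueueItem pointing at the storage queue.

# run.ps1 for a queue-triggered PowerShell function (sketch only).
# Assumes function.json declares a queueTrigger binding named "QueueItem"
# pointing at a storage queue such as "work-items".
param([string] $QueueItem, $TriggerMetadata)

Write-Output "Processing work item: $QueueItem"
Write-Output "Dequeue count: $($TriggerMetadata.DequeueCount)"

# Placeholder for the real processing logic (resize an image, send an email, and so on).
Start-Sleep -Seconds 2

Write-Output "Finished processing work item."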

Session affinity
To design a system to be horizontally scalable, avoid designing a web application that assumes requests have
instance affinity. Stateless, or non-affinity applications, allow the hosting platform (App Services, Kubernetes,
Service Fabric, other) to dynamically add or remove compute instances based on service metrics. This is called
autoscale and works differently depending on the service. A critical requirement for a web application or API to
work in an autoscalable environment is a stateless nature where the application saves its state to an external data
store allowing the app platform to load-balance and route requests across any of the clustered instances. This is
advantageous for service self-healing, removing problematic compute instances, and enabling the latest platform
innovation that leads to higher scale. For older applications running in App Services v1 platforms, avoid using the
Application Request Routing (ARR) session affinity flag to force all requests back to the initial instance.
If the solution implements a long-running task, route requests to this task with the queuing mechanisms discussed
earlier.
Migrate an Azure Cloud Services application to Azure Service Fabric describes best practices about stateless
services for an application that is migrated from old Azure Cloud Services to Azure Service Fabric.
Autoscaling
Many of the compute offerings in Azure offer autoscale to ensure the right amount of resources are available
mapping to service user demand. With autoscale, your service can scale out to additional compute instances during
busy periods and scale back in during silent periods. This is critical for both the service user experience as well as
offering cost savings for the service implementer. Metrics are collected for the resource (for example, CPU and
memory utilization) and the application (requests queued, requests per second). Rules can then be created from
those metrics and/or time schedules to add or remove instances depending on how the rule evaluates.
Many of the common Azure compute offerings support autoscale. An App Service plan allows autoscale rules
to be set for scale-out and scale-in. Azure Kubernetes Service (AKS) offers two levels of autoscale. First, AKS
provides horizontal pod autoscaling, which can be enabled on service containers to add or remove pod instances
within the cluster. Second, AKS provides cluster autoscaling on the agent node pool to add or remove VM instances
dynamically. Service Fabric has similar offerings, and virtual machine scale sets offer autoscale capabilities for true
IaaS scenarios. In addition, Azure Application Gateway and Azure API Management are PaaS offerings for ingress
services that also support autoscale. Lastly, Azure Functions, Azure Logic Apps, and App Service offer serverless
pay-per-use consumption models that inherently provide autoscale.
Each service documents its autoscale capabilities. Review Autoscale overview for a general discussion on Azure
platform autoscale.

Background jobs
Back in the 1990s and early 2000's, Enterprise Application Integration became a major topic for hooking systems
together, creating and scheduling jobs and tasks. These platforms offered a series of connectors for working with
common formats and protocols like EDI/X12/EDIFACT, SFTP, HL7, SWIFT, and SOAP; and managing long running
processes that may span minutes/hours/days through workflow orchestration. Along with workflow, these
integration tools solved app-to-app and business-to-business integration needs. Not surprisingly, these needs still
exist in the cloud, but a scalable cloud native way of enabling these capabilities is paramount, and that's what the
Logic Apps service enables in Azure. Logic Apps is a serverless consumption (pay-per-use) service that enables a
vast set of out-of-the-box ready-to-use connectors and a long-running workflow engine to quickly enable cloud-
native integration needs. Logic Apps is flexible enough for a plethora of scenarios like running tasks and jobs,
advanced scheduling, and triggering. It includes many of the format and protocol capabilities that existed in
Microsoft's enterprise EAI product, BizTalk Server, and has advanced hosting options that allow it to run within
enterprise-restricted cloud environments. Logic Apps complements and can be combined with all other Azure
services, or it can be used independently.
Like all serverless services, Logic Apps doesn't require VM instances to be purchased, enabled, and scaled up and
down. Instead, Logic Apps scales automatically on serverless PaaS-provided instances, and a consumer pays only
based on usage.
Capacity

Content Delivery Networks (CDN)


With CDNs, you can cache static objects loaded from Azure Blob storage, a web application, or any publicly
accessible web server, by using the closest point of presence (POP) server. CDNs can also accelerate dynamic
content, which cannot be cached, by leveraging various network and routing optimizations. If you are not familiar
with CDNs, review What is a content delivery network on Azure for an introduction to the concept.
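If you want to stand up a CDN endpoint in front of an existing origin, a provisioning sketch with the Az.Cdn PowerShell module might look like the following. The names are placeholders, and the cmdlet parameters shown match the module versions available around the time of writing; newer Az.Cdn releases changed some parameter shapes, so treat this as an outline rather than a copy-paste command.

# Sketch: create a CDN profile and an endpoint in front of an existing origin (placeholder names).
New-AzCdnProfile -ProfileName "myCdnProfile" `
                 -ResourceGroupName "myResourceGroup" `
                 -Location "WestUS" `
                 -Sku Standard_Microsoft

New-AzCdnEndpoint -EndpointName "myCdnEndpoint" `
                  -ProfileName "myCdnProfile" `
                  -ResourceGroupName "myResourceGroup" `
                  -Location "WestUS" `
                  -OriginName "myOrigin" `
                  -OriginHostName "mystorageaccount.blob.core.windows.net"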

Large-scale event management


Work with your business and marketing teams to prepare for large-scale events. Knowing whether there will be
sudden spikes in traffic (the Super Bowl, Black Friday, or marketing pushes) allows you to prepare your
infrastructure ahead of time.

Choosing the right resources


Right sizing your infrastructure to meet the needs of your applications can save you considerably as opposed to a
'one size fits all' solution often employed with on-premises hardware. Identify the needs of your application and
choose the resources that best fit those needs. Review sizes for Windows virtual machines in Azure to learn more.

Choosing metrics for scaling policies


Autoscaling rules that use a detection mechanism based on a measured trigger attribute (such as CPU usage or
queue length) use an aggregated value over time, rather than instantaneous values, to trigger an autoscaling action.
By default, the aggregate is an average of the values. This prevents the system from reacting too quickly, or causing
rapid oscillation. To learn more review autoscaling guidance.

Preemptively scaling based on trends


Preemptively scaling based on historical data can ensure your application has consistent performance, even though
your metrics have not yet indicated the need to scale. If you can predict the load on the application, consider using
scheduled autoscaling, which adds and removes instances to meet anticipated peaks in demand. To learn more
review autoscaling guidance.
Load Testing

Testing at expected peak load


Load test your application at the expected peak load to ensure there are no challenges around performance or
stability when operating at full capacity. Review Changes to load test functionality in Visual Studio and cloud load
testing in Azure DevOps to get an overview of the Microsoft provided load testing tools, along with a
comprehensive list of third-party tools you can leverage as well.

Azure service limits


Different Azure services have soft and hard limits associated with them. Understand the limits for the services you
consume so that you are not blocked if you need to exceed them. Review Azure subscription and service limits,
quotas, and constraints to get a list of the most common Azure limits.

The ResourceLimits sample shows how to query the limits and quotas for commonly used resources.
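For example, current usage against compute and network quotas in a region can be checked with the Az PowerShell module; the region below is only an example.

# Check current usage against subscription quotas in a region (example region).
# Compute quotas (cores, VM counts, and so on).
Get-AzVMUsage -Location "EastUS" |
    Select-Object @{Name = "Quota"; Expression = { $_.Name.LocalizedValue }}, CurrentValue, Limit

# Network quotas (public IP addresses, network interfaces, and so on).
Get-AzNetworkUsage -Location "EastUS" |
    Where-Object { $_.CurrentValue -gt 0 }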

Understanding application behavior under load


Load test your application to understand how it performs at various scales. Review Autoscale best practices to get
additional insight into how to evaluate your application as the amount of traffic sent to it increases.

Measuring typical loads


Knowing the typical and maximum loads on your system helps you understand when something is operating
outside of its designed limits. Monitor traffic to your application to understand user behavior.

Caching
Applications should implement a strategy that helps to ensure that the data in the cache is as up to date as possible
but can also detect and handle situations that arise when the data in the cache has become stale. Review the Cache-
Aside pattern to learn how to load data on demand into a cache from a data store. This can improve performance
and also helps to maintain consistency between data held in the cache and data in the underlying data store.
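The essence of the cache-aside flow can be sketched in a few lines of script; the hashtable stands in for a real cache such as Azure Cache for Redis, and Get-ProductFromDatabase is a hypothetical data-store lookup (a real implementation would also apply an expiration policy so stale entries are refreshed).

# Cache-aside sketch: check the cache first, load from the data store on a miss,
# and populate the cache so later reads are fast. $cache stands in for a real
# cache (for example, Azure Cache for Redis); Get-ProductFromDatabase is hypothetical.
$cache = @{}

function Get-ProductFromDatabase([string] $ProductId) {
    # Placeholder for the authoritative (and slower) data store read.
    Start-Sleep -Milliseconds 200
    return @{ Id = $ProductId; Name = "Sample product"; RetrievedAt = Get-Date }
}

function Get-Product([string] $ProductId) {
    if ($cache.ContainsKey($ProductId)) {
        return $cache[$ProductId]                     # cache hit
    }
    $product = Get-ProductFromDatabase $ProductId     # cache miss: read from the store
    $cache[$ProductId] = $product                     # populate the cache for next time
    return $product
}

Get-Product "42"   # slow: loads from the data store and fills the cache
Get-Product "42"   # fast: served from the cache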

Availability of SKUs
Certain Azure SKUs are only available in certain regions. Understand which SKUs are available in the regions you
operate in so you can plan accordingly. Read about global infrastructure services.

Related sections
Performance tuning scenario: Event streaming with Azure Functions
Monitoring for Scalability

Monitoring for scalability should be part of your overall monitoring strategy that utilizes Azure Monitor. The overall
monitoring strategy should take into consideration not only scalability, but resiliency (infrastructure, application,
and dependent services) and application performance as well. Most services in Azure offer the ability to turn on
both data and management plane logs as well as metrics. For purposes of scalability, looking at the metrics allows
you to decide when to scale up, scale out, scale in, or scale down. The ability to scale dynamically is one of the
biggest values of moving to the cloud.

What are some of the reasons for setting up auto scaling


Are the systems able to handle the current number of requests?
Are the systems meeting the performance NFRs?
Are the systems out of resources (CPU, Memory, I/O)?
Is there an opportunity to have cost savings by not overprovisioning resources?

What is the goal of auto scaling


Scale resources up, out, in, or down to meet current demand.
Decrease costs by scaling down when resources are not needed.
Improve customer experience by offering uninterrupted service even when demand peaks or falls.

How can metrics be used to auto scale


As stated before, most Azure services offer the ability to export logs and metrics to services such as Log Analytics,
and to external services like Splunk through Azure Event Hubs. Furthermore, applications leveraging technologies
such as Application Insights can further enhance the telemetry coming out of the applications.
The metrics coming out of Azure services include CPU and memory utilization, bandwidth information, current
storage utilization, and much more. You can refer to the supported metrics for Azure Monitor.

How do Azure services auto scale


App Services, App Service Environments, and virtual machine scale sets can be configured with autoscaling rules
based on several metrics, including CPU, memory, and bandwidth. These rules can create new instances (scale out)
or remove instances (scale in) of a running service. This capability can be enhanced by generating custom
events from technologies like Application Insights based on other custom metrics.
Azure Kubernetes Service offers both the ability to autoscale pods and to autoscale nodes. Scaling rules
can be based on internal metrics or can leverage metrics from systems like Prometheus.
Other services, such as Application Gateway, can be scaled manually. In this case, it is important to leverage
services such as Log Analytics to raise alerts when the service is no longer able to handle the load.
Monitor metrics and autoscale based on performance and schedule for VMs and virtual machine scale sets.
For container workloads, use the container monitoring solution in Azure Monitor.

The AppServiceAutoscalingSample sample shows autoscaling scenarios based on CPU utilization.


Monitoring best practices
Know the minimum number of instances that should run at any given time.
Determine which metrics are best suited for basing your autoscaling rules on.
Configure the auto scaling rules for those services that include it.
Create alert rules for the services that could be scaled manually.
Monitor your environment to make sure that auto scaling is working as expected. For example, watch out for
scaling events from the telemetry coming out of the management plane.
Monitor web applications using Azure Application Insights.
Monitor network performance.
Consider reviewing, as applicable, Network Performance Monitor, Service Connectivity Monitor, and
ExpressRoute Monitor.
For long-term storage, consider archiving the monitoring data.
Track activities using Azure security and audit logs.

Related Useful Resources


Azure Monitor Data Platform
Auto scaling best practices
Manage log data and workspaces in Azure Monitor.
Azure Diagnostic Logs and Schemas
Performance efficiency checklist

Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an
efficient manner, and is one of the pillars of the Microsoft Azure Well-Architected Framework. Use this checklist to
review your application architecture from a performance efficiency standpoint.

Application design
Partition the workload. Design parts of the process to be discrete and decomposable. Minimize the size of each
part, while following the usual rules for separation of concerns and the single responsibility principle. This allows
the component parts to be distributed in a way that maximizes use of each compute unit (such as a role or
database server). It also makes it easier to scale the application by adding instances of specific resources. For
complex domains, consider adopting a microservices architecture.
Design for scaling. Scaling allows applications to react to variable load by increasing and decreasing the
number of instances of roles, queues, and other services they use. However, the application must be designed
with this in mind. For example, the application and the services it uses must be stateless, to allow requests to be
routed to any instance. This also prevents the addition or removal of specific instances from adversely affecting
current users. You should also implement configuration or autodetection of instances as they are added and
removed, so that code in the application can perform the necessary routing. For example, a web application might
use a set of queues in a round-robin approach to route requests to background services running in worker roles.
The web application must be able to detect changes in the number of queues, to successfully route requests and
balance the load on the application.
Scale as a unit. Plan for additional resources to accommodate growth. For each resource, know the upper
scaling limits, and use sharding or decomposition to go beyond these limits. Determine the scale units for the
system in terms of well-defined sets of resources. This makes applying scale-out operations easier, and less prone
to negative impact on the application through limitations imposed by lack of resources in some part of the overall
system. For example, adding x number of web and worker roles might require y number of additional queues and
z number of storage accounts to handle the additional workload generated by the roles. So a scale unit could
consist of x web and worker roles, y queues, and z storage accounts. Design the application so that it's easily
scaled by adding one or more scale units.
Avoid client affinity. Where possible, ensure that the application does not require affinity. Requests can thus be
routed to any instance, and the number of instances is irrelevant. This also avoids the overhead of storing,
retrieving, and maintaining state information for each user.
Take advantage of platform autoscaling features. Where the hosting platform supports an autoscaling
capability, such as Azure Autoscale, prefer it to custom or third-party mechanisms unless the built-in mechanism
can't fulfill your requirements. Use scheduled scaling rules where possible to ensure resources are available
without a start-up delay, but add reactive autoscaling to the rules where appropriate to cope with unexpected
changes in demand. You can use the autoscaling operations in the classic deployment model to adjust
autoscaling, and to add custom counters to rules. For more information, see Auto-scaling guidance.
Offload CPU-intensive and I/O-intensive tasks as background tasks. If a request to a service is expected
to take a long time to run or absorb considerable resources, offload the processing for this request to a separate
task. Use worker roles or background jobs (depending on the hosting platform) to execute these tasks. This
strategy enables the service to continue receiving further requests and remain responsive. For more information,
see Background jobs guidance.
Distribute the workload for background tasks. Where there are many background tasks, or the tasks
require considerable time or resources, spread the work across multiple compute units (such as worker roles or
background jobs). For one possible solution, see the Competing Consumers pattern.
Consider moving toward a shared-nothing architecture. A shared-nothing architecture uses independent,
self-sufficient nodes that have no single point of contention (such as shared services or storage). In theory, such a
system can scale almost indefinitely. While a fully shared-nothing approach is generally not practical for most
applications, it may provide opportunities to design for better scalability. For example, avoiding server-side
session state and client affinity, and partitioning data, are all steps toward a shared-nothing architecture.

Data management
Use data partitioning. Divide the data across multiple databases and database servers, or design the
application to use data storage services that can provide this partitioning transparently (examples include Azure
SQL Database Elastic Database, and Azure Table storage). This approach can help to maximize performance and
allow easier scaling. There are different partitioning techniques, such as horizontal, vertical, and functional. You
can use a combination of these to achieve maximum benefit from increased query performance, simpler
scalability, more flexible management, better availability, and to match the type of store to the data it will hold.
Also, consider using different types of data store for different types of data, choosing the types based on how well
they are optimized for the specific type of data. This may include using table storage, a document database, or a
column-family data store, instead of, or as well as, a relational database. For more information, see Data
partitioning guidance.
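As an illustration of horizontal (sharded) partitioning, the following C# sketch maps a partition key to one of several databases using a stable hash. The shard connection strings and key format are hypothetical; a production system would more likely use a shard map maintained by tooling such as the Elastic Database client library.

using System;
using System.Collections.Generic;

public static class ShardMap
{
    // Hypothetical per-shard connection strings; in practice these come from configuration.
    private static readonly IReadOnlyList<string> Shards = new[]
    {
        "Server=tcp:shard0.database.windows.net;Database=orders0;",
        "Server=tcp:shard1.database.windows.net;Database=orders1;",
        "Server=tcp:shard2.database.windows.net;Database=orders2;"
    };

    // Map a partition key (for example, a customer ID) to one shard.
    public static string GetConnectionString(string partitionKey)
    {
        int index = (StableHash(partitionKey) & 0x7FFFFFFF) % Shards.Count;
        return Shards[index];
    }

    // Deterministic hash: the same key must always map to the same shard,
    // so the process-randomized string.GetHashCode cannot be used here.
    private static int StableHash(string key)
    {
        unchecked
        {
            int hash = 23;
            foreach (char c in key)
            {
                hash = hash * 31 + c;
            }
            return hash;
        }
    }
}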
Design for eventual consistency. Eventual consistency improves scalability by reducing or removing the time
needed to synchronize related data partitioned across multiple stores. The cost is that data is not always
consistent when it is read, and some write operations may cause conflicts. Eventual consistency is ideal for
situations where the same data is read frequently but written infrequently. For more information, see the Data
Consistency Primer.
Reduce chatty interactions between components and services. Avoid designing interactions in which an
application is required to make multiple calls to a service (each of which returns a small amount of data), rather
than a single call that can return all of the data. Where possible, combine several related operations into a single
request when the call is to a service or component that has noticeable latency. This makes it easier to monitor
performance and optimize complex operations. For example, use stored procedures in databases to encapsulate
complex logic, and reduce the number of round trips and resource locking.
Use queues to level the load for high-velocity data writes. Surges in demand for a service can overwhelm
that service and cause escalating failures. To prevent this, consider implementing the Queue-Based Load Leveling
pattern. Use a queue that acts as a buffer between a task and a service that it invokes. This can smooth
intermittent heavy loads that may otherwise cause the service to fail or the task to time out.
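A minimal sketch of this pattern using the Azure.Storage.Queues client library is shown below; the queue is assumed to already exist, and error handling is omitted for brevity. The front end returns as soon as a message is enqueued, while a background worker drains the queue at a rate the downstream service can sustain.

using System;
using System.Threading.Tasks;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

public static class LoadLeveling
{
    // Producer: the front end enqueues work items and returns immediately.
    // The queue is assumed to be created at deployment or application startup.
    public static Task EnqueueWorkAsync(QueueClient queue, string workItem) =>
        queue.SendMessageAsync(workItem);

    // Consumer: a background worker drains the queue at its own sustainable rate,
    // so bursts of writes don't overwhelm the downstream service.
    public static async Task ProcessWorkAsync(QueueClient queue, Func<string, Task> handler)
    {
        QueueMessage[] messages = await queue.ReceiveMessagesAsync(maxMessages: 16);
        foreach (QueueMessage message in messages)
        {
            await handler(message.MessageText);
            await queue.DeleteMessageAsync(message.MessageId, message.PopReceipt);
        }
    }
}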
Minimize the load on the data store. The data store is commonly a processing bottleneck, a costly resource,
and often not easy to scale out. Where possible, remove logic (such as processing XML documents or JSON
objects) from the data store, and perform processing within the application. For example, instead of passing XML
to the database (other than as an opaque string for storage), serialize or deserialize the XML within the application
layer and pass it in a form that is native to the data store. It's typically much easier to scale out the application
than the data store, so you should attempt to do as much of the compute-intensive processing as possible within
the application.
Minimize the volume of data retrieved. Retrieve only the data you require by specifying columns and using
criteria to select rows. Make use of table-valued parameters and the appropriate isolation level. Use mechanisms
like entity tags to avoid retrieving data unnecessarily.
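For example, entity tags (ETags) let a client skip re-downloading data that hasn't changed. The following sketch issues a conditional GET with HttpClient; the URI and cached values are assumed to be supplied by the caller.

using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class ConditionalFetch
{
    // Pass the ETag returned by a previous response; the server replies 304 Not Modified
    // when the resource hasn't changed, so no body is transferred.
    public static async Task<string> GetIfChangedAsync(
        HttpClient client, string uri, string cachedEtag, string cachedBody)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, uri);
        if (!string.IsNullOrEmpty(cachedEtag))
        {
            request.Headers.IfNoneMatch.Add(new EntityTagHeaderValue(cachedEtag));
        }

        HttpResponseMessage response = await client.SendAsync(request);
        if (response.StatusCode == HttpStatusCode.NotModified)
        {
            return cachedBody; // Reuse the locally cached copy.
        }

        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}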
Aggressively use caching. Use caching wherever possible to reduce the load on resources and services that
generate or deliver data. Caching is typically suited to data that is relatively static, or that requires considerable
processing to obtain. Caching should occur at all levels where appropriate in each layer of the application,
including data access and user interface generation. For more information, see the Caching Guidance.
Handle data growth and retention. The amount of data stored by an application grows over time. This growth
increases storage costs as well as latency when accessing the data, affecting application throughput and
performance. It may be possible to periodically archive some of the old data that is no longer accessed, or move
data that is rarely accessed into long-term storage that is more cost efficient, even if the access latency is higher.
Optimize Data Transfer Objects (DTOs) using an efficient binary format. DTOs are passed between the
layers of an application many times. Minimizing the size reduces the load on resources and the network. However,
balance the savings with the overhead of converting the data to the required format in each location where it is
used. Adopt a format that has the maximum interoperability to enable easy reuse of a component.
Set cache control. Design and configure the application to use output caching or fragment caching where
possible, to minimize processing load.
Enable client-side caching. Web applications should enable cache settings on the content that can be cached.
This is commonly disabled by default. Configure the server to deliver the appropriate cache control headers to
enable caching of content on proxy servers and clients.
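In ASP.NET Core, for example, cache-control headers can be emitted declaratively with the ResponseCache attribute; the controller and duration below are illustrative only.

using Microsoft.AspNetCore.Mvc;

public class CatalogController : Controller
{
    // Instructs clients and proxies to cache this response for 10 minutes;
    // the framework emits the corresponding Cache-Control header.
    [ResponseCache(Duration = 600, Location = ResponseCacheLocation.Any)]
    public IActionResult ProductList()
    {
        return View();
    }
}

Pair this with the response caching middleware (services.AddResponseCaching() and app.UseResponseCaching()) if the server itself should also cache the generated output.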
Use Azure blob storage and the Azure Content Delivery Network to reduce the load on the
application. Consider storing static or relatively static public content, such as images, resources, scripts, and style
sheets, in blob storage. This approach relieves the application of the load caused by dynamically generating this
content for each request. Additionally, consider using the Content Delivery Network to cache this content and
deliver it to clients. Using the Content Delivery Network can improve performance at the client because the
content is delivered from the geographically closest datacenter that contains a Content Delivery Network cache.
For more information, see the Content Delivery Network Guidance.
Optimize and tune SQL queries and indexes. Some T-SQL statements or constructs may have an adverse
effect on performance that can be reduced by optimizing the code in a stored procedure. For example, avoid
converting datetime types to a varchar before comparing with a datetime literal value. Use date/time
comparison functions instead. Lack of appropriate indexes can also slow query execution. If you use an
object/relational mapping framework, understand how it works and how it may affect performance of the data
access layer. For more information, see Query Tuning.
Consider denormalizing data. Data normalization helps to avoid duplication and inconsistency. However,
maintaining multiple indexes, checking for referential integrity, performing multiple accesses to small chunks of
data, and joining tables to reassemble the data imposes an overhead that can affect performance. Consider if
some additional storage volume and duplication is acceptable in order to reduce the load on the data store. Also
consider if the application itself (which is typically easier to scale) can be relied on to take over tasks such as
managing referential integrity in order to reduce the load on the data store. For more information, see Data
partitioning guidance.

Implementation
Review the performance antipatterns. See Performance antipatterns for cloud applications for common
practices that are likely to cause scalability problems when an application is under pressure.
Use asynchronous calls. Use asynchronous code wherever possible when accessing resources or services that
may be limited by I/O or network bandwidth, or that have a noticeable latency, in order to avoid locking the
calling thread.
Avoid locking resources, and use an optimistic approach instead. Never lock access to resources such as
storage or other services that have noticeable latency, because this is a primary cause of poor performance.
Always use optimistic approaches to managing concurrent operations, such as writing to storage. Use features of
the storage layer to manage conflicts. In distributed applications, data may be only eventually consistent.
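As one example of an optimistic approach, Azure Blob Storage supports ETag-based preconditions through the Azure.Storage.Blobs client library. The sketch below reads a blob, applies a change, and writes it back only if no other writer has modified it in the meantime; the transform delegate is hypothetical.

using System;
using System.Threading.Tasks;
using Azure;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

public static class OptimisticWrite
{
    // Returns false if another writer won the race; the caller re-reads and retries
    // instead of holding a lock on the blob.
    public static async Task<bool> TryUpdateAsync(BlobClient blob, Func<string, string> transform)
    {
        BlobDownloadResult current = await blob.DownloadContentAsync();
        string updated = transform(current.Content.ToString());

        var options = new BlobUploadOptions
        {
            Conditions = new BlobRequestConditions { IfMatch = current.Details.ETag }
        };

        try
        {
            await blob.UploadAsync(BinaryData.FromString(updated), options);
            return true;
        }
        catch (RequestFailedException ex) when (ex.Status == 412) // Precondition Failed
        {
            return false;
        }
    }
}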
Compress highly compressible data over high-latency, low-bandwidth networks. In the majority of
cases in a web application, the largest volume of data generated by the application and passed over the network
is HTTP responses to client requests. HTTP compression can reduce this considerably, especially for static content.
This can reduce cost as well as reducing the load on the network, though compressing dynamic content does
apply a fractionally higher load on the server. In other, more generalized environments, data compression can
reduce the volume of data transmitted and minimize transfer time and costs, but the compression and
decompression processes incur overhead. As such, compression should only be used when there is a
demonstrable gain in performance. Other serialization methods, such as JSON or binary encodings, may reduce
the payload size while having less impact on performance, whereas XML is likely to increase it.
Minimize the time that connections and resources are in use. Maintain connections and resources only
for as long as you need to use them. For example, open connections as late as possible, and allow them to be
returned to the connection pool as soon as possible. Acquire resources as late as possible, and dispose of them as
soon as possible.
Minimize the number of connections required. Service connections absorb resources. Limit the number
that are required and ensure that existing connections are reused whenever possible. For example, after
performing authentication, use impersonation where appropriate to run code as a specific identity. This can help
to make best use of the connection pool by reusing connections.

NOTE
APIs for some services automatically reuse connections, provided service-specific guidelines are followed. It's important that
you understand the conditions that enable connection reuse for each service that your application uses.

Send requests in batches to optimize network use. For example, send and read messages in batches when
accessing a queue, and perform multiple reads or writes as a batch when accessing storage or a cache. This can
help to maximize efficiency of the services and data stores by reducing the number of calls across the network.
Avoid a requirement to store server-side session state where possible. Server-side session state
management typically requires client affinity (that is, routing each request to the same server instance), which
affects the ability of the system to scale. Ideally, you should design clients to be stateless with respect to the
servers that they use. However, if the application must maintain session state, store sensitive data or large
volumes of per-client data in a distributed server-side cache that all instances of the application can access.
Optimize table storage schemas. When using table stores that require the table and column names to be
passed and processed with every query, such as Azure table storage, consider using shorter names to reduce this
overhead. However, do not sacrifice readability or manageability by using overly compact names.
Create resource dependencies during deployment or at application startup. Avoid repeated calls to
methods that test the existence of a resource and then create the resource if it does not exist. Methods such as
CloudTable.CreateIfNotExists and CloudQueue.CreateIfNotExists in the Azure Storage Client Library follow this
pattern. These methods can impose considerable overhead if they are invoked before each access to a storage
table or storage queue. Instead:
Create the required resources when the application is deployed, or when it first starts (a single call to
CreateIfNotExists for each resource in the startup code for a web or worker role is acceptable). However, be
sure to handle exceptions that may arise if your code attempts to access a resource that doesn't exist. In these
situations, you should log the exception, and possibly alert an operator that a resource is missing.
Under some circumstances, it may be appropriate to create the missing resource as part of the exception
handling code. But you should adopt this approach with caution as the non-existence of the resource might be
indicative of a programming error (a misspelled resource name for example), or some other infrastructure-
level issue.
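The text above refers to the older CloudTable and CloudQueue APIs; a comparable sketch with the current Azure.Storage.Queues library, called once from startup code, might look like this (the queue names are hypothetical):

using System.Threading.Tasks;
using Azure.Storage.Queues;

public static class StartupProvisioning
{
    // Call once when the application starts, not before every queue access.
    public static async Task EnsureQueuesExistAsync(string connectionString)
    {
        string[] queueNames = { "orders", "notifications" }; // hypothetical names

        foreach (string name in queueNames)
        {
            var queue = new QueueClient(connectionString, name);
            await queue.CreateIfNotExistsAsync();
        }
    }
}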
Use lightweight frameworks. Carefully choose the APIs and frameworks you use to minimize resource usage,
execution time, and overall load on the application. For example, using Web API to handle service requests can
reduce the application footprint and increase execution speed, but it may not be suitable for advanced scenarios
where the additional capabilities of Windows Communication Foundation are required.
Consider minimizing the number of service accounts. For example, use a specific account to access
resources or services that impose a limit on connections, or perform better where fewer connections are
maintained. This approach is common for services such as databases, but it can affect the ability to accurately
audit operations due to the impersonation of the original user.
Carry out performance profiling and load testing during development, as part of test routines, and before
final release to ensure the application performs and scales as required. This testing should occur on the same type
of hardware as the production platform, and with the same types and quantities of data and user load as it will
encounter in production. For more information, see Testing the performance of a cloud service.
Overview of the reliability pillar

Building a reliable application in the cloud is different from traditional application development. While
historically you may have purchased levels of redundant higher-end hardware to minimize the chance of an
entire application platform failing, in the cloud, we acknowledge up front that failures will happen. Instead of
trying to prevent failures altogether, the goal is to minimize the effects of a single failing component.
To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, see the
Microsoft Azure Well-Architected Review.
Reliable applications are:
Resilient and recover gracefully from failures, and they continue to function with minimal downtime and
data loss before full recovery.
Highly available (HA) and run as designed in a healthy state with no significant downtime.
Understanding how these elements work together — and how they affect cost — is essential to building a
reliable application. It can help you determine how much downtime is acceptable, the potential cost to your
business, and which functions are necessary during a recovery.
This article provides a brief overview of building reliability into each step of the Azure application design process.
Each section includes a link to an in-depth article on how to integrate reliability into that specific step in the
process. If you're looking for reliability considerations for individual Azure services, review the Resiliency
checklist for specific Azure services.

Build for reliability


This section describes six steps for building a reliable Azure application. Each step links to a section that further
defines the process and terms.
1. Define requirements. Develop availability and recovery requirements based on decomposed workloads
and business needs.
2. Use architectural best practices. Follow proven practices, identify possible failure points in the
architecture, and determine how the application will respond to failure.
3. Test with simulations and forced failovers. Simulate faults, trigger forced failovers, and test detection
and recovery from these failures.
4. Deploy the application consistently. Release to production using reliable and repeatable processes.
5. Monitor application health. Detect failures, monitor indicators of potential failures, and gauge the health
of your applications.
6. Respond to failures and disasters. Identify when a failure occurs, and determine how to address it based
on established strategies.

Define requirements
Identify your business needs, and build your reliability plan to address them. Consider the following:
Identify workloads and usage. A workload is a distinct capability or task that is logically separated
from other tasks, in terms of business logic and data storage requirements. Each workload has different
requirements for availability, scalability, data consistency, and disaster recovery.
Plan for usage patterns. Usage patterns also play a role in requirements. Identify differences in
requirements during critical and non-critical periods. For example, a tax-filing application can't fail during
a filing deadline. To ensure uptime, plan redundancy across several regions in case one fails. Conversely, to
minimize costs during non-critical periods, you can run your application in a single region.
Identify critical components and paths. Not all components of your system might be as important as
others. For example, you might have an optional component that adds incremental value, but that the
workload can run without if necessary. Understand where these components are, and conversely, where
the critical parts of your workload are. This will help to scope your availability and reliability metrics and to
plan your recovery strategies to prioritize the highest-importance components.
Establish availability metrics — mean time to recovery (MTTR) and mean time between
failures (MTBF). MTTR is the average time it takes to restore a component after a failure. MTBF is how
long a component can reasonably expect to last between outages. Use these measures to determine
where to add redundancy and to determine service-level agreements (SLAs) for customers.
Establish recovery metrics — recovery time objective (RTO) and recovery point objective
(RPO). RTO is the maximum acceptable time an application can be unavailable after an incident. RPO is
the maximum duration of data loss that is acceptable during a disaster. To derive these values, conduct a
risk assessment and make sure you understand the cost and risk of downtime or data loss in your
organization.

NOTE
If the MTTR of any critical component in a highly available setup exceeds the system RTO, a failure in the system
might cause an unacceptable business disruption. That is, you can't restore the system within the defined RTO.

Determine workload availability targets. To ensure that application architecture meets your business
requirements, define target SLAs for each workload. Account for the cost and complexity of meeting
availability requirements, in addition to application dependencies.
Understand service-level agreements. In Azure, the SLA describes the Microsoft commitments for
uptime and connectivity. If the SLA for a particular service is 99.9 percent, you should expect the service to
be available 99.9 percent of the time.
Define your own target SLAs for each workload in your solution, so you can determine whether the
architecture meets the business requirements. For example, if a workload requires 99.99 percent uptime
but depends on a service with a 99.9 percent SLA, that service can't be a single point of failure in the
system.
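As a small worked example with illustrative SLA figures, the composite availability of two services used in series is the product of their individual SLAs:

// Both services must be available for the workload to be available.
double appServiceSla = 0.9995;   // illustrative: 99.95%
double sqlDatabaseSla = 0.999;   // illustrative: 99.9%

double compositeSla = appServiceSla * sqlDatabaseSla;   // ~0.9985, or about 99.85%

// A 99.99% workload target cannot be met by this chain alone; redundancy
// (for example, a fallback path or a second region) would be required.
System.Console.WriteLine($"Composite SLA: {compositeSla:P3}");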
For more information about developing requirements for reliable applications, see Application design for
resiliency.

Use architectural best practices


During the architectural phase, focus on implementing practices that meet your business requirements, identify
failure points, and minimize the scope of failures.
Perform a failure mode analysis (FMA). FMA builds resiliency into an application early in the design
stage. It helps you identify the types of failures your application might experience, the potential effects of
each, and possible recovery strategies.
Create a redundancy plan. The level of redundancy required for each workload depends on your
business needs and factors into the overall cost of your application.
Design for scalability. A cloud application must be able to scale to accommodate changes in usage.
Begin with discrete components, and design the application to respond automatically to load changes
whenever possible. Keep scaling limits in mind during design so you can expand easily in the future.
Plan for subscription and service requirements. You might need additional subscriptions to
provision enough resources to meet your business requirements for storage, connections, throughput,
and more.
Use load-balancing to distribute requests. Load-balancing distributes your application's requests to
healthy service instances by removing unhealthy instances from rotation.
Implement resiliency strategies. Resiliency is the ability of a system to recover from failures and
continue to function. Implement resiliency design patterns, such as isolating critical resources, using
compensating transactions, and performing asynchronous operations whenever possible.
Build availability requirements into your design. Availability is the proportion of time your system
is functional and working. Take steps to ensure that application availability conforms to your service-level
agreement. For example, avoid single points of failure, decompose workloads by service-level objective,
and throttle high-volume users.
Manage your data. How you store, back up, and replicate data is critical.
Choose replication methods for your application data. Your application data is stored in various
data stores and might have different availability requirements. Evaluate the replication methods and
locations for each type of data store to ensure that they satisfy your requirements.
Document and test your failover and failback processes. Clearly document instructions to fail
over to a new data store, and test them regularly to make sure they are accurate and easy to follow.
Protect your data. Back up and validate data regularly, and make sure no single user account has
access to both production and backup data.
Plan for data recovery. Make sure that your backup and replication strategy provides for data
recovery times that meet your service-level requirements. Account for all types of data your
application uses, including reference data and databases.

Azure service dependencies


Microsoft Azure services are available globally to drive your cloud operations at an optimal level. You can choose
the best region for your needs based on technical and regulatory considerations: service capabilities, data
residency, compliance requirements, and latency.
Azure services deployed to Azure regions are listed on the Azure global infrastructure products page. To better
understand regions and Availability Zones in Azure, see Regions and Availability Zones in Azure.
Azure services are built for resiliency, including high availability and disaster recovery. There are no services that
are dependent on a single logical data center (to avoid single points of failure). Non-regional services listed on
Azure global infrastructure products are services for which there is no dependency on a specific Azure region.
Non-regional services are deployed to two or more regions and if there is a regional failure, the instance of the
service in another region continues servicing customers. Certain non-regional services enable customers to
specify the region where the underlying virtual machine (VM) on which the service runs will be deployed. For
example, Windows Virtual Desktop enables customers to specify the region location where the VM resides. All
Azure services that store customer data allow the customer to specify the specific regions in which their data will
be stored. The exception is Azure Active Directory (Azure AD), which has geo placement (such as Europe or North
America). For more information about data storage residency, see the Data residency map.

Test with simulations and forced failovers


Testing for reliability requires measuring how the end-to-end workload performs under failure conditions that
only occur intermittently.
Test for common failure scenarios by triggering actual failures or by simulating them. Use fault
injection testing to test common scenarios (including combinations of failures) and recovery time.
Identify failures that occur only under load. Test for peak load, using production data or synthetic data
that is as close to production data as possible, to see how the application behaves under real-world
conditions.
Run disaster recovery drills. Have a disaster recovery plan in place, and test it periodically to make sure it
works.
Perform failover and failback testing. Ensure that your application's dependent services fail over and fail
back in the correct order.
Run simulation tests. Testing real-life scenarios can highlight issues that need to be addressed. Scenarios
should be controllable and non-disruptive to the business. Inform management of simulation testing plans.
Test health probes. Configure health probes for load balancers and traffic managers to check critical system
components. Test them to make sure that they respond appropriately.
Test monitoring systems. Be sure that monitoring systems are reliably reporting critical information and
accurate data to help identify potential failures.
Include third-party services in test scenarios. Test possible points of failure due to third-party service
disruption, in addition to recovery.
Testing is an iterative process. Test the application, measure the outcome, analyze and address any failures, and
repeat the process.
For more information about testing for application reliability, see Testing Azure applications for resiliency and
availability.

Deploy the application consistently


Deployment includes provisioning Azure resources, deploying application code, and applying configuration
settings. An update may involve all three tasks or a subset of them.
After an application is deployed to production, updates are a possible source of errors. Minimize errors with
predictable and repeatable deployment processes.
Automate your application deployment process. Automate as many processes as possible.
Design your release process to maximize availability. If your release process requires services to go
offline during deployment, your application is unavailable until they come back online. Take advantage of
platform staging and production features. Use blue-green or canary releases to deploy updates, so if a failure
occurs, you can quickly roll back the update.
Have a rollback plan for deployment. Design a rollback process to return to a last known good version
and to minimize downtime if a deployment fails.
Log and audit deployments. If you use staged deployment techniques, more than one version of your
application is running in production. Implement a robust logging strategy to capture as much version-specific
information as possible.
Document the application release process. Clearly define and document your release process, and
ensure that it's available to the entire operations team.
For more information about application reliability and deployment, see Deploying Azure applications for
resiliency and availability.

Monitor application health


Implement best practices for monitoring and alerts in your application so you can detect failures and alert an
operator to fix them.
Implement health probes and check functions. Run them regularly from outside the application to
identify degradation of application health and performance.
Check long-running workflows. Catching issues early can minimize the need to roll back the entire
workflow or to execute multiple compensating transactions.
Maintain application logs.
Log applications in production and at service boundaries.
Use semantic and asynchronous logging.
Separate application logs from audit logs.
Measure remote call statistics, and share the data with the application team. To give your
operations team an instantaneous view into application health, summarize remote call metrics, such as
latency, throughput, and errors at the 99th and 95th percentiles. Perform statistical analysis on the metrics to
uncover errors that occur within each percentile.
Track transient exceptions and retries over an appropriate time frame. A trend of increasing
exceptions over time indicates that the service is having an issue and may fail.
Set up an early warning system. Identify the key performance indicators (KPIs) of an application's
health, such as transient exceptions and remote call latency, and set appropriate threshold values for each
of them. Send an alert to operations when the threshold value is reached.
Operate within Azure subscription limits. Azure subscriptions have limits on certain resource types,
such as the number of resource groups, cores, and storage accounts. Watch your use of resource types.
Monitor third-party services. Log your invocations and correlate them with your application's health
and diagnostic logging using a unique identifier.
Train multiple operators to monitor the application and to perform manual recovery steps.
Make sure there is always at least one trained operator active.
For more information about monitoring for application reliability, see Monitoring Azure application health.

Respond to failures and disasters


Create a recovery plan, and make sure that it covers data restoration, network outages, dependent service
failures, and region-wide service disruptions. Consider your VMs, storage, databases, and other Azure platform
services in your recovery strategy.
Plan for Azure support interactions. Before the need arises, establish a process for contacting Azure
support.
Document and test your disaster recovery plan. Write a disaster recovery plan that reflects the
business impact of application failures. Automate the recovery process as much as possible, and document
any manual steps. Regularly test your disaster recovery process to validate and improve the plan.
Fail over manually when required. Some systems can't fail over automatically and require a manual
failover. If an application fails over to a secondary region, perform an operational readiness test. Verify that
the primary region is healthy and ready to receive traffic again before failing back. Determine what the
reduced application functionality is and how the app informs users of temporary problems.
Prepare for application failure. Prepare for a range of failures, including faults that are handled
automatically, those that result in reduced functionality, and those that cause the application to become
unavailable. The application should inform users of temporary issues.
Recover from data corruption. If a failure happens in a data store, check for data inconsistencies when the
store becomes available again, especially if the data was replicated. Restore corrupt data from a backup.
Recover from a network outage. You might be able to use cached data to run locally with reduced
application functionality. If not, consider application downtime or fail over to another region. Store your data
in an alternate location until connectivity is restored.
Recover from a dependent service failure. Determine which functionality is still available and how the
application should respond.
Recover from a region-wide service disruption. Region-wide service disruptions are uncommon, but
you should have a strategy to address them, especially for critical applications. You might be able to redeploy
the application to another region or redistribute traffic.
For more information about responding to failures and disaster recovery, see Failure and disaster recovery for
Azure applications.
Designing resilient Azure applications

Building resiliency (recovering from failures) and availability (running in a healthy state without significant
downtime) into your apps begins with gathering requirements. For example, how much downtime is acceptable?
How much does potential downtime cost your business? What are your customer's availability requirements?
How much do you invest in making your application highly available? What is the risk versus the cost?

Determine subscription and service requirements


Choose the right subscription and service features for your app by working through these tasks:
Evaluate requirements against Azure subscription and service limits. Azure subscriptions have limits
on certain resource types, such as number of resource groups, cores, and storage accounts. If your application
requirements exceed Azure subscription limits, create another Azure subscription and provision sufficient
resources there. Individual Azure services have consumption limits — for example, limits on storage,
throughput, number of connections, requests per second, and other metrics. Your application will fail if it
attempts to use resources beyond these limits, resulting in service throttling and possible downtime for
affected users. Depending on the specific service and your application requirements, you can often avoid these
limits by scaling up (for example, choosing another pricing tier) or scaling out (such as adding new instances).
Determine how many storage accounts you need. Azure allows a specific number of storage accounts
per subscription. For more information, see Azure subscription and service limits, quotas, and constraints.
Select the right service tier for Azure SQL Database. If your application uses Azure SQL Database, select
the appropriate service tier. If the tier cannot handle your application's database transaction unit (DTU)
requirements, your data use will be throttled. For more information on selecting the correct service plan, see
SQL Database options and performance: Understand what's available in each service tier.
Provision sufficient request units (RUs) in Azure Cosmos DB. With Azure Cosmos DB, you pay for the
throughput you provision and the storage you consume on an hourly basis. The cost of all database operations
is normalized as RUs, which abstracts the system resources such as CPU, IOPS, and memory. For more
information, see Request Units in Azure Cosmos DB.

Resiliency strategies
This section describes some common resiliency strategies. Most of these strategies are not limited to a particular
technology. The descriptions summarize the general idea behind each technique and include links to further
reading.
Implement resiliency patterns for remote operations, where appropriate. If your application depends
on communication between remote services, follow design patterns for dealing with transient failures.
Retry transient failures. These can be caused by momentary loss of network connectivity, a dropped
database connection, or a timeout when a service is busy. Often, a transient failure can be resolved by
retrying the request.
For many Azure services, the client software development kit (SDK) implements automatic retries in a
way that is transparent to the caller. See Retry guidance for specific services.
Or implement the Retry pattern to help the application handle anticipated, temporary failures
transparently when it tries to connect to a service or network resource.

The RetryPatternSample shows an implementation of the retry pattern.
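The linked RetryPatternSample isn't reproduced here; as an illustration only, a minimal retry policy using the Polly library (with an illustrative retry count and back-off) might look like the following:

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.Retry;

public static class TransientRetry
{
    // Retry up to three times with exponential back-off for transient HTTP failures.
    private static readonly AsyncRetryPolicy RetryPolicy = Policy
        .Handle<HttpRequestException>()
        .WaitAndRetryAsync(
            retryCount: 3,
            sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

    public static Task<string> GetWithRetryAsync(HttpClient client, string uri)
    {
        return RetryPolicy.ExecuteAsync(() => client.GetStringAsync(uri));
    }
}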


Use a circuit breaker to handle faults that might take a variable amount of time to fix. The Circuit
Breaker pattern can prevent an application from repeatedly trying an operation that is likely to fail. The
circuit breaker wraps calls to a service and tracks the number of recent failures. If the failure count exceeds
a threshold, the circuit breaker starts returning an error code without calling the service. This gives the
service time to recover and helps avoid cascading failures.
Isolate critical resources. Failures in one subsystem can sometimes cascade, resulting in failures in
other parts of the application. This can happen if a failure prevents resources such as threads or sockets
from being freed, leading to resource exhaustion. To avoid this, you can partition a system into isolated
groups so that a failure in one partition does not bring down the entire system.
Here are some examples of this technique, which is sometimes called the Bulkhead pattern:
Partition a database (for example, by tenant), and assign a separate pool of web server instances for
each partition.
Use separate thread pools to isolate calls to different services. This helps to prevent cascading
failures if one of the services fails. For an example, see the Netflix Hystrix library.
Use containers to limit the resources available to a particular subsystem.

Apply compensating transactions. A compensating transaction is a transaction that undoes the effects
of another completed transaction. In a distributed system, it can be difficult to achieve strong transactional
consistency. Compensating transactions help to achieve consistency by using a series of smaller, individual
transactions that can be undone at each step. For example, to book a trip, a customer might reserve a car, a
hotel room, and a flight. If one of these steps fails, the entire operation fails. Instead of trying to use a single
distributed transaction for the entire operation, you can define a compensating transaction for each step.
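A minimal sketch of this trip-booking example is shown below: each completed step records a compensating action, and the recorded actions run in reverse order if a later step fails. The booking operations themselves are hypothetical stubs.

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public class TripBooking
{
    private readonly Stack<Func<Task>> _compensations = new Stack<Func<Task>>();

    public async Task BookAsync()
    {
        try
        {
            await BookCarAsync();
            _compensations.Push(CancelCarAsync);

            await BookHotelAsync();
            _compensations.Push(CancelHotelAsync);

            await BookFlightAsync();
            _compensations.Push(CancelFlightAsync);
        }
        catch
        {
            // Undo whatever has already completed, most recent step first.
            while (_compensations.Count > 0)
            {
                await _compensations.Pop()();
            }
            throw;
        }
    }

    // Hypothetical operations against downstream services.
    private Task BookCarAsync() => Task.CompletedTask;
    private Task CancelCarAsync() => Task.CompletedTask;
    private Task BookHotelAsync() => Task.CompletedTask;
    private Task CancelHotelAsync() => Task.CompletedTask;
    private Task BookFlightAsync() => Task.CompletedTask;
    private Task CancelFlightAsync() => Task.CompletedTask;
}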
Implement asynchronous operations, whenever possible. Synchronous operations can monopolize
resources and block other operations while the caller waits for the process to complete. Design each part
of your application to allow for asynchronous operations, whenever possible. For more information on
how to implement asynchronous programming in C#, see Asynchronous Programming.

Plan for usage patterns


Identify differences in requirements during critical and non-critical periods. Are there certain critical periods when
the system must be available? For example, a tax-filing application can't fail during a filing deadline and a video
streaming service shouldn't lag during a live event. In these situations, weigh the cost against the risk.
To ensure uptime and meet service-level agreements (SLAs) in critical periods, plan redundancy across several
regions in case one fails, even if it costs more.
Conversely, during non-critical periods, run your application in a single region to minimize costs.
In some cases, you can mitigate additional expenses by using modern serverless techniques that have
consumption-based billing.

Identify distinct workloads


Cloud solutions typically consist of multiple application workloads. A workload is a distinct capability or task that
is logically separated from other tasks in terms of business logic and data storage requirements. For example, an
e-commerce app might have the following workloads:
Browse and search a product catalog.
Create and track orders.
View recommendations.
Each workload has different requirements for availability, scalability, data consistency, and disaster recovery. Make
your business decisions by balancing cost versus risk for each workload.
Also decompose workloads by service-level objective. If a service is composed of critical and less-critical
workloads, manage them differently and specify the service features and number of instances needed to meet
their availability requirements.

Managing third-party services


If your application has dependencies on third-party services, identify how these services can fail and what effect
failures will have on your application.
A third-party service might not include monitoring and diagnostics. Log calls to these services and correlate them
with your application's health and diagnostic logging using a unique identifier. For more information on proven
practices for monitoring and diagnostics, see Monitoring and diagnostics guidance.
See the Health Endpoint Monitoring pattern for a solution to track this with code samples.

Monitoring third-party services


If your application has dependencies on third-party services, identify where and how these services can fail and
what effect those failures will have on your application. Keep in mind the service-level agreement (SLA) for the
third-party service and the effect it might have on your disaster recovery plan.
A third-party service might not provide monitoring and diagnostics capabilities, so it's important to log your
invocations of them and to correlate them with your application's health and diagnostic logging using a unique
identifier. For more information on proven practices for monitoring and diagnostics, see Monitoring and
diagnostics guidance.

Load balancing
Proper load-balancing allows you to meet availability requirements and to minimize costs associated with
availability.
Use load-balancing to distribute requests. Load-balancing distributes your application's requests to
healthy service instances by removing unhealthy instances from rotation. If your service uses Azure App
Service or Azure Cloud Services, it's already load-balanced for you. However, if your application uses Azure
VMs, you need to provision a load-balancer. For more information, see What is Azure Load Balancer?
You can use Azure Load Balancer to:
Load-balance incoming Internet traffic to your VMs. This configuration is known as a public Load
Balancer.
Load-balance traffic across VMs inside a virtual network. You can also reach a Load Balancer front end
from an on-premises network in a hybrid scenario. Both scenarios use a configuration that is known as
an internal Load Balancer.
Port forward traffic to an itemized port on specific VMs with inbound network address translation (NAT)
rules.
Provide outbound connectivity for VMs inside your virtual network by using a public Load Balancer.
Balance loads across regions with a traffic manager, such as Azure Traffic Manager. To load-
balance traffic across regions requires a traffic management solution, and Azure provides Traffic Manager.
You can also take advantage of third-party services that provide similar traffic-management capabilities.

Failure mode analysis


Failure mode analysis (FMA) builds resiliency into a system by identifying possible failure points and defining how
the application responds to those failures. The FMA should be part of the architecture and design phases, so
failure recovery is built into the system from the beginning. The goals of an FMA are to:
Determine what types of failures an application might experience and how the application detects those
failures.
Capture the potential effects of each type of failure and determine how the app responds.
Plan for logging and monitoring the failure and identify recovery strategies.
Here are some examples of failure modes and detection strategies for a specific failure point — a call to an
external web service:

FAILURE MODE                 DETECTION STRATEGY

Service is unavailable       HTTP 5xx

Throttling                   HTTP 429 (Too Many Requests)

Authentication               HTTP 401 (Unauthorized)

Slow response                Request times out

For more information about the FMA process, with specific recommendations for Azure, see Failure mode
analysis.
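One way to turn the detection strategies in the table above into code is to classify HTTP responses from the external service; the enum and mapping below are illustrative and not part of any Azure SDK.

using System.Net;
using System.Net.Http;

public enum FailureMode
{
    None,
    ServiceUnavailable,
    Throttling,
    Authentication,
    SlowResponse
}

public static class FailureDetection
{
    // Classify a response from an external web service into the failure modes above.
    public static FailureMode Classify(HttpResponseMessage response)
    {
        if (response.IsSuccessStatusCode)
        {
            return FailureMode.None;
        }

        return response.StatusCode switch
        {
            HttpStatusCode.TooManyRequests => FailureMode.Throttling,              // HTTP 429
            HttpStatusCode.Unauthorized => FailureMode.Authentication,             // HTTP 401
            _ when (int)response.StatusCode >= 500 => FailureMode.ServiceUnavailable, // HTTP 5xx
            _ => FailureMode.None
        };
    }

    // A timed-out request never produces a response; callers typically catch
    // TaskCanceledException from HttpClient and record FailureMode.SlowResponse.
}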

Related samples are here.

Operating in multiple regions


If your application is deployed to a single region, in the rare event the entire region becomes unavailable, your
application will also be unavailable. This may be unacceptable under the terms of your application's SLA. If so,
consider deploying your application and its services across multiple regions. A multi-region deployment can use
an active-active pattern (distributing requests across multiple active instances) or an active-passive pattern
(keeping a "warm" instance in reserve, in case the primary instance fails).
Many failures are manageable within the same Azure region. However, in the unlikely event of a region-wide
service disruption, the locally redundant copies of your data aren't available. If you've enabled geo-replication,
there are three additional copies of your blobs and tables in a different region. If Microsoft declares the region
lost, Azure remaps all the DNS entries to the secondary region.
NOTE
This process occurs only for region-wide service disruptions and is not within your control. Consider using Azure Site
Recovery to achieve better RPO and RTO. Using Site Recovery, you decide what is an acceptable outage and when to fail
over to the replicated VMs.

NOTE
The selection of the Resource Group location is important. In the event of a regional outage, you will be unable to control
resources inside that Resource Group, regardless of what region those resources are actually in (i.e., the resources in the
other region(s) will continue to function, but management plane operations will be unavailable).

Your response to a region-wide service disruption depends on your deployment and your disaster recovery plan.
As a cost-control strategy, for non-critical applications that don't require a guaranteed recovery time, it might
make sense to redeploy to a different region.
For applications that are hosted in another region with deployed roles but don't distribute traffic across
regions (active/passive deployment), switch to the secondary hosted service in the alternate region.
For applications that have a full-scale secondary deployment in another region (active/active deployment),
route traffic to that region.
To learn more about recovering from a region-wide service disruption, see Recover from a region-wide service
disruption.
VM recovery
For critical apps, plan for recovering VMs in the event of a region-wide service disruption.
Use Azure Backup or another backup method to create cross-region backups that are application consistent.
(Replication of the Backup vault must be configured at the time of creation.)
Use Site Recovery to replicate across regions for one-click application failover and failover testing.
Use Traffic Manager to automate user traffic failover to another region.
To learn more, see Recover from a region-wide service disruption, Virtual machines.
Storage recovery
To protect your storage in the event of a region-wide service disruption:
Use geo-redundant storage.
Know where your storage is geo-replicated. This affects where you deploy other instances of your data that
require regional affinity with your storage.
Check data for consistency after failover and, if necessary, restore from a backup.
To learn more, see Designing highly available applications using RA-GRS.
SQL Database and SQL Server
Azure SQL Database provides two types of recovery:
Use geo-restore to restore a database from a backup copy in another region. For more information, see
Recover an Azure SQL database using automated database backups.
Use active geo-replication to fail over to a secondary database. For more information, see Creating and using
active geo-replication.
For SQL Server running on VMs, see High availability and disaster recovery for SQL Server in Azure Virtual
Machines.
Error handling for resilient applications in Azure

Ensuring that your application can recover from errors is critical when working in a distributed system.

Transient fault handling


Track the number of transient exceptions and retries over time to uncover issues or failures in your application's
retry logic. A trend of increasing exceptions over time may indicate that the service is having an issue and may fail.
For more information, see Retry service specific guidance.
Use the Retry pattern, paying particular attention to issues and considerations. Avoid overwhelming dependent
services by implementing the Circuit Breaker pattern. Review and incorporate additional best practices guidance for
Transient fault handling. When calling systems that implement the Throttling pattern, ensure that your retries
are not counterproductive.

A reference implementation is available here. It uses Polly and IHttpClientBuilder to implement the Circuit
Breaker pattern.
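The linked reference implementation isn't reproduced here; as a sketch only, wiring a Polly circuit breaker into IHttpClientBuilder (with an assumed client name, base address, and thresholds) might look like this:

using System;
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;
using Polly;
using Polly.Extensions.Http;

public static class ResilientHttpClientSetup
{
    // Registers a named HttpClient whose outgoing calls pass through a circuit breaker:
    // after 5 consecutive transient failures the circuit opens for 30 seconds and
    // calls fail fast instead of piling onto a struggling dependency.
    public static void ConfigureServices(IServiceCollection services)
    {
        services.AddHttpClient("backend", client =>
            {
                client.BaseAddress = new Uri("https://backend.example.com/"); // hypothetical
            })
            .AddPolicyHandler(HttpPolicyExtensions
                .HandleTransientHttpError()
                .CircuitBreakerAsync(
                    handledEventsAllowedBeforeBreaking: 5,
                    durationOfBreak: TimeSpan.FromSeconds(30)));
    }
}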

Request timeouts
When making a service call or a database call, ensure that appropriate request timeouts are set. Database
connection timeouts are typically set to 30 seconds. Use the guidance in Troubleshoot, diagnose, and prevent SQL
connection errors and transient errors for SQL Database.
Leverage design patterns that encapsulate robust timeout strategies, such as the Choreography pattern or the
Compensating Transaction pattern.
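A brief sketch of setting these timeouts in .NET follows, assuming the Microsoft.Data.SqlClient package; the timeout values are illustrative and should be tuned to the workload.

using System;
using System.Net.Http;
using Microsoft.Data.SqlClient;

public static class TimeoutConfiguration
{
    // HTTP calls: fail fast instead of waiting for the default 100-second timeout.
    public static HttpClient CreateHttpClient() =>
        new HttpClient { Timeout = TimeSpan.FromSeconds(10) };

    // Database calls: "Connect Timeout=30" in the connection string bounds connection
    // attempts, and CommandTimeout bounds each query's execution time.
    public static SqlCommand CreateCommand(string connectionString, string query)
    {
        var connection = new SqlConnection(connectionString); // e.g. "...;Connect Timeout=30"
        return new SqlCommand(query, connection) { CommandTimeout = 30 };
    }
}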

A reference implementation is available on GitHub.

Cascading Failures
The Circuit Breaker pattern provides stability while the system recovers from a failure and minimizes the impact on
performance. It can help to maintain the response time of the system by quickly rejecting a request for an operation
that's likely to fail, rather than waiting for the operation to time out, or never return.
Retry pattern. Describes how an application can handle anticipated temporary failures when it tries to connect to a
service or network resource by transparently retrying an operation that has previously failed.
Health Endpoint Monitoring pattern. A circuit breaker might be able to test the health of a service by sending a
request to an endpoint exposed by the service. The service should return information indicating its status.

Samples related to this pattern are here.

Application Health Probes


Configure and test health probes for your load balancers and traffic managers. Ensure that your health endpoint
checks the critical parts of the system and responds appropriately.
For Azure Traffic Manager, the health probe determines whether to fail over to another region. Your health
endpoint should check any critical dependencies that are deployed within the same region.
For Azure Load Balancer, the health probe determines whether to remove a VM from rotation. The health
endpoint should report the health of the VM. Don't include other tiers or external services. Otherwise, a failure
that occurs outside the VM will cause the load balancer to remove the VM from rotation.
For guidance on implementing health monitoring in your application, see Health Endpoint Monitoring pattern.
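In ASP.NET Core, for example, a basic health endpoint can be exposed with the built-in health checks middleware; the route below is an assumption, and real probes would register checks for the instance's critical dependencies.

using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Register health checks; add checks for the critical dependencies of this
// instance only (for example, its database), not for other tiers.
builder.Services.AddHealthChecks();

var app = builder.Build();

// The endpoint that Azure Load Balancer or Traffic Manager probes.
app.MapHealthChecks("/healthz");

app.Run();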

Samples related to health probes are here.


ARM template that deploys an Azure Load Balancer and health probes that detect the health of the sample
service endpoint.
An ASP.NET Core Web API that shows configuration of health checks at startup.

Command and Query Responsibility Segregation (CQRS)


Achieve the levels of scale and performance needed for your solution by segregating read and write interfaces
with the CQRS pattern.
Failure mode analysis for Azure applications

Failure mode analysis (FMA) is a process for building resiliency into a system, by identifying possible failure points
in the system. The FMA should be part of the architecture and design phases, so that you can build failure recovery
into the system from the beginning.
Here is the general process to conduct an FMA:
1. Identify all of the components in the system. Include external dependencies, such as identity providers,
third-party services, and so on.
2. For each component, identify potential failures that could occur. A single component may have more than
one failure mode. For example, you should consider read failures and write failures separately, because the
impact and possible mitigation steps will be different.
3. Rate each failure mode according to its overall risk. Consider these factors:
What is the likelihood of the failure? Is it relatively common? Extremely rare? You don't need exact
numbers; the purpose is to help rank the priority.
What is the impact on the application, in terms of availability, data loss, monetary cost, and business
disruption?
4. For each failure mode, determine how the application will respond and recover. Consider tradeoffs in cost
and application complexity.
As a starting point for your FMA process, this article contains a catalog of potential failure modes and their
mitigation steps. The catalog is organized by technology or Azure service, plus a general category for application-
level design. The catalog is not exhaustive, but covers many of the core Azure services.

App Service
App Service app shuts down.
Detection. Possible causes:
Expected shutdown
An operator shuts down the application; for example, using the Azure portal.
The app was unloaded because it was idle. (Only if the Always On setting is disabled.)
Unexpected shutdown
The app crashes.
An App Service VM instance becomes unavailable.
Logging in Application_End will catch the app domain shutdown (soft process crash), and is the only way to
detect application domain shutdowns.
Recovery:
If the shutdown was expected, use the application's shutdown event to shut down gracefully. For example, in
ASP.NET, use the Application_End method.
If the application was unloaded while idle, it is automatically restarted on the next request. However, you will
incur the "cold start" cost.
To prevent the application from being unloaded while idle, enable the Always On setting in the web app. See
Configure web apps in Azure App Service.
To prevent an operator from shutting down the app, set a resource lock with ReadOnly level. See Lock resources
with Azure Resource Manager.
If the app crashes or an App Service VM becomes unavailable, App Service automatically restarts the app.
Diagnostics. Application logs and web server logs. See Enable diagnostics logging for web apps in Azure App
Service.
A particular user repeatedly makes bad requests or overloads the system.
Detection. Authenticate users and include user ID in application logs.
Recovery:
Use Azure API Management to throttle requests from the user. See Advanced request throttling with Azure API
Management.
Block the user.
Diagnostics. Log all authentication requests.
A bad update was deployed.
Detection. Monitor the application health through the Azure portal (see Monitor Azure web app performance) or
implement the health endpoint monitoring pattern.
Recovery: Use multiple deployment slots and roll back to the last-known-good deployment. For more
information, see Basic web application.

Azure Active Directory


OpenID Connect authentication fails.
Detection. Possible failure modes include:
1. Azure AD is not available, or cannot be reached due to a network problem. Redirection to the authentication
endpoint fails, and the OpenID Connect middleware throws an exception.
2. Azure AD tenant does not exist. Redirection to the authentication endpoint returns an HTTP error code, and the
OpenID Connect middleware throws an exception.
3. User cannot authenticate. No detection strategy is necessary; Azure AD handles login failures.
Recovery:
1. Catch unhandled exceptions from the middleware.
2. Handle AuthenticationFailed events.
3. Redirect the user to an error page.
4. User retries.
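A minimal sketch of step 2, assuming the OWIN OpenID Connect middleware (Microsoft.Owin.Security.OpenIdConnect) inside an OWIN Startup class; "/Home/Error" is a hypothetical error page and the remaining middleware options are omitted.

```csharp
// Inside Startup.ConfigureAuth(IAppBuilder app) - sketch only; authority, client ID, etc. omitted.
app.UseOpenIdConnectAuthentication(new OpenIdConnectAuthenticationOptions
{
    Notifications = new OpenIdConnectAuthenticationNotifications
    {
        AuthenticationFailed = context =>
        {
            // Suppress the default exception and send the user to an error page instead.
            context.HandleResponse();
            context.Response.Redirect("/Home/Error?message=" + Uri.EscapeDataString(context.Exception.Message));
            return Task.FromResult(0);
        }
    }
});
```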

Azure Search
Writing data to Azure Search fails.
Detection. Catch Microsoft.Rest.Azure.CloudException errors.
Recovery:
The Search .NET SDK automatically retries after transient failures. Any exceptions thrown by the client SDK should
be treated as nontransient errors.
The default retry policy uses exponential back-off. To use a different retry policy, call SetRetryPolicy on the
SearchIndexClient or SearchServiceClient class. For more information, see Automatic Retries.
Diagnostics. Use Search Traffic Analytics.
Reading data from Azure Search fails.
Detection. Catch Microsoft.Rest.Azure.CloudException errors.
Recovery:
The Search .NET SDK automatically retries after transient failures. Any exceptions thrown by the client SDK should
be treated as nontransient errors.
The default retry policy uses exponential back-off. To use a different retry policy, call SetRetryPolicy on the
SearchIndexClient or SearchServiceClient class. For more information, see Automatic Retries.

Diagnostics. Use Search Traffic Analytics.
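For example, a sketch of swapping the default retry policy, assuming the Microsoft.Azure.Search SDK together with the Microsoft.Rest.TransientFaultHandling types; the service name and key are placeholders.

```csharp
using System;
using Microsoft.Azure.Search;
using Microsoft.Rest.TransientFaultHandling;

var serviceClient = new SearchServiceClient("my-search-service", new SearchCredentials("<admin-api-key>"));

// Replace the default exponential back-off with a fixed-interval policy: 5 attempts, 2 seconds apart.
serviceClient.SetRetryPolicy(
    new RetryPolicy<HttpStatusCodeErrorDetectionStrategy>(
        new FixedIntervalRetryStrategy(5, TimeSpan.FromSeconds(2))));
```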

Cassandra
Reading or writing to a node fails.
Detection. Catch the exception. For .NET clients, this is typically System.Web.HttpException. Other clients may
have different exception types. For more information, see Cassandra error handling done right.
Recovery:
Each Cassandra client has its own retry policies and capabilities. For more information, see Cassandra error
handling done right.
Use a rack-aware deployment, with data nodes distributed across the fault domains.
Deploy to multiple regions with local quorum consistency. If a nontransient failure occurs, fail over to another
region.
Diagnostics. Application logs.

Cloud Service
Web or worker roles are unexpectedly being shut down.
Detection. The RoleEnvironment.Stopping event is fired.
Recovery. Override the RoleEntryPoint.OnStop method to gracefully clean up. For more information, see The
Right Way to Handle Azure OnStop Events (blog).

Cosmos DB
Reading data fails.
Detection. Catch System.Net.Http.HttpRequestException or Microsoft.Azure.Documents.DocumentClientException.
Recovery:
The SDK automatically retries failed attempts. To set the number of retries and the maximum wait time,
configure ConnectionPolicy.RetryOptions . Exceptions that the client raises are either beyond the retry policy or
are not transient errors.
If Cosmos DB throttles the client, it returns an HTTP 429 error. Check the status code in the
DocumentClientException . If you are getting error 429 consistently, consider increasing the throughput value of
the collection.
If you are using the MongoDB API, the service returns error code 16500 when throttling.
Replicate the Cosmos DB database across two or more regions. All replicas are readable. Using the client SDKs,
specify the PreferredLocations parameter. This is an ordered list of Azure regions. All reads will be sent to the
first available region in the list. If the request fails, the client will try the other regions in the list, in order. For
more information, see How to set up Azure Cosmos DB global distribution using the SQL API.
Diagnostics. Log all errors on the client side.
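A sketch of the retry and multi-region read settings described above, using the older Microsoft.Azure.DocumentDB SDK that the exception types refer to; the endpoint, key, and region names are placeholders.

```csharp
using System;
using Microsoft.Azure.Documents.Client;

var connectionPolicy = new ConnectionPolicy
{
    // Automatic retries for throttled (HTTP 429) requests.
    RetryOptions = new RetryOptions
    {
        MaxRetryAttemptsOnThrottledRequests = 9,
        MaxRetryWaitTimeInSeconds = 30
    }
};

// Ordered list of regions: reads go to the first available region in the list.
connectionPolicy.PreferredLocations.Add("East US");
connectionPolicy.PreferredLocations.Add("West US");

var client = new DocumentClient(
    new Uri("https://<account>.documents.azure.com:443/"),
    "<account-key>",
    connectionPolicy);
```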
Writing data fails.
Detection. Catch System.Net.Http.HttpRequestException or Microsoft.Azure.Documents.DocumentClientException.
Recovery:
The SDK automatically retries failed attempts. To set the number of retries and the maximum wait time,
configure ConnectionPolicy.RetryOptions . Exceptions that the client raises are either beyond the retry policy or
are not transient errors.
If Cosmos DB throttles the client, it returns an HTTP 429 error. Check the status code in the
DocumentClientException . If you are getting error 429 consistently, consider increasing the throughput value of
the collection.
Replicate the Cosmos DB database across two or more regions. If the primary region fails, another region will
be promoted to write. You can also trigger a failover manually. The SDK does automatic discovery and routing,
so application code continues to work after a failover. During the failover period (typically minutes), write
operations will have higher latency, as the SDK finds the new write region. For more information, see How to set
up Azure Cosmos DB global distribution using the SQL API.
As a fallback, persist the document to a backup queue, and process the queue later.
Diagnostics. Log all errors on the client side.

Queue storage
Writing a message to Azure Queue storage fails consistently.
Detection. After N retry attempts, the write operation still fails.
Recovery:
Store the data in a local cache, and forward the writes to storage later, when the service becomes available.
Create a secondary queue, and write to that queue if the primary queue is unavailable.
Diagnostics. Use storage metrics.
The application cannot process a particular message from the queue.
Detection. Application specific. For example, the message contains invalid data, or the business logic fails for
some reason.
Recovery:
Move the message to a separate queue. Run a separate process to examine the messages in that queue.
Consider using Azure Service Bus Messaging queues, which provide dead-letter queue functionality for this
purpose.

NOTE
If you are using Storage queues with WebJobs, the WebJobs SDK provides built-in poison message handling. See How to use
Azure queue storage with the WebJobs SDK.

Diagnostics. Use application logging.
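If you are not using the WebJobs SDK, a similar poison-message check can be implemented by hand. The following sketch assumes the Azure.Storage.Queues client library; the queue names and the ProcessAsync handler are placeholders.

```csharp
using System;
using System.Threading.Tasks;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

var orders = new QueueClient("<connection-string>", "orders");
var poison = new QueueClient("<connection-string>", "orders-poison");
await poison.CreateIfNotExistsAsync();

QueueMessage[] messages = await orders.ReceiveMessagesAsync(maxMessages: 10);
foreach (QueueMessage message in messages)
{
    try
    {
        await ProcessAsync(message.Body);   // application-specific processing (placeholder)
        await orders.DeleteMessageAsync(message.MessageId, message.PopReceipt);
    }
    catch (Exception) when (message.DequeueCount >= 5)
    {
        // The message has failed repeatedly; park it in a separate queue for offline inspection.
        await poison.SendMessageAsync(message.Body);
        await orders.DeleteMessageAsync(message.MessageId, message.PopReceipt);
    }
}

static Task ProcessAsync(BinaryData body) => Task.CompletedTask;   // placeholder for business logic
```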


Azure Cache for Redis
Reading from the cache fails.
Detection. Catch StackExchange.Redis.RedisConnectionException.
Recovery:
1. Retry on transient failures. Azure Cache for Redis supports built-in retry. For more information, see Azure Cache
for Redis retry guidelines.
2. Treat nontransient failures as a cache miss, and fall back to the original data source.
Diagnostics. Use Azure Cache for Redis diagnostics.
Writing to the cache fails.
Detection. Catch StackExchange.Redis.RedisConnectionException.
Recovery:
1. Retry on transient failures. Azure Cache for Redis supports built-in retry. For more information, see Azure Cache
for Redis retry guidelines.
2. If the error is nontransient, ignore it and let other transactions write to the cache later.
Diagnostics. Use Azure Cache for Redis diagnostics.
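A sketch of the fallback behavior above, treating a StackExchange.Redis connection failure as a cache miss; LoadProductFromDatabaseAsync is a hypothetical data-access method.

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public static class ProductCache
{
    public static async Task<string> GetProductAsync(IDatabase cache, string id)
    {
        try
        {
            RedisValue cached = await cache.StringGetAsync("product:" + id);
            if (cached.HasValue)
            {
                return cached;          // implicit RedisValue -> string conversion
            }
        }
        catch (RedisConnectionException)
        {
            // Treat the failure as a cache miss and fall through to the source of truth.
        }

        string product = await LoadProductFromDatabaseAsync(id);

        try
        {
            await cache.StringSetAsync("product:" + id, product, TimeSpan.FromMinutes(10));
        }
        catch (RedisConnectionException)
        {
            // Ignore cache write failures; a later request can repopulate the cache.
        }

        return product;
    }

    private static Task<string> LoadProductFromDatabaseAsync(string id)
        => Task.FromResult("product " + id);   // placeholder for the real data source
}
```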

SQL Database
Cannot connect to the database in the primary region.
Detection. Connection fails.
Recovery:
Prerequisite: The database must be configured for active geo-replication. See SQL Database Active Geo-Replication.
For queries, read from a secondary replica.
For inserts and updates, manually fail over to a secondary replica. See Initiate a planned or unplanned failover
for Azure SQL Database.
The replica uses a different connection string, so you will need to update the connection string in your application.
Client runs out of connections in the connection pool.
Detection. Catch System.InvalidOperationException errors.
Recovery:
Retry the operation.
As a mitigation plan, isolate the connection pools for each use case, so that one use case can't dominate all the
connections.
Increase the maximum connection pool size.
Diagnostics. Application logs.
Database connection limit is reached.
Detection. Azure SQL Database limits the number of concurrent workers, logins, and sessions. The limits depend
on the service tier. For more information, see Azure SQL Database resource limits.
To detect these errors, catch System.Data.SqlClient.SqlException and check the value of SqlException.Number for
the SQL error code. For a list of relevant error codes, see SQL error codes for SQL Database client applications:
Database connection error and other issues.
Recovery. These errors are considered transient, so retrying may resolve the issue. If you consistently hit these
errors, consider scaling the database.
Diagnostics:
The sys.event_log query returns successful database connections, connection failures, and deadlocks.
Create an alert rule for failed connections.
Enable SQL Database auditing and check for failed logins.
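A sketch of the detection step: inspect SqlException.Number and treat known limit and throttling codes as transient. The codes shown (10928, 10929, 40501, 40613) are common examples only; confirm the full list against the SQL Database error-code documentation.

```csharp
using System.Data.SqlClient;

static class SqlErrorClassifier
{
    // Returns true for error numbers documented as transient resource-limit
    // or throttling conditions (example subset only).
    public static bool IsTransient(SqlException ex)
    {
        switch (ex.Number)
        {
            case 10928:   // resource limit reached
            case 10929:   // server is busy
            case 40501:   // service is busy; retry after the interval in the message
            case 40613:   // database is currently unavailable
                return true;
            default:
                return false;
        }
    }
}
```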

Service Bus Messaging


Reading a message from a Service Bus queue fails.
Detection. Catch exceptions from the client SDK. The base class for Service Bus exceptions is MessagingException.
If the error is transient, the IsTransient property is true.
For more information, see Service Bus messaging exceptions.
Recovery:
1. Retry on transient failures. See Service Bus retry guidelines.
2. Messages that cannot be delivered to any receiver are placed in a dead-letter queue. Use this queue to see
which messages could not be received. There is no automatic cleanup of the dead-letter queue. Messages
remain there until you explicitly retrieve them. See Overview of Service Bus dead-letter queues.
Writing a message to a Service Bus queue fails.
Detection. Catch exceptions from the client SDK. The base class for Service Bus exceptions is MessagingException.
If the error is transient, the IsTransient property is true.
For more information, see Service Bus messaging exceptions.
Recovery:
1. The Service Bus client automatically retries after transient errors. By default, it uses exponential back-off.
After the maximum retry count or maximum timeout period, the client throws an exception. For more
information, see Service Bus retry guidelines.
2. If the queue quota is exceeded, the client throws QuotaExceededException. The exception message gives
more details. Drain some messages from the queue before retrying, and consider using the Circuit Breaker
pattern to avoid continued retries while the quota is exceeded. Also, make sure the
BrokeredMessage.TimeToLive property is not set too high.
3. Within a region, resiliency can be improved by using partitioned queues or topics. A non-partitioned queue
or topic is assigned to one messaging store. If this messaging store is unavailable, all operations on that
queue or topic will fail. A partitioned queue or topic is partitioned across multiple messaging stores.
4. For additional resiliency, create two Service Bus namespaces in different regions, and replicate the
messages. You can use either active replication or passive replication.
Active replication: The client sends every message to both queues. The receiver listens on both queues.
Tag messages with a unique identifier, so the client can discard duplicate messages.
Passive replication: The client sends the message to one queue. If there is an error, the client falls back to
the other queue. The receiver listens on both queues. This approach reduces the number of duplicate
messages that are sent. However, the receiver must still handle duplicate messages.
For more information, see GeoReplication sample and Best practices for insulating applications against
Service Bus outages and disasters.
Duplicate message.
Detection. Examine the MessageId and DeliveryCount properties of the message.
Recovery:
If possible, design your message processing operations to be idempotent. Otherwise, store message IDs of
messages that are already processed, and check the ID before processing a message.
Enable duplicate detection, by creating the queue with RequiresDuplicateDetection set to true. With this
setting, Service Bus automatically deletes any message that is sent with the same MessageId as a previous
message. Note the following:
This setting prevents duplicate messages from being put into the queue. It doesn't prevent a receiver
from processing the same message more than once.
Duplicate detection has a time window. If a duplicate is sent beyond this window, it won't be detected.
Diagnostics. Log duplicated messages.
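A sketch of enabling duplicate detection with the older WindowsAzure.ServiceBus SDK (the package that the MessagingException and BrokeredMessage types above come from); the connection string, queue name, and MessageId value are placeholders.

```csharp
using System;
using Microsoft.ServiceBus;
using Microsoft.ServiceBus.Messaging;

var namespaceManager = NamespaceManager.CreateFromConnectionString("<service-bus-connection-string>");

if (!await namespaceManager.QueueExistsAsync("orders"))
{
    await namespaceManager.CreateQueueAsync(new QueueDescription("orders")
    {
        // Service Bus drops any message whose MessageId matches one seen within this window.
        RequiresDuplicateDetection = true,
        DuplicateDetectionHistoryTimeWindow = TimeSpan.FromMinutes(10)
    });
}

// Senders must set a deterministic MessageId for duplicate detection to work.
var message = new BrokeredMessage("order payload") { MessageId = "order-12345" };
```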
The application can't process a particular message from the queue.
Detection. Application specific. For example, the message contains invalid data, or the business logic fails for
some reason.
Recovery:
There are two failure modes to consider.
The receiver detects the failure. In this case, move the message to the dead-letter queue. Later, run a separate
process to examine the messages in the dead-letter queue.
The receiver fails in the middle of processing the message — for example, due to an unhandled exception. To
handle this case, use PeekLock mode. In this mode, if the lock expires, the message becomes available to other
receivers. If the message exceeds the maximum delivery count or the time-to-live, the message is automatically
moved to the dead-letter queue.
For more information, see Overview of Service Bus dead-letter queues.
Diagnostics. Whenever the application moves a message to the dead-letter queue, write an event to the
application logs.

Service Fabric
A request to a service fails.
Detection. The service returns an error.
Recovery:
Locate a proxy again (ServiceProxy or ActorProxy) and call the service or actor method again.
Stateful service. Wrap operations on reliable collections in a transaction. If there is an error, the transaction
will be rolled back. The request, if pulled from a queue, will be processed again.
Stateless service. If the service persists data to an external store, all operations need to be idempotent.
Diagnostics. Application logs.
Service Fabric node is shut down.
Detection. A cancellation token is passed to the service's RunAsync method. Service Fabric cancels the task before
shutting down the node.
Recovery. Use the cancellation token to detect shutdown. When Service Fabric requests cancellation, finish any
work and exit RunAsync as quickly as possible.
Diagnostics. Application logs.
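A sketch of honoring the cancellation token, inside a class that derives from StatelessService (or StatefulService); DoUnitOfWorkAsync is a hypothetical application method.

```csharp
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    while (true)
    {
        // Throws OperationCanceledException when Service Fabric is shutting down the node,
        // which ends RunAsync promptly and lets the runtime continue.
        cancellationToken.ThrowIfCancellationRequested();

        await DoUnitOfWorkAsync(cancellationToken);          // application-specific work (placeholder)
        await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
    }
}
```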

Storage
Writing data to Azure Storage fails
Detection. The client receives errors when writing.
Recovery:
1. Retry the operation, to recover from transient failures. The retry policy in the client SDK handles this
automatically.
2. Implement the Circuit Breaker pattern to avoid overwhelming storage.
3. If N retry attempts fail, perform a graceful fallback. For example:
Store the data in a local cache, and forward the writes to storage later, when the service becomes
available.
If the write action was in a transactional scope, compensate the transaction.
Diagnostics. Use storage metrics.
Reading data from Azure Storage fails.
Detection. The client receives errors when reading.
Recovery:
1. Retry the operation, to recover from transient failures. The retry policy in the client SDK handles this
automatically.
2. For RA-GRS storage, if reading from the primary endpoint fails, try reading from the secondary endpoint. The
client SDK can handle this automatically. See Azure Storage replication.
3. If N retry attempts fail, take a fallback action to degrade gracefully. For example, if a product image can't be
retrieved from storage, show a generic placeholder image.
Diagnostics. Use storage metrics.

Virtual machine
Connection to a backend VM fails.
Detection. Network connection errors.
Recovery:
Deploy at least two backend VMs in an availability set, behind a load balancer.
If the connection error is transient, sometimes TCP will successfully retry sending the message.
Implement a retry policy in the application.
For persistent or nontransient errors, implement the Circuit Breaker pattern.
If the calling VM exceeds its network egress limit, the outbound queue will fill up. If the outbound queue is
consistently full, consider scaling out.
Diagnostics. Log events at service boundaries.
VM instance becomes unavailable or unhealthy.
Detection. Configure a Load Balancer health probe that signals whether the VM instance is healthy. The probe
should check whether critical functions are responding correctly.
Recovery. For each application tier, put multiple VM instances into the same availability set, and place a load
balancer in front of the VMs. If the health probe fails, the Load Balancer stops sending new connections to the
unhealthy instance.
Diagnostics:
Use Load Balancer log analytics.
Configure your monitoring system to monitor all of the health monitoring endpoints.
Operator accidentally shuts down a VM.
Detection. N/A
Recovery. Set a resource lock with ReadOnly level. See Lock resources with Azure Resource Manager.
Diagnostics. Use Azure Activity Logs.

WebJobs
Continuous job stops running when the SCM host is idle.
Detection. Pass a cancellation token to the WebJob function. For more information, see Graceful shutdown.
Recovery. Enable the Always On setting in the web app. For more information, see Run Background tasks with
WebJobs.
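For reference, a sketch of a WebJobs function that observes the SDK's shutdown token; the queue name and ProcessAsync handler are placeholders.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class Functions
{
    public static async Task ProcessQueueMessage(
        [QueueTrigger("workitems")] string message,
        CancellationToken cancellationToken)
    {
        // The WebJobs SDK signals this token when the host is shutting down,
        // giving the function a chance to finish or checkpoint its work.
        if (cancellationToken.IsCancellationRequested)
        {
            return;
        }

        await ProcessAsync(message, cancellationToken);
    }

    private static Task ProcessAsync(string message, CancellationToken token)
        => Task.CompletedTask;   // placeholder for application logic
}
```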

Application design
Application can't handle a spike in incoming requests.
Detection. Depends on the application. Typical symptoms:
The website starts returning HTTP 5xx error codes.
Dependent services, such as database or storage, start to throttle requests. Look for HTTP errors such as HTTP
429 (Too Many Requests), depending on the service.
HTTP queue length grows.
Recovery:
Scale out to handle increased load.
Mitigate failures to avoid having cascading failures disrupt the entire application. Mitigation strategies
include:
Implement the Throttling pattern to avoid overwhelming backend systems.
Use queue-based load leveling to buffer requests and process them at an appropriate pace.
Prioritize certain clients. For example, if the application has free and paid tiers, throttle customers on the
free tier, but not paid customers. See Priority queue pattern.
Diagnostics. Use App Service diagnostic logging. Use a service such as Azure Log Analytics, Application Insights,
or New Relic to help understand the diagnostic logs.

A sample is available here. It uses Polly for these exceptions:


429 - Throttling
408 - Timeout
401 - Unauthorized
503 or 5xx - Service unavailable
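A sketch of such a policy with Polly, handling the retryable codes from the list above (429, 408, and 5xx); a 401 usually needs a token refresh rather than a plain retry, so it is left out here. The back-off values are illustrative only.

```csharp
using System;
using System.Net;
using System.Net.Http;
using Polly;

var retryPolicy = Policy
    .HandleResult<HttpResponseMessage>(response =>
        response.StatusCode == (HttpStatusCode)429 ||            // throttling
        response.StatusCode == HttpStatusCode.RequestTimeout ||  // 408
        (int)response.StatusCode >= 500)                         // 5xx, including 503
    .Or<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

var httpClient = new HttpClient();
HttpResponseMessage result = await retryPolicy.ExecuteAsync(
    () => httpClient.GetAsync("https://example.com/api/orders"));
```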
One of the operations in a workflow or distributed transaction fails.
Detection. After N retry attempts, it still fails.
Recovery:
As a mitigation plan, implement the Scheduler Agent Supervisor pattern to manage the entire workflow.
Don't retry on timeouts. There is a low success rate for this error.
Queue work, in order to retry later.
Diagnostics. Log all operations (successful and failed), including compensating actions. Use correlation IDs, so
that you can track all operations within the same transaction.
A call to a remote service fails.
Detection. HTTP error code.
Recovery:
1. Retry on transient failures.
2. If the call fails after N attempts, take a fallback action. (Application specific.)
3. Implement the Circuit Breaker pattern to avoid cascading failures.
Diagnostics. Log all remote call failures.
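A sketch of steps 1 and 3 combined with Polly; the thresholds, durations, and URL are illustrative only.

```csharp
using System;
using System.Net.Http;
using Polly;

var retry = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(attempt));

// Open the circuit after 5 consecutive failures; stay open for 30 seconds before probing again.
var circuitBreaker = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));

var policy = Policy.WrapAsync(retry, circuitBreaker);

var httpClient = new HttpClient();
string payload = await policy.ExecuteAsync(
    () => httpClient.GetStringAsync("https://example.com/api/status"));
```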

Next steps
For more information about the FMA process, see Resilience by design for cloud services (PDF download).
Backup and disaster recovery for Azure applications

Disaster recovery is the process of restoring application functionality in the wake of a catastrophic loss.
Your tolerance for reduced functionality during a disaster is a business decision that varies from one application to
the next. It might be acceptable for some applications to be unavailable or to be partially available with reduced
functionality or delayed processing for a period of time. For other applications, any reduced functionality is
unacceptable.

Dependent service outage


For each dependent service, you should understand the implications of a service disruption and the way that the
application will respond. Many services include features that support resiliency and availability, so evaluating each
service independently is likely to improve your disaster recovery plan. For example, Azure Event Hubs supports
failing over to the secondary namespace.

Network outage
When parts of the Azure network are inaccessible, you might not be able to access your application or data. In this
situation, we recommend designing the disaster recovery strategy to run most applications with reduced
functionality.
If reducing functionality isn't an option, the remaining options are application downtime or failover to an alternate
region.
In a reduced functionality scenario:
If your application can't access its data because of an Azure network outage, you might be able to run locally
with reduced application functionality by using cached data.
You might be able to store data in an alternate location until connectivity is restored.

Manual responses
Although automation is ideal, some strategies for disaster recovery require manual responses.
Alerts
Monitor your application for warning signs that may require proactive intervention. For example, if Azure SQL
Database or Azure Cosmos DB consistently throttles your application, you might need to increase your database
capacity or optimize your queries. Even though the application might handle the throttling errors transparently,
your telemetry should still raise an alert so that you can follow up.
For service limits and quota thresholds, we recommend configuring alerts on Azure resources metrics and
diagnostics logs. When possible, set up alerts on metrics, which are lower latency than diagnostics logs.
Through Resource Health, Azure provides some built-in health status checks that can help you diagnose Azure
service throttling issues.
Failover
Configure a disaster recovery strategy for each Azure application and its Azure services. Acceptable deployment
strategies to support disaster recovery may vary based on the SLAs required for all components of each
application.
Azure provides different features within many Azure services to allow for manual failover, such as Azure Cache for
Redis geo-replicas, or for automated failover, such as SQL auto-failover groups. For example:
For an application that mainly uses virtual machines, you can use Azure Site Recovery for the web and logic
tiers. For more information, see Azure to Azure disaster recovery architecture. For SQL Server on VMs, use SQL
Server Always On availability groups.
For an application that uses App Service and Azure SQL Database, you can use a smaller tier App Service plan
configured in the secondary region, which autoscales when a failover occurs. Use failover groups for the
database tier.
In either scenario, an Azure Traffic Manager profile provides for the automated failover across regions. Load
balancers or application gateways should be set up in the secondary region to support faster availability on
failover.
Operational readiness testing
Perform an operational readiness test for failover to the secondary region and for failback to the primary region.
Many Azure services support manual failover or test failover for disaster recovery drills. Alternatively, you can
simulate an outage by shutting down or removing Azure services.

Data corruption and restoration


If a data store fails, there might be data inconsistencies when it becomes available again, especially if the data was
replicated. Understanding the recovery time objective (RTO) and recovery point objective (RPO) of replicated data
stores can help you predict the amount of data loss.
To understand whether the cross-regional failover is started manually or by Microsoft, review the Azure service
SLAs. For services with no SLAs for cross-regional failover, Microsoft typically decides when to fail over and
usually prioritizes recovery of data in the primary region. If data in the primary region is deemed unrecoverable,
Microsoft fails over to the secondary region.
Restoring data from backups
Backups protect you from losing a component of the application because of accidental deletion or data corruption.
They preserve a functional version of the component from an earlier time, which you can use to restore it.
Disaster recovery strategies are not a replacement for backups, but regular backups of application data support
some disaster recovery scenarios. Your backup storage choices should be based on your disaster recovery
strategy.
The frequency of running the backup process determines your RPO. For example, if you perform hourly backups
and a disaster occurs two minutes before the backup, you will lose 58 minutes of data. Your disaster recovery plan
should include how you will address lost data.
It's common for data in one data store to reference data in another store. For example, consider a SQL Database
with a column that links to a blob in Azure Storage. If backups don't happen simultaneously, the database might
have a pointer to a blob that wasn't backed up before the failure. The application or the disaster recovery plan
must implement processes to handle this inconsistency after a recovery.

NOTE
In some scenarios, such as that of VMs backed up using Azure Backup, you can restore only from a backup in the same
region. Other Azure services, such as Azure Cache for Redis, provide geo-replicated backups, which you can use to restore
services across regions.

Azure Storage and Azure SQL Database


Azure automatically stores Azure Storage and SQL Database data three times within different fault domains in the
same region. If you use geo-replication, the data is stored three additional times in a different region. However, if
the data is corrupted or deleted in the primary copy (for example, because of user error), the changes replicate to
the other copies.
You have two options for managing potential data corruption or deletion:
Create a custom backup strategy. You can store your backups in Azure or on-premises, depending on your
business requirements and governance regulations.
Use the point-in-time restore option to recover a SQL Database.
Azure Storage recovery
You can develop a custom backup process for Azure Storage or use one of many third-party backup tools.
Azure Storage provides data resiliency through automated replicas, but it doesn't prevent application code or
users from corrupting data. Maintaining data fidelity after application or user error requires more advanced
techniques, such as copying the data to a secondary storage location with an audit log. You have several options:
Block blobs. Create a point-in-time snapshot of each block blob. For each snapshot, you are charged only
for the storage required to store the differences within the blob since the previous snapshot state. The
snapshots are dependent on the original blob, so we recommend copying to another blob or even to
another storage account. This approach ensures that backup data is protected against accidental deletion.
Use AzCopy or Azure PowerShell to copy the blobs to another storage account.
For more information, see Creating a Snapshot of a Blob.
Azure Files. Use share snapshots, AzCopy, or PowerShell to copy your files to another storage account.
Azure Table storage. Use AzCopy to export the table data into another storage account in another region.
SQL Database recovery
To protect your business from data loss, SQL Database automatically performs a combination of full database
backups weekly, differential database backups hourly, and transaction log backups every 5 to 10 minutes. For the
Basic, Standard, and Premium SQL Database tiers, use point-in-time restore to restore a database to an earlier
time. Review the following articles for more information:
Recover an Azure SQL database using automated database backups
Overview of business continuity with Azure SQL Database
Another option is to use active geo-replication for SQL Database, which automatically replicates database changes
to secondary databases in the same or different Azure region. For more information, see Creating and using active
geo-replication.
You can also use a more manual approach for backup and restore:
Use the Database Copy capability to create a backup copy of the database with transactional consistency.
Use the Azure SQL Database Import/Export Service, which supports exporting databases to BACPAC files
(compressed files containing your database schema and associated data) that are stored in Azure Blob storage.
To protect against a region-wide service disruption, copy the BACPAC files to an alternate region.
SQL Server on VMs
For SQL Server running on VMs, you have two options: traditional backups and log shipping.
With traditional backups, you can restore to a specific point in time, but the recovery process is slow. Restoring
traditional backups requires that you start with an initial full backup and then apply any incremental backups.
You can configure a log shipping session to delay the restore of log backups. This provides a window to recover
from errors made on the primary replica.
Azure Database for MySQL and Azure Database for PostgreSQL
In Azure Database for MySQL and Azure Database for PostgreSQL, the database service automatically makes a
backup every five minutes. You can use these automated backups to restore the server and its databases from an
earlier point in time to a new server. For more information, see:
How to back up and restore a server in Azure Database for MySQL by using the Azure portal
How to back up and restore a server in Azure Database for PostgreSQL using the Azure portal
Azure Cosmos DB
Cosmos DB automatically makes a backup at regular intervals. Backups are stored separately in another storage
service and are replicated globally to protect against regional disasters. If you accidentally delete your database or
collection, you can file a support ticket or call Azure support to restore the data from the last automatic backup.
For more information, see Online backup and on-demand restore in Azure Cosmos DB.
Azure Virtual Machines
To protect Azure Virtual Machines from application errors or accidental deletion, use Azure Backup. The created
backups are consistent across multiple VM disks. In addition, the Azure Backup vault can be replicated across
regions to support recovery from a regional loss.

Disaster recovery plan


Start by creating a recovery plan. The plan is considered complete after it has been fully tested. Include the people,
processes, and applications needed to restore functionality within the service-level agreement (SLA) you've
defined for your customers.
Consider the following suggestions when creating and testing your disaster recovery plan:
In your plan, include the process for contacting support and for escalating issues. This information will help to
avoid prolonged downtime as you work out the recovery process for the first time.
Evaluate the business impact of application failures.
Choose a cross-region recovery architecture for mission-critical applications.
Identify a specific owner of the disaster recovery plan, including automation and testing.
Document the process, especially any manual steps.
Automate the process as much as possible.
Establish a backup strategy for all reference and transactional data, and test backup restoration regularly.
Set up alerts for the stack of the Azure services consumed by your application.
Train operations staff to execute the plan.
Perform regular disaster simulations to validate and improve the plan.
If you're using Azure Site Recovery to replicate virtual machines (VMs), create a fully automated recovery plan to
fail over the entire application.

Backup strategy
Many alternative strategies are available for implementing distributed compute across regions. These must be
tailored to the specific business requirements and circumstances of the application. At a high level, the approaches
can be divided into the following categories:
Redeploy on disaster: In this approach, the application is redeployed from scratch at the time of disaster.
This is appropriate for non-critical applications that don't require a guaranteed recovery time. See Redeploy to a
new region.
Warm Spare (Active/Passive): A secondary hosted service is created in an alternate region, and roles are
deployed to guarantee minimal capacity; however, the roles don't receive production traffic. This approach
is useful for applications that have not been designed to distribute traffic across regions. See the Basic Web
Application example and Replicate VM to another region.
Hot Spare (Active/Active): The application is designed to receive production load in multiple regions. The
cloud services in each region might be configured for higher capacity than required for disaster recovery
purposes. Alternatively, the cloud services might scale out as necessary at the time of a disaster and failover.
This approach requires substantial investment in application design, but it has significant benefits. These
include low and guaranteed recovery time, continuous testing of all recovery locations, and efficient usage
of capacity. See the Multi-tier DR example.

Resource management
You can distribute compute instances across regions by creating a separate cloud service in each target region,
and then publishing the deployment package to each cloud service. However, distributing traffic across cloud
services in different regions must be implemented by the application developer or with a traffic management
service.
Determining the number of spare role instances to deploy in advance for disaster recovery is an important aspect
of capacity planning. Having a full-scale secondary deployment ensures that capacity is already available when
needed; however, this effectively doubles the cost. A common pattern is to have a small, secondary deployment,
just large enough to run critical services. This small secondary deployment is a good idea, both to reserve capacity,
and for testing the configuration of the secondary environment.

NOTE
The subscription quota is not a capacity guarantee. The quota is simply a credit limit. To guarantee capacity, the required
number of roles must be defined in the service model, and the roles must be deployed.

Failover and failback testing


Test failover and failback to verify that your application's dependent services come back up in a synchronized
manner during disaster recovery. Changes to systems and operations may affect failover and failback functions,
but the impact may not be detected until the main system fails or becomes overloaded. Test failover capabilities
before they are required to compensate for a live problem. Also be sure that dependent services fail over and fail
back in the correct order.
If you are using Azure Site Recovery to replicate VMs, run disaster recovery drills periodically by doing test
failovers to validate your replication strategy. A test failover does not affect the ongoing VM replication or your
production environment. For more information, see Run a disaster recovery drill to Azure.

Validating backups
Regularly verify that your backup data is what you expect by running a script to validate data integrity, schema,
and queries. There's no point in having a backup if it can't be used to restore your data sources. Log and report any
inconsistencies so the backup service can be repaired.

Backup Storage
Backups are about protecting data, applications, and systems that are important to the organization. In operations
environments, it's relatively easy to provide backups: pick the workload that needs to be highly available and back it up.
Operations environments are relatively static, in that the systems and applications in use remain largely
consistent, with only the data changing daily.

Application archives
It's important to remember that a DR plan is more than an ordered restore-from-backup and validation process.
Applications may require post-restoration configuration because of site changes, or reinstallation may be
necessary, with the restored data imported afterward.

Outage retrospectives
No amount of safeguards or preparation can prevent every possible incident, and sometimes simple human error
can have significant consequences to a development project. You can't avoid it, but you can learn from it and take
steps to minimize the chances of a similar incident in the future. The question is how software organizations can
best go about learning from mistakes with agile postmortems.

Planning for regional failures


Azure is divided physically and logically into units called regions. A region consists of one or more datacenters in
close proximity.
Under rare circumstances, it is possible that facilities in an entire region can become inaccessible, for example due
to network failures. Or facilities can be lost entirely, for example due to a natural disaster. This section explains the
capabilities of Azure for creating applications that are distributed across regions. Such distribution helps to
minimize the possibility that a failure in one region could affect other regions.
Review Recover from loss of an Azure region for guidance on specific Azure services.

Service-specific guidance
The following articles describe disaster recovery for specific Azure services:

SERVICE                          ARTICLE

Azure Database for MySQL         Overview of business continuity with Azure Database for MySQL

Azure Database for PostgreSQL    Overview of business continuity with Azure Database for PostgreSQL

Azure Cloud Services             What to do in the event of an Azure service disruption that impacts Azure Cloud Services

Cosmos DB                        High availability with Azure Cosmos DB

Azure Key Vault                  Azure Key Vault availability and redundancy

Azure Storage                    Disaster recovery and storage account failover (preview) in Azure Storage

SQL Database                     Restore an Azure SQL Database or failover to a secondary region

Virtual Machines                 What to do in the event of an Azure service disruption that impacts Azure Cloud Services

Azure Virtual Network            Virtual Network – Business Continuity

Next steps
Recover from data corruption or accidental deletion
Recover from a region-wide service disruption
Using business metrics to design resilient Azure
applications

Workload availability targets


Define your own target SLAs for each workload in your solution so you can determine whether the architecture
meets the business requirements.
Consider cost and complexity
Everything else being equal, higher availability is better. But as you strive for more nines, the cost and complexity
grow. An uptime of 99.99% translates to about five minutes of total downtime per month. Is it worth the additional
complexity and cost to reach five nines? The answer depends on the business requirements.
Here are some other considerations when defining an SLA:
To achieve four nines (99.99%), you can't rely on manual intervention to recover from failures. The application
must be self-diagnosing and self-healing.
Beyond four nines, it's challenging to detect outages quickly enough to meet the SLA.
Think about the time window that your SLA is measured against. The smaller the window, the tighter the
tolerances. It doesn't make sense to define your SLA in terms of hourly or daily uptime.
Consider the MTBF and MTTR measurements. The higher your SLA, the less frequently the service can go down
and the quicker the service must recover.
Get agreement from your customers for the availability targets of each piece of your application, and document
it. Otherwise, your design may not meet the customers' expectations.
Identify dependencies
Perform dependency-mapping exercises to identify internal and external dependencies. Examples include
dependencies relating to security or identity, such as Active Directory, or third-party services such as a payment
provider or e-mail messaging service.
Pay particular attention to external dependencies that might be a single point of failure or cause bottlenecks. If a
workload requires 99.99% uptime but depends on a service with a 99.9% SLA, that service can't be a single point
of failure in the system. One remedy is to have a fallback path in case the service fails. Alternatively, take other
measures to recover from a failure in that service.
The following table shows the potential cumulative downtime for various SLA levels.

SLA        DOWNTIME PER WEEK        DOWNTIME PER MONTH        DOWNTIME PER YEAR

99% 1.68 hours 7.2 hours 3.65 days

99.9% 10.1 minutes 43.2 minutes 8.76 hours

99.95% 5 minutes 21.6 minutes 4.38 hours

99.99% 1.01 minutes 4.32 minutes 52.56 minutes

99.999% 6 seconds 25.9 seconds 5.26 minutes


Recovery metrics
Derive these values by conducting a risk assessment, and make sure you understand the cost and risk of downtime
and data loss. These are nonfunctional requirements of a system and should be dictated by business requirements.
Recovery time objective (RTO) is the maximum acceptable time an application is unavailable after an
incident.
Recovery point objective (RPO) is the maximum duration of data loss that's acceptable during a disaster.
If the MTTR value of any critical component in a highly available setup exceeds the system RTO, a failure in the
system might cause an unacceptable business disruption. That is, you can't restore the system within the defined
RTO.

Availability metrics
Use these measures to plan for redundancy and determine customer SLAs.
Mean time to recover (MTTR) is the average time it takes to restore a component after a failure.
Mean time between failures (MTBF) is how long a component can reasonably be expected to last between
outages.

Understand service-level agreements


In Azure, the Service Level Agreement describes Microsoft's commitments for uptime and connectivity. If the SLA
for a particular service is 99.9%, you should expect the service to be available 99.9% of the time. Different services
have different SLAs.
The Azure SLA also includes provisions for obtaining a service credit if the SLA is not met, along with specific
definitions of availability for each service. That aspect of the SLA acts as an enforcement policy.

The Service Level Agreement Estimator sample shows how to calculate the SLA of your architecture.
Composite SLAs
Composite SLAs involve multiple services supporting an application, each with differing levels of availability. For
example, consider an App Service web app that writes to Azure SQL Database. At the time of this writing, these
Azure services have the following SLAs:
App Service web apps = 99.95%
SQL Database = 99.99%
What is the maximum downtime you would expect for this application? If either service fails, the whole application
fails. The probability of each service failing is independent, so the composite SLA for this application is 99.95% ×
99.99% = 99.94%. That's lower than the individual SLAs, which isn't surprising because an application that relies on
multiple services has more potential failure points.
You can improve the composite SLA by creating independent fallback paths. For example, if SQL Database is
unavailable, put transactions into a queue to be processed later.
With this design, the application is still available even if it can't connect to the database. However, it fails if the
database and the queue both fail at the same time. The expected percentage of time for a simultaneous failure is
0.0001 × 0.001, so the composite SLA for this combined path is:
Database or queue = 1.0 − (0.0001 × 0.001) = 99.99999%
The total composite SLA is:
Web app and (database or queue) = 99.95% × 99.99999% = ~99.95%
There are tradeoffs to this approach. The application logic is more complex, you are paying for the queue, and you
need to consider data consistency issues.
SLAs for multiregion deployments
For multiregion deployments, a common high-availability technique is to deploy the application in more than one
region and use Azure Traffic Manager to fail over if the application fails in one region.
The composite SLA for a multiregion deployment is calculated as follows:
N is the composite SLA for the application deployed in one region.
R is the number of regions where the application is deployed.
The expected chance that the application fails in all regions at the same time is ((1 − N) ^ R). For example, if the
single-region SLA is 99.95%:
The combined SLA for two regions = (1 − (1 − 0.9995) ^ 2) = 99.999975%
The combined SLA for four regions = (1 − (1 − 0.9995) ^ 4) = 99.999999%
The SLA for Traffic Manager is also a factor. Failing over is not instantaneous in active-passive configurations, which
can result in downtime during a failover. See Traffic Manager endpoint monitoring and failover.
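The arithmetic above can be reproduced directly; here is a small sketch (the 99.9% figure for the queue is an assumption for illustration, matching the failure probability used above):

```csharp
using System;

double webApp = 0.9995;          // App Service web app SLA
double sqlDatabase = 0.9999;     // SQL Database SLA
double queue = 0.999;            // assumed SLA for the fallback queue

// Serial dependency: both services must be up.
double composite = webApp * sqlDatabase;                        // ~0.9994

// Independent fallback path: the database or the queue must be up.
double databaseOrQueue = 1 - (1 - sqlDatabase) * (1 - queue);   // ~0.9999999
double withFallback = webApp * databaseOrQueue;                 // ~0.9995

// Multi-region deployment: the application fails only if it fails in every region.
double singleRegion = 0.9995;
double twoRegions = 1 - Math.Pow(1 - singleRegion, 2);          // ~0.99999975

Console.WriteLine($"{composite:P4}  {withFallback:P4}  {twoRegions:P6}");
```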
Chaos engineering

Chaos engineering is a methodology that helps developers attain consistent reliability by hardening services
against failures in production. A common way to introduce chaos is to deliberately inject faults that cause system
components to fail. The goal is to observe, monitor, respond to, and improve your system's reliability under
adverse circumstances.

Context
It's difficult to simulate the characteristics of a service's behavior at scale outside a production environment. The
transient nature of cloud platforms can exacerbate this difficulty. Architecting your service to expect failure is a core
approach to creating a modern service. Chaos engineering embraces the uncertainty of the production
environment and strives to anticipate rare, unpredictable, and disruptive outcomes, so that you can minimize any
potential impact on your customers.

Principles
Chaos engineering is aimed at increasing your service’s resiliency and its ability to react to failures. By conducting
experiments in a controlled environment, you can identify issues that are likely to arise during development and
deployment. During this process, be vigilant in adopting the following guidelines:
Be proactive.
Embrace failure.
Break the system.
Identify and address single points of failure early.
Install guardrails and graceful mitigations.
Minimize the blast radius.
Build immunity.
Chaos engineering should be an integral part of development team culture and an ongoing practice, not a short-
term tactical effort in response to a single outage.
Development team members are partners in the process. They must be equipped with the resources to triage
issues, implement the testability that's required for fault injection, and drive the necessary product changes.

When to apply chaos


Ideally, you should apply chaos principles continuously. There's constant change in the environments in which
software and hardware run, so monitoring the changes is key. By constantly applying stress or faults on
components, you can help expose issues early, before small problems are compounded by a number of other
factors.
Apply chaos engineering principles when you're:
Deploying new code.
Adding dependencies.
Observing changes in usage patterns.
Mitigating problems.
Process
Chaos engineering requires specialized expertise, technology, and practices. As with security and performance
teams, the model of a central team supporting the service teams is a common, effective approach.
If you plan to practice the simulated handling of potentially catastrophic scenarios under controlled conditions,
here's a simplified way to organize your teams:

ATTACKER          DEFENDER

Inject faults     Assess

Provide hints     Analyze

                  Mitigate

Goals
Familiarize team members with monitoring tools.
Recognize outage patterns.
Learn how to assess the impact.
Determine the root cause and mitigate accordingly.
Practice log analysis.
Overall method
1. Start with a hypothesis.
2. Measure baseline behavior.
3. Inject a fault or faults.
4. Monitor the resulting behavior.
5. Document the process and observations.
6. Identify and act on the result.
Periodically validate your process, architecture choices, and code. By conducting fault-injection experiments, you
can confirm that monitoring is in place and alerts are set up, the directly responsible individual (DRI) process is
effective, and your documentation and investigation processes are up to date. Keep in mind a few key
considerations:
Challenge system assumptions.
Validate change (topology, platform, resources).
Use service-level agreement (SLA) buffers.
Use live-site outages as opportunities.

Best practices
Shift left
Shift-left testing means experiment early, experiment often. Incorporate fault-injection configurations and create
resiliency-validation gates during the development stages and in the deployment pipeline.
Shift right
Shift-right testing means that you verify that the service is resilient where it counts in a pre-production or
production environment with actual customer load. Adopt a proactive approach as opposed to reacting to failures.
Be a part of determining and controlling requirements for the blast radius.
Blast radius
Stop the experiment when it goes beyond scope. Unknown results are an expected outcome of chaos experiments.
Strive to achieve balance between collecting substantial result data and affecting as few production users as
possible. For an example of this principle in practice, see the Bulkhead pattern article.
Error budget testing
Establish an error budget as an investment in chaos and fault injection. Your error budget is the difference between
achieving 100% of the service-level objective (SLO) and achieving the agreed-upon SLO.

Considerations
The following sections discuss additional considerations about chaos engineering, based on its application inside
Azure.
Identify faults that are relevant to the development team
Work closely with the development teams to ensure the relevance of the injected failures. Use past incidents or
issues as a guide. Examine dependencies and evaluate the results when those dependencies are removed.
An external team can't hypothesize faults for your team. A study of failures from an artificial source might be
relevant to your team's purposes, but the effort must be justified.
Inject faults in a way that accurately reflects production failures
Simulate production failures. Treat injected faults in the same way that you would treat production-level faults.
Enforcing a tighter limit on the blast radius will enable you to simulate a production environment. Each fault-
injection effort must be accompanied by tooling that's designed to inject the types of faults that are relevant to
your team's scenarios. Here are two basic ways:
Inject faults in a non-production environment, such as Canary or Test In Production (TIP).
Partition the production service or environment.
Halt all faults and roll back the state to its last-known good configuration if the state seems severe.
Build confidence incrementally
Start by hardening the core, and then expand out in layers. At each point, lock in progress with automated
regression tests. Each team should have a long-term strategy based on a progression that makes sense for the
team's circumstances.
By applying the shift left strategy, you can help ensure that any obstacles to developer usage are removed early
and the testing results are actionable.
The process must be very low tax. That is, the process must make it easy for developers to understand what
happened and to fix the issues. The effort must fit easily into their normal workflow, not burden them with one-off
special activities.

Faults
The following table lists faults that you can apply to inject chaos. The list represents commonly injected faults and
isn't intended to be exhaustive.
Resource pressure
    CPU
    Memory
        Physical
        Virtual
        Bad checksum
    Hard disk
        Capacity
        Read
        Write
        Availability
        Data corruption
        Read / Write Latency
Network
    Layers
        Transport (TCP/UDP)
        Application layer (HTTP)
    Types
        Disconnect
        Latency
        Alter response codes (HTTP)
        Packet reorder / loss (TCP/UDP)
        # of connections (active / passive)
        DOS attack
    Filters
        Domain / IP / Subnet
        URL path
        Port / Protocol
        DNS Host Name resolution
Process
    Stop / Kill
    Restart
    Stop service
    Start
    Crash
    Hang
Virtual Machine
    Stop
    Restart
    BSOD
    Change date
    Re-image
    Live Migration
Platform
    Quorum loss
    Data loss
    Move primary node
    Remove replica
Functions
    Latency
    Exceptions
    Status codes
    Intercept / Denylist calls
    Disk capacity
Application specific
    Intercept / Re-route calls
    No access to service code
Hardware
    Machine
    Storage
    Network devices
    Rack
    UPS
    Datacenter
Data Management for Reliability

Database resiliency
Azure SQL Database
SQL Database automatically performs a combination of full database backups weekly, differential database
backups hourly, and transaction log backups every 5 to 10 minutes to protect your business from data loss. Use
point-in-time restore to restore a database to an earlier time. For more information, see:
Recover an Azure SQL database using automated database backups
Overview of business continuity with Azure SQL Database
SQL Server on VMs
For SQL Server running on VMs, there are two options: traditional backups and log shipping. Traditional backups
enable you to restore to a specific point in time, but the recovery process is slow. Restoring traditional backups
requires starting with an initial full backup, and then applying any backups taken after that. The second option is to
configure a log shipping session to delay the restore of log backups (for example, by two hours). This provides a
window to recover from errors made on the primary.
Azure Cosmos DB
Azure Cosmos DB automatically takes backups at regular intervals. Backups are stored separately in another
storage service, and those backups are globally replicated for resiliency against regional disasters. If you
accidentally delete your database or collection, you can file a support ticket or call Azure support to restore the
data from the last automatic backup. For more information, see Automatic online backup and restore with Azure
Cosmos DB.
Azure Database for MySQL, Azure Database for PostgreSQL
When using Azure Database for MySQL or Azure Database for PostgreSQL, the database service automatically
makes a backup every five minutes. You can use these automatic backups to restore the server and all its
databases to a new server at an earlier point in time. For more information, see:
How to back up and restore a server in Azure Database for MySQL by using the Azure portal
How to back up and restore a server in Azure Database for PostgreSQL using the Azure portal

Distribute data geographically


SQL Database
Azure SQL Database provides two types of recovery: geo-restore and active geo-replication.
Geo-restore
Geo-restore is also available with Basic, Standard, and Premium databases. It provides the default recovery option
when the database is unavailable because of an incident in the region where your database is hosted. Similar to
point-in-time restore, geo-restore relies on database backups in geo-redundant Azure storage. It restores from the
geo-replicated backup copy, and therefore is resilient to the storage outages in the primary region. For more
information, see Restore an Azure SQL Database or failover to a secondary.
Active geo-replication
Active geo-replication is available for all database tiers. It’s designed for applications that have more aggressive
recovery requirements than geo-restore can offer. Using active geo-replication, you can create up to four readable
secondaries on servers in different regions. You can initiate failover to any of the secondaries. In addition, active
geo-replication can be used to support application upgrade or relocation scenarios, as well as load balancing
for read-only workloads. For details, see Configure active geo-replication for Azure SQL Database and initiate
failover. Refer to Designing globally available services using Azure SQL Database and Managing rolling upgrades
of cloud applications by using SQL Database active geo-replication for details on how to design and implement
applications and application upgrades without downtime.
SQL Server on Azure Virtual Machines
A variety of options are available for recovery and high availability for SQL Server 2012 (and later) running in
Azure Virtual Machines. For more information, see High availability and disaster recovery for SQL Server in Azure
Virtual Machines.
SQL Server Always On Availability Groups across regions
Alternatively, you can use SQL Always On Availability Groups for high availability by creating a single availability
group that includes the SQL Server instances in both regions.
As an example, Multi-region N-tier application reference architecture shows a set of practices for running an N-tier
application in multiple Azure regions to achieve availability and a robust disaster recovery infrastructure. It uses a
SQL Server Always On Availability Group and Azure Traffic Manager.

Storage resiliency
Azure Storage provides data resiliency through automated replicas. However, this does not prevent application
code or users from corrupting data, whether accidentally or maliciously. Maintaining data fidelity in the face of
application or user error requires more advanced techniques, such as copying the data to a secondary storage
location with an audit log.
Block blobs. Create a point-in-time snapshot of each block blob. For more information, see Creating a
Snapshot of a Blob. For each snapshot, you are only charged for the storage required to store the differences
within the blob since the last snapshot state. The snapshots are dependent on the existence of the original
blob they are based on, so a copy operation to another blob or even another storage account is advisable.
This ensures that backup data is properly protected against accidental deletion. You can use AzCopy or Azure
PowerShell to copy the blobs to another storage account.
Files . Use share snapshots, or use AzCopy or PowerShell to copy your files to another storage account.
Tables . Use AzCopy to export the table data into another storage account in another region.

Samples related to storage resiliency are available here. The scripts perform these tasks:
Deploys a storage account and blob container from an ARM template.
Copies a file into the blob container.
Creates a blob snapshot.
Creates a share snapshot.
Calls AzCopy for Table storage.
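As a minimal illustration of the block blob guidance above, the following Python sketch creates a snapshot and then copies the blob to a second storage account. It assumes the azure-storage-blob (v12) package; the connection strings, container, and blob names are placeholders.

# A minimal sketch: snapshot a block blob and copy it to a backup storage account.
from azure.storage.blob import BlobServiceClient

source = BlobServiceClient.from_connection_string("<primary-connection-string>")
backup = BlobServiceClient.from_connection_string("<backup-connection-string>")

src_blob = source.get_blob_client(container="data", blob="report.csv")

# Create a point-in-time snapshot of the block blob.
snapshot = src_blob.create_snapshot()
print("Snapshot created:", snapshot["snapshot"])

# Copy the current blob into a different storage account so the backup does not
# depend on the original blob. For a private source blob, append a SAS token to the URL.
dst_blob = backup.get_blob_client(container="backups", blob="report.csv")
dst_blob.start_copy_from_url(src_blob.url)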
Monitoring application health for reliability in Azure

Monitoring and diagnostics are crucial for resiliency. If something fails, you need to know that it failed, when it
failed — and why.
Monitoring is not the same as failure detection. For example, your application might detect a transient error and
retry, avoiding downtime. But it should also log the retry operation so that you can monitor the error rate to get an
overall picture of application health.
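For example, a minimal Python sketch of a retry wrapper that logs every attempt, so the retry rate itself becomes a health signal (the TransientError type is a placeholder for whatever transient exception your client library raises):

# A minimal sketch: retry transient failures, but log each attempt so the
# error rate can be monitored even when no request ultimately fails.
import logging
import time

logger = logging.getLogger("app.retries")

class TransientError(Exception):
    """Placeholder for the transient exception raised by your client library."""

def call_with_retry(operation, attempts=3, delay_seconds=2):
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError as ex:
            logger.warning("Transient failure on attempt %d: %s", attempt, ex)
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)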
Think of the monitoring and diagnostics process as a pipeline with four distinct stages: Instrumentation, collection
and storage, analysis and diagnosis, and visualization and alerts.

Early warning system


Monitor your application for warning signs that might require proactive intervention. Tools that assess the overall
health of the application and its dependencies help you to recognize quickly when a system or its components
suddenly become unavailable. Use them to implement an early warning system.
1. Identify the key performance indicators of your application's health, such as transient exceptions and remote
call latency.
2. Set thresholds at levels that identify issues before they become critical and require a recovery response.
3. Send an alert to operations when the threshold value is reached.
Consider Microsoft System Center 2016 or third-party tools to provide monitoring capabilities. Most monitoring
solutions track key performance counters and service availability. Azure resource health provides some built-in
health status checks, which can help diagnose throttling of Azure services.
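As a rough sketch of such a check, the following Python snippet queries a Log Analytics workspace for the recent transient exception count and flags it when a threshold is crossed. It assumes the azure-monitor-query and azure-identity packages and workspace-based Application Insights telemetry; the workspace ID, Kusto query, and threshold are illustrative.

# A rough sketch: poll a key performance indicator and warn before it becomes critical.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

query = "AppExceptions | where TimeGenerated > ago(15m) | count"
response = client.query_workspace("<workspace-id>", query, timespan=timedelta(minutes=15))

exception_count = response.tables[0].rows[0][0]
if exception_count > 50:          # threshold set below the level that requires a recovery response
    print("Warning: transient exception rate is elevated; notify operations.")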

Remote call statistics


Track and report remote call statistics in real time and provide an easy way to review this information, so the
operations team has an instantaneous view into the health of your application. Summarize remote call metrics,
such as latency, throughput, and errors, at the 99th and 95th percentiles.
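For example, a small Python sketch that computes those percentiles from a window of recorded latency samples, using only the standard library:

# A minimal sketch: summarize remote-call latency at the 95th and 99th percentiles.
from statistics import quantiles

latencies_ms = [12, 15, 18, 22, 25, 31, 40, 55, 90, 250]   # samples for one remote dependency

cuts = quantiles(latencies_ms, n=100)   # 99 cut points between the 1st and 99th percentiles
p95, p99 = cuts[94], cuts[98]
print(f"p95={p95:.1f} ms, p99={p99:.1f} ms")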

Long-running workflow failures


Long-running workflows often include multiple steps, each of which should be independent.
Track the progress of long-running processes to minimize the likelihood that the entire workflow will need to be
rolled back or that multiple compensating transactions will need to be executed.

TIP
Monitor and manage the progress of long-running workflows by implementing a pattern such as Scheduler Agent
Supervisor.

Visualization and alerts


Present telemetry data in a format that makes it easy for an operator to notice problems or trends quickly, such as
a dashboard or email alert.
Get a full-stack view of application state by using Azure dashboards to create a consolidated view of monitoring
graphs from Application Insights, Log Analytics, Azure Monitor metrics, and Service Health. And use Azure Monitor
alerts to create notifications on Service Health, resource health, Azure Monitor metrics, logs in Log Analytics, and
Application Insights.
For more information about monitoring and diagnostics, see Monitoring and diagnostics.

Here are some samples about creating and querying alerts:


HealthAlert. A sample about creating resource-level health activity log alerts. The sample uses Azure Resource
Manager to create alerts.
GraphAlertsPsSample. A set of PowerShell commands that queries for alerts generated against your
subscription.

Test monitoring
Include monitoring systems in your test plan. Automated failover and failback systems depend on the correct
functioning of monitoring and instrumentation. Dashboards to visualize system health and operator alerts also
depend on having accurate monitoring and instrumentation. If these elements fail, miss critical information, or
report inaccurate data, an operator might not realize that the system is unhealthy or failing.

Azure subscription and service limits


Azure subscriptions have limits on certain resource types, such as number of resource groups, cores, and storage
accounts. To ensure that your application doesn't run up against Azure subscription limits, create alerts that poll for
services nearing their limits and quotas.
Address the following subscription limits with alerts.

A reference implementation is available on GitHub.


For information about that implementation, see Event-based cloud automation on Azure.
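As a rough sketch of the polling approach, the following Python snippet lists regional compute usage and flags any quota that is more than 80 percent consumed. It assumes the azure-mgmt-compute and azure-identity packages; the subscription ID, region, and threshold are illustrative.

# A rough sketch: flag compute quotas that are close to their subscription limits.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

compute = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

for usage in compute.usage.list("westeurope"):
    if usage.limit and usage.current_value / usage.limit > 0.8:
        print(f"Nearing quota: {usage.name.value} {usage.current_value}/{usage.limit}")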
Individual services
Individual Azure services have consumption limits on storage, throughput, number of connections, requests per
second, and other metrics. Your application will fail if it attempts to use resources beyond these limits, resulting in
service throttling and possible downtime.
Depending on the specific service and your application requirements, you can often stay under these limits by
scaling up (choosing another pricing tier, for example) or scaling out (adding new instances).
Azure storage scalability and performance targets
Azure allows a maximum number of storage accounts per subscription. If your application requires more storage
accounts than are currently available in your subscription, create a new subscription with additional storage
accounts. For more information, see Azure subscription and service limits, quotas, and constraints.
If you exceed Azure storage scalability and performance targets, your application will experience storage throttling.
For more information, see Azure Storage scalability and performance targets.
Scalability targets for virtual machine disks
An Azure infrastructure as a service (IaaS) VM supports attaching a number of data disks, depending on several
factors, including the VM size and the type of storage account. If your application exceeds the scalability targets for
virtual machine disks, provision additional storage accounts and create the virtual machine disks there. For more
information, see Scalability and performance targets for VM disks on Windows.
VM size
If the actual CPU, memory, disk, and I/O of your VMs approach the limits of the VM size, your application may
experience capacity issues. To correct the issues, increase the VM size. VM sizes are described in Sizes for virtual
machines in Azure.
If your workload fluctuates over time, consider using virtual machine scale sets to automatically scale the number
of VM instances. Otherwise, you need to manually increase or decrease the number of VMs. For more information,
see the virtual machine scale sets overview.
Azure SQL Database
If your Azure SQL Database tier isn't adequate to handle your application's Database Transaction Unit (DTU)
requirements, your data use will be throttled. For more information on selecting the correct service plan, see Azure
SQL Database purchasing models.
Sample: This reference implementation monitors a Cosmos DB database for throttling. Azure Monitor alerts are triggered when data access requests to Cosmos DB exceed the capacity in Request Units (RUs). For more context, see the Reference Architecture (Throttling response scenario).
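The following Python sketch shows the client-side half of that scenario: detecting request-unit throttling (HTTP 429) and honoring the retry-after hint. It assumes the azure-cosmos package (which already retries throttled requests on its own; the sketch only makes the behavior explicit), and the account, database, container, and item names are placeholders.

# A minimal sketch: back off when Cosmos DB reports RU throttling (HTTP 429).
import time
from azure.cosmos import CosmosClient
from azure.cosmos.exceptions import CosmosHttpResponseError

client = CosmosClient("<account-uri>", credential="<account-key>")
container = client.get_database_client("appdb").get_container_client("orders")

def read_item_with_backoff(item_id, partition_key):
    while True:
        try:
            return container.read_item(item=item_id, partition_key=partition_key)
        except CosmosHttpResponseError as ex:
            if ex.status_code != 429:
                raise
            retry_after_ms = int(ex.headers.get("x-ms-retry-after-ms", "1000"))
            time.sleep(retry_after_ms / 1000)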

Instrumentation
It's not practical to monitor your application directly, so instrumentation is key. A large-scale distributed system
might run on dozens of virtual machines (VMs), which are added and removed over time. Likewise, a cloud
application might use a number of data stores and a single user action might span multiple subsystems.
Provide rich instrumentation:
For failures that are likely but have not yet occurred, provide enough data to determine the cause, mitigate the
situation, and ensure that the system remains available.
For failures that have already occurred, the application should return an appropriate error message to the user
but should attempt to continue running, albeit with reduced functionality.
Monitoring systems should capture comprehensive details so that applications can be restored efficiently and, if
necessary, designers and developers can modify the system to prevent the situation from recurring.
The raw data for monitoring can come from a variety of sources, including:
Application logs, such as those produced by the Azure Application Insights service.
Operating system performance metrics collected by Azure monitoring agents.
Azure resources, including metrics collected by Azure Monitor.
Azure Service Health, which offers a dashboard to help you track active events.
Azure AD logs built into the Azure platform.
Most Azure services have metrics and diagnostics that you can configure to analyze and determine the cause of
problems. To learn more, see Monitoring data collected by Azure Monitor.

Collection and storage


Raw instrumentation data can be held in various locations and formats, including:
Application trace logs
IIS logs
Performance counters
These disparate sources are collected, consolidated, and placed in reliable data stores in Azure, such as Application
Insights, Azure Monitor metrics, Service Health, storage accounts, and Azure Log Analytics.
Creating good health probes
The health and performance of an application can degrade over time, and degradation might not be noticeable
until the application fails.
Implement probes or check functions, and run them regularly from outside the application. These checks can be as
simple as measuring response time for the application as a whole, for individual parts of the application, for
specific services that the application uses, or for separate components.
Check functions can run processes to ensure that they produce valid results, measure latency and check availability,
and extract information from the system.

The HealthProbesSample sample shows how to set up health probes. It provides an ARM template to set up the infrastructure. A load balancer accepts public requests and load balances them across a set of virtual machines. The health probe is configured to check the service's /Health path.
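A minimal Python sketch of the kind of /Health endpoint such a probe could call; the dependency check is a hypothetical placeholder, and the endpoint returns a non-200 status only when a critical dependency is unhealthy.

# A minimal sketch of a health endpoint for load balancer or Traffic Manager probes.
from flask import Flask, jsonify

app = Flask(__name__)

def database_is_reachable():
    return True   # replace with a cheap check against your critical data store

@app.route("/Health")
def health():
    if not database_is_reachable():
        return jsonify(status="unhealthy"), 503
    return jsonify(status="healthy"), 200

if __name__ == "__main__":
    app.run(port=8080)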

Application logs
Application logs are an important source of diagnostics data. To gain insight when you need it most, follow best
practices for application logging.
Log data in the production environment
Capture robust telemetry data while the application is running in the production environment, so you have
sufficient information to diagnose the cause of issues in the production state.
Log events at service boundaries
Include a correlation ID that flows across service boundaries. If a transaction flows through multiple services and
one of them fails, the correlation ID helps you track requests across your application and pinpoints why the
transaction failed.
Use semantic (structured) logging
With structured logs, it's easier to automate the consumption and analysis of the log data, which is especially
important at cloud scale. Generally, we recommend storing Azure resource metrics and diagnostics data in a Log
Analytics workspace rather than in a storage account. This way, you can use Kusto queries to obtain the data you
want quickly and in a structured format. You can also use Azure Monitor APIs and Azure Log Analytics APIs.
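For example, a minimal Python sketch that emits structured (JSON) log events carrying a correlation ID across service boundaries, using only the standard library; the field names are illustrative.

# A minimal sketch: structured log events that share one correlation ID.
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("app")

def log_event(message, correlation_id, **fields):
    logger.info(json.dumps({"message": message, "correlationId": correlation_id, **fields}))

# Generated at the edge and passed with every downstream call (for example, in an HTTP header).
correlation_id = str(uuid.uuid4())
log_event("order received", correlation_id, orderId="12345")
log_event("payment service called", correlation_id, latencyMs=87)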
Use asynchronous logging
Synchronous logging operations sometimes block your application code, causing requests to back up as logs are
written. Use asynchronous logging to preserve availability during application logging.
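A minimal sketch of asynchronous logging with the Python standard library: records are queued and written by a background thread, so request code does not block on log I/O.

# A minimal sketch: queue log records and write them on a background thread.
import logging
import logging.handlers
import queue

log_queue = queue.Queue(-1)
queue_handler = logging.handlers.QueueHandler(log_queue)
file_handler = logging.FileHandler("app.log")

listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()

logger = logging.getLogger("app")
logger.addHandler(queue_handler)
logger.setLevel(logging.INFO)

logger.info("Handled request")   # enqueued here, written to disk by the listener thread
# Call listener.stop() during shutdown to flush any remaining records.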
Separate application logging from auditing
Audit records are commonly maintained for compliance or regulatory requirements and must be complete. To
avoid dropped transactions, maintain audit logs separately from diagnostic logs.

Analysis and diagnosis


Analyze data consolidated in these data stores to troubleshoot issues and gain an overall view of application
health. Generally, you can search for and analyze the data in Application Insights and Log Analytics using Kusto
queries or view preconfigured graphs using management solutions. Or use Azure Advisor to view
recommendations with a focus on resiliency and performance.
Recover from a region-wide service disruption

Azure is divided physically and logically into units called regions. A region consists of one or more datacenters in
close proximity.
Under rare circumstances, it is possible that facilities in an entire region can become inaccessible, for example due
to network failures. Or facilities can be lost entirely, for example due to a natural disaster. This section explains the
capabilities of Azure for creating applications that are distributed across regions. Such distribution helps to
minimize the possibility that a failure in one region could affect other regions.

Cloud services
Resource management
You can distribute compute instances across regions by creating a separate cloud service in each target region, and
then publishing the deployment package to each cloud service. However, distributing traffic across cloud services
in different regions must be implemented by the application developer or with a traffic management service.
Determining the number of spare role instances to deploy in advance for disaster recovery is an important aspect
of capacity planning. Having a full-scale secondary deployment ensures that capacity is already available when
needed; however, this effectively doubles the cost. A common pattern is to have a small, secondary deployment,
just large enough to run critical services. This small secondary deployment is a good idea, both to reserve capacity,
and for testing the configuration of the secondary environment.

NOTE
The subscription quota is not a capacity guarantee. The quota is simply a credit limit. To guarantee capacity, the required
number of roles must be defined in the service model, and the roles must be deployed.

Load Balancing
To load balance traffic across regions requires a traffic management solution. Azure provides Azure Traffic
Manager. You can also take advantage of third-party services that provide similar traffic management capabilities.
Strategies
Many alternative strategies are available for implementing distributed compute across regions. These must be
tailored to the specific business requirements and circumstances of the application. At a high level, the approaches
can be divided into the following categories:
Redeploy on disaster: In this approach, the application is redeployed from scratch at the time of disaster.
This is appropriate for non-critical applications that don’t require a guaranteed recovery time.
Warm Spare (Active/Passive): A secondary hosted service is created in an alternate region, and roles are
deployed to guarantee minimal capacity; however, the roles don’t receive production traffic. This approach
is useful for applications that have not been designed to distribute traffic across regions.
Hot Spare (Active/Active): The application is designed to receive production load in multiple regions. The
cloud services in each region might be configured for higher capacity than required for disaster recovery
purposes. Alternatively, the cloud services might scale out as necessary at the time of a disaster and fail
over. This approach requires substantial investment in application design, but it has significant benefits.
These include low and guaranteed recovery time, continuous testing of all recovery locations, and efficient
usage of capacity.
A complete discussion of distributed design is outside the scope of this document. For more information, see
Disaster Recovery and High Availability for Azure Applications.

Virtual machines
Recovery of infrastructure as a service (IaaS) virtual machines (VMs) is similar to platform as a service (PaaS)
compute recovery in many respects. There are important differences, however, due to the fact that an IaaS VM
consists of both the VM and the VM disk.
Use Azure Backup to create cross-region backups that are application consistent. Azure Backup enables customers to create application-consistent backups across multiple VM disks and supports replication of backups across regions. You can do this by choosing to geo-replicate the backup vault at the time of creation. Replication of the backup vault must be configured when the vault is created; it can't be set later. If a region is lost, Microsoft will make the backups available to customers. Customers will be able to restore to any of their configured restore points.
Separate the data disk from the operating system disk. An important consideration for IaaS VMs is
that you cannot change the operating system disk without re-creating the VM. This is not a problem if your
recovery strategy is to redeploy after disaster. However, it might be a problem if you are using the Warm
Spare approach to reserve capacity. To implement this properly, you must have the correct operating
system disk deployed to both the primary and secondary locations, and the application data must be stored
on a separate drive. If possible, use a standard operating system configuration that can be provided on both
locations. After a failover, you must then attach the data drive to your existing IaaS VMs in the secondary
DC. Use AzCopy to copy snapshots of the data disk(s) to a remote site.
Be aware of potential consistency issues after a geo-failover of multiple VM disks. VM disks are
implemented as Azure Storage blobs, and have the same geo-replication characteristic. Unless Azure
Backup is used, there are no guarantees of consistency across disks, because geo-replication is
asynchronous and replicates independently. Individual VM disks are guaranteed to be in a crash consistent
state after a geo-failover, but not consistent across disks. This could cause problems in some cases (for
example, in the case of disk striping).
Use Azure Site Recovery to replicate applications across regions. With Azure Site Recovery,
customers don't need to worry about separating data disks from operating system disks or about potential
consistency issues. Azure Site Recovery replicates workloads running on physical and virtual machines
from a primary site (either on-premises or in Azure) to a secondary location (in Azure). When an outage
occurs at the customer's primary site, a failover can be triggered to quickly return the customer to an
operational state. After the primary location is restored, customers can then fail back. They can easily
replicate using recovery points with application-consistent snapshots. These snapshots capture disk data, all
in-memory data, and all in-process transactions. Azure Site Recovery allows customers to keep recovery
time objectives (RTO) and recovery point objectives (RPO) within organizational limits. Customers can also
easily run disaster recovery drills without affecting applications in production. Using recovery plans,
customers can sequence the failover and recovery of multitier applications running on multiple VMs. They
can group machines together in a recovery plan, and optionally add scripts and manual actions. Recovery
plans can be integrated with Azure Automation runbooks.

Storage
Recovery by using geo-redundant storage of blob, table, queue, and VM disk storage
In Azure, blobs, tables, queues, and VM disks are all geo-replicated by default. This is referred to as geo-redundant
storage (GRS). GRS replicates storage data to a paired datacenter located hundreds of miles apart within a specific
geographic region. GRS is designed to provide additional durability in case there is a major datacenter disaster.
Microsoft controls when failover occurs, and failover is limited to major disasters in which the original primary
location is deemed unrecoverable in a reasonable amount of time. Under some scenarios, this can be several days.
Data is typically replicated within a few minutes, although synchronization interval is not yet covered by a service
level agreement.
If a geo-failover occurs, there will be no change to how the account is accessed (the URL and account key will not
change). The storage account will, however, be in a different region after failover. This could impact applications
that require regional affinity with their storage account. Even for services and applications that do not require a
storage account in the same datacenter, the cross-datacenter latency and bandwidth charges might be compelling
reasons to move traffic to the failover region temporarily. This could factor into an overall disaster recovery
strategy.
In addition to automatic failover provided by GRS, Azure has introduced a service that gives you read access to the
copy of your data in the secondary storage location. This is called read-access geo-redundant storage (RA-GRS).
For more information about both GRS and RA-GRS storage, see Azure Storage replication.
Geo-replication region mappings
It is important to know where your data is geo-replicated, in order to know where to deploy the other instances of
your data that require regional affinity with your storage. For more information, see Azure Paired Regions.
Geo-replication pricing
Geo-replication is included in current pricing for Azure Storage. This is called geo-redundant storage (GRS). If you
do not want your data geo-replicated, you can disable geo-replication for your account. This is called locally
redundant storage (LRS), and it is charged at a discounted price compared to GRS.
Determining if a geo-failover has occurred
If a geo-failover occurs, this will be posted to the Azure Service Health Dashboard. Applications can implement an
automated means of detecting this, however, by monitoring the geo-region for their storage account. This can be
used to trigger other recovery operations, such as activation of compute resources in the geo-region where their
storage moved to. You can perform a query for this from the service management API, by using Get Storage
Account Properties. The relevant properties are:

<GeoPrimaryRegion>primary-region</GeoPrimaryRegion>
<StatusOfPrimary>[Available|Unavailable]</StatusOfPrimary>
<LastGeoFailoverTime>DateTime</LastGeoFailoverTime>
<GeoSecondaryRegion>secondary-region</GeoSecondaryRegion>
<StatusOfSecondary>[Available|Unavailable]</StatusOfSecondary>
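For comparison, a rough sketch of reading the equivalent properties through the current Azure Resource Manager SDK for Python; it assumes the azure-mgmt-storage and azure-identity packages, and the subscription, resource group, and account names are placeholders.

# A rough sketch: check primary/secondary status and the last geo-failover time.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

storage = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")
account = storage.storage_accounts.get_properties("<resource-group>", "<account-name>")

print("Primary:", account.primary_location, account.status_of_primary)
print("Secondary:", account.secondary_location, account.status_of_secondary)
print("Last geo-failover time:", account.last_geo_failover_time)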

Database
SQL Database
Azure SQL Database provides two types of recovery: geo-restore and active geo-replication.
Geo-restore
Geo-restore is also available with Basic, Standard, and Premium databases. It provides the default recovery option
when the database is unavailable because of an incident in the region where your database is hosted. Similar to
point-in-time restore, geo-restore relies on database backups in geo-redundant Azure storage. It restores from the
geo-replicated backup copy, and therefore is resilient to the storage outages in the primary region. For more
information, see Restore an Azure SQL Database or failover to a secondary.
Active geo-replication
Active geo-replication is available for all database tiers. It’s designed for applications that have more aggressive
recovery requirements than geo-restore can offer. Using active geo-replication, you can create up to four readable
secondaries on servers in different regions. You can initiate failover to any of the secondaries. In addition, active
geo-replication can be used to support the application upgrade or relocation scenarios, as well as load balancing
for read-only workloads. For details, see Configure active geo-replication for Azure SQL Database and initiate
failover. Refer to Designing globally available services using Azure SQL Database and Managing rolling upgrades
of cloud applications by using SQL Database active geo-replication for details on how to design and implement
applications and application upgrades without downtime.
SQL Server on Azure Virtual Machines
A variety of options are available for recovery and high availability for SQL Server 2012 (and later) running in
Azure Virtual Machines. For more information, see High availability and disaster recovery for SQL Server in Azure
Virtual Machines.

Other Azure platform services


When attempting to run your cloud service in multiple Azure regions, you must consider the implications for each
of your dependencies. In the following sections, the service-specific guidance assumes that you must use the same
Azure service in an alternate Azure datacenter. This involves both configuration and data-replication tasks.

NOTE
In some cases, these steps can help to mitigate a service-specific outage rather than an entire datacenter event. From the
application perspective, a service-specific outage might be just as limiting and would require temporarily migrating the
service to an alternate Azure region.

Service Bus
Azure Service Bus uses a unique namespace that does not span Azure regions. So the first requirement is to set up
the necessary service bus namespaces in the alternate region. However, there are also considerations for the
durability of the queued messages. There are several strategies for replicating messages across Azure regions. For
the details on these replication strategies and other disaster recovery strategies, see Best practices for insulating
applications against Service Bus outages and disasters.
App Service
To migrate an Azure App Service application, such as Web Apps or Mobile Apps, to a secondary Azure region, you
must have a backup of the website available for publishing. If the outage does not involve the entire Azure
datacenter, it might be possible to use FTP to download a recent backup of the site content. Then create a new app
in the alternate region, unless you have previously done this to reserve capacity. Publish the site to the new region,
and make any necessary configuration changes. These changes could include database connection strings or other
region-specific settings. If necessary, add the site’s SSL certificate and change the DNS CNAME record so that the
custom domain name points to the redeployed Azure Web App URL.
HDInsight
The data associated with HDInsight is stored by default in Azure Blob Storage. HDInsight requires that a Hadoop
cluster processing MapReduce jobs must be colocated in the same region as the storage account that contains the
data being analyzed. Provided you use the geo-replication feature available to Azure Storage, you can access your
data in the secondary region where the data was replicated if for some reason the primary region is no longer
available. You can create a new Hadoop cluster in the region where the data has been replicated and continue
processing it.
SQL Reporting
At this time, recovering from the loss of an Azure region requires multiple SQL Reporting instances in different
Azure regions. These SQL Reporting instances should access the same data, and that data should have its own
recovery plan in the event of a disaster. You can also maintain external backup copies of the RDL file for each
report.
Media Services
Azure Media Services has a different recovery approach for encoding and streaming. Typically, streaming is more
critical during a regional outage. To prepare for this, you should have a Media Services account in two different
Azure regions. The encoded content should be located in both regions. During a failure, you can redirect the
streaming traffic to the alternate region. Encoding can be performed in any Azure region. If encoding is time-
sensitive, for example during live event processing, you must be prepared to submit jobs to an alternate
datacenter during failures.
Virtual network
Configuration files provide the quickest way to set up a virtual network in an alternate Azure region. After
configuring the virtual network in the primary Azure region, export the virtual network settings for the current
network to a network configuration file. If an outage occurs in the primary region, restore the virtual network from
the stored configuration file. Then configure other cloud services, virtual machines, or cross-premises settings to
work with the new virtual network.

Checklists for disaster recovery


Cloud Services checklist
1. Review the Cloud Services section of this document.
2. Create a cross-region disaster recovery strategy.
3. Understand trade-offs in reserving capacity in alternate regions.
4. Use traffic routing tools, such as Azure Traffic Manager.
Virtual Machines checklist
1. Review the Virtual Machines section of this document.
2. Use Azure Backup to create application consistent backups across regions.
Storage checklist
1. Review the Storage section of this document.
2. Do not disable geo-replication of storage resources.
3. Understand alternate region for geo-replication if a failover occurs.
4. Create custom backup strategies for user-controlled failover strategies.
SQL Database checklist
1. Review the SQL Database section of this document.
2. Use geo-restore or active geo-replication as appropriate.
SQL Server on Virtual Machines checklist
1. Review the SQL Server on Virtual Machines section of this document.
2. Use cross-region Always On Availability Groups or database mirroring.
3. Alternately use backup and restore to blob storage.
Service Bus checklist
1. Review the Service Bus section of this document.
2. Configure a Service Bus namespace in an alternate region.
3. Consider custom replication strategies for messages across regions.
App Service checklist
1. Review the App Service section of this document.
2. Maintain site backups outside of the primary region.
3. If outage is partial, attempt to retrieve current site with FTP.
4. Plan to deploy the site to new or existing site in an alternate region.
5. Plan configuration changes for both application and DNS CNAME records.
HDInsight checklist
1. Review the HDInsight section of this document.
2. Create a new Hadoop cluster in the region with replicated data.
SQL Reporting checklist
1. Review the SQL Reporting section of this document.
2. Maintain an alternate SQL Reporting instance in a different region.
3. Maintain a separate plan to replicate the target to that region.
Media Services checklist
1. Review the Media Services section of this document.
2. Create a Media Services account in an alternate region.
3. Encode the same content in both regions to support streaming failover.
4. Submit encoding jobs to an alternate region if a service disruption occurs.
Virtual Network checklist
1. Review the Virtual Network section of this document.
2. Use exported virtual network settings to re-create it in another region.
Testing Azure applications for resiliency and
availability

To test resiliency, you should verify how the end-to-end workload performs under intermittent failure conditions.
Run tests in production using both synthetic and real user data. Test and production are rarely identical, so it's
important to validate your application in production using a blue-green or canary deployment. This way, you're
testing the application under real conditions, so you can be sure that it will function as expected when fully
deployed.
As part of your test plan, include:
Chaos engineering
Automated predeployment testing
Fault injection testing
Peak load testing
Disaster recovery testing
Third-party service testing

Chaos engineering
Harden services against failures in production to attain consistent reliability by adopting chaos engineering.

Simulation testing
Simulation testing involves creating small, real-life situations. Simulations demonstrate the effectiveness of the
solutions in the recovery plan and highlight any issues that weren't adequately addressed.
As you perform simulation testing, follow best practices:
Conduct simulations in a manner that doesn't disrupt actual business but feels like a real situation.
Make sure that simulated scenarios are completely controllable. If the recovery plan seems to be failing, you can
restore the situation back to normal without causing damage.
Inform management about when and how the simulation exercises will be conducted. Your plan should detail
the time frame and the resources affected during the simulation.

Perform fault injection testing


For fault injection testing, check the resiliency of the system during failures, either by triggering actual failures or
by simulating them. Here are some strategies to induce failures:
Shut down virtual machine (VM) instances.
Crash processes.
Expire certificates.
Change access keys.
Shut down the DNS service on domain controllers.
Limit available system resources, such as RAM or number of threads.
Unmount disks.
Redeploy a VM.
Your test plan should incorporate possible failure points identified during the design phase, in addition to common
failure scenarios:
Test your application in an environment as close to production as possible.
Test failures in combination.
Measure the recovery times, and be sure that your business requirements are met.
Verify that failures don't cascade and are handled in an isolated way.
For chaos engineering best practices and more information about failure scenarios, see Failure and disaster recovery for Azure applications.
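As a purely illustrative sketch of fault injection at the code level (not a substitute for a chaos-engineering tool), a dependency call can be wrapped so that it fails at a configured rate during a test run:

# A minimal sketch: inject failures into a dependency call at a configured rate.
import random

def with_fault_injection(operation, failure_rate=0.2):
    def wrapped(*args, **kwargs):
        if random.random() < failure_rate:
            raise ConnectionError("Injected fault for resiliency testing")
        return operation(*args, **kwargs)
    return wrapped

# Usage during a test run: wrap the real dependency call.
# get_orders = with_fault_injection(get_orders, failure_rate=0.3)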

Test under peak loads


Load testing is crucial for identifying failures that only happen under load, such as the back-end database being
overwhelmed or service throttling. Test for peak load and anticipated increase in peak load, using production data
or synthetic data that is as close to production data as possible. Your goal is to see how the application behaves
under real-world conditions.
Resiliency checklist for specific Azure services

Resiliency is the ability of a system to recover from failures and continue to function. Every technology has its own
particular failure modes, which you must consider when designing and implementing your application. Use this
checklist to review the resiliency considerations for specific Azure services. For more information about designing
resilient applications, see Design reliable Azure applications.

App Service
Use Standard or Premium tier. These tiers support staging slots and automated backups. For more information,
see Azure App Service plans in-depth overview.
Avoid scaling up or down. Instead, select a tier and instance size that meet your performance requirements
under typical load, and then scale out the instances to handle changes in traffic volume. Scaling up and down may
trigger an application restart.
Store configuration as app settings. Use app settings to hold configuration values. Define the settings in your Resource Manager templates, or by using PowerShell, so that you can apply them as part of an automated deployment or update process, which is more reliable. For more information, see Configure web apps in
Azure App Service.
Create separate App Service plans for production and test. Don't use slots on your production deployment
for testing. All apps within the same App Service plan share the same VM instances. If you put production and test
deployments in the same plan, it can negatively affect the production deployment. For example, load tests might
degrade the live production site. By putting test deployments into a separate plan, you isolate them from the
production version.
Separate web apps from web APIs. If your solution has both a web front end and a web API, consider
decomposing them into separate App Service apps. This design makes it easier to decompose the solution by
workload. You can run the web app and the API in separate App Service plans, so they can be scaled independently.
If you don't need that level of scalability at first, you can deploy the apps into the same plan, and move them into
separate plans later, if needed.
Avoid using the App Service backup feature to back up Azure SQL databases. Instead, use SQL Database
automated backups. App Service backup exports the database to a SQL BACPAC file, which costs DTUs.
Deploy to a staging slot. Create a deployment slot for staging. Deploy application updates to the staging slot,
and verify the deployment before swapping it into production. This reduces the chance of a bad update in
production. It also ensures that all instances are warmed up before being swapped into production. Many
applications have a significant warmup and cold-start time. For more information, see Set up staging environments
for web apps in Azure App Service.
Create a deployment slot to hold the last-known-good (LKG) deployment. When you deploy an update to
production, move the previous production deployment into the LKG slot. This makes it easier to roll back a bad
deployment. If you discover a problem later, you can quickly revert to the LKG version. For more information, see
Basic web application.
Enable diagnostics logging, including application logging and web server logging. Logging is important for monitoring and diagnostics. See Enable diagnostics logging for web apps in Azure App Service.
Log to blob storage. This makes it easier to collect and analyze the data.
Create a separate storage account for logs. Don't use the same storage account for logs and application data.
This helps to prevent logging from reducing application performance.
Monitor performance. Use a performance monitoring service such as New Relic or Application Insights to
monitor application performance and behavior under load. Performance monitoring gives you real-time insight
into the application. It enables you to diagnose issues and perform root-cause analysis of failures.

Application Gateway
Provision at least two instances. Deploy Application Gateway with at least two instances. A single instance is a
single point of failure. Use two or more instances for redundancy and scalability. In order to qualify for the SLA, you
must provision two or more medium or larger instances.

Cosmos DB
Replicate the database across regions. Cosmos DB allows you to associate any number of Azure regions with
a Cosmos DB database account. A Cosmos DB database can have one write region and multiple read regions. If
there is a failure in the write region, you can read from another replica. The Client SDK handles this automatically.
You can also fail over the write region to another region. For more information, see How to distribute data globally
with Azure Cosmos DB.
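A rough sketch of the client side, assuming the azure-cosmos Python package: the preferred_locations setting lists the regions the client should try for reads, in order, so a read-region outage falls back to the next region. The account URI, key, names, and regions are placeholders.

# A rough sketch: prefer an ordered list of read regions.
from azure.cosmos import CosmosClient

client = CosmosClient(
    "<account-uri>",
    credential="<account-key>",
    preferred_locations=["West Europe", "North Europe"],   # ordered read preference
)
container = client.get_database_client("appdb").get_container_client("orders")
item = container.read_item(item="42", partition_key="42")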

Event Hubs
Use checkpoints. An event consumer should write its current position to persistent storage at some predefined
interval. That way, if the consumer experiences a fault (for example, the consumer crashes, or the host fails), then a
new instance can resume reading the stream from the last recorded position. For more information, see Event
consumers.
Handle duplicate messages. If an event consumer fails, message processing is resumed from the last recorded
checkpoint. Any messages that were already processed after the last checkpoint will be processed again. Therefore,
your message processing logic must be idempotent, or the application must be able to deduplicate messages.
Handle exceptions. An event consumer typically processes a batch of messages in a loop. You should handle
exceptions within this processing loop to avoid losing an entire batch of messages if a single message causes an
exception.
Use a dead-letter queue. If processing a message results in a nontransient failure, put the message onto a dead-
letter queue, so that you can track the status. Depending on the scenario, you might retry the message later, apply a
compensating transaction, or take some other action. Note that Event Hubs does not have any built-in dead-letter
queue functionality. You can use Azure Queue Storage or Service Bus to implement a dead-letter queue, or use
Azure Functions or some other eventing mechanism.
Implement disaster recovery by failing over to a secondary Event Hubs namespace. For more information, see Azure Event Hubs Geo-disaster recovery.
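Building on the checkpoint, duplicate-handling, and dead-letter guidance above, the following Python sketch consumes events with a blob-based checkpoint store, skips events it has already processed, and hands non-transient failures to a dead-letter routine. It assumes the azure-eventhub and azure-eventhub-checkpointstoreblob packages; the connection strings, names, handle function, and NonTransientError type are placeholders.

# A rough sketch: checkpointing, idempotent processing, and dead-lettering for Event Hubs.
from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

class NonTransientError(Exception):
    """Placeholder for failures that retrying will not fix."""

def handle(event):
    print("processing", event.body_as_str())     # hypothetical business logic

def send_to_dead_letter(event, error):
    pass   # for example, write the event to an Azure Storage queue for later inspection

checkpoint_store = BlobCheckpointStore.from_connection_string(
    "<storage-connection-string>", container_name="checkpoints")

consumer = EventHubConsumerClient.from_connection_string(
    "<eventhub-connection-string>", consumer_group="$Default",
    eventhub_name="telemetry", checkpoint_store=checkpoint_store)

processed_ids = set()   # in production, use a durable store for deduplication

def on_event(partition_context, event):
    event_id = event.properties.get(b"messageId") if event.properties else None
    if event_id is not None and event_id in processed_ids:
        partition_context.update_checkpoint(event)   # duplicate after a restart; skip it
        return
    try:
        handle(event)
        if event_id is not None:
            processed_ids.add(event_id)
    except NonTransientError as ex:
        send_to_dead_letter(event, ex)               # don't lose the batch over one bad event
    partition_context.update_checkpoint(event)

with consumer:
    consumer.receive(on_event=on_event, starting_position="-1")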

Azure Cache for Redis


Configure geo-replication. Geo-replication provides a mechanism for linking two Premium-tier Azure Cache for Redis instances. Data written to the primary cache is replicated to a secondary read-only cache. For more information, see How to configure geo-replication for Azure Cache for Redis.
Configure data persistence. Redis persistence allows you to persist data stored in Redis. You can also take snapshots and back up the data, which you can load in case of a hardware failure. For more information, see How to configure data persistence for a Premium-tier Azure Cache for Redis.
If you are using Azure Cache for Redis as a temporary data cache and not as a persistent store, these
recommendations may not apply.

Search
Provision more than one replica. Use at least two replicas for read high-availability, or three for read-write
high-availability.
Configure indexers for multi-region deployments. If you have a multi-region deployment, consider your
options for continuity in indexing.
If the data source is geo-replicated, you should generally point each indexer of each regional Azure Search
service to its local data source replica. However, that approach is not recommended for large datasets stored
in Azure SQL Database. The reason is that Azure Search cannot perform incremental indexing from
secondary SQL Database replicas, only from primary replicas. Instead, point all indexers to the primary
replica. After a failover, point the Azure Search indexers at the new primary replica.
If the data source is not geo-replicated, point multiple indexers at the same data source, so that Azure Search
services in multiple regions continuously and independently index from the data source. For more
information, see Azure Search performance and optimization considerations.

Service Bus
Use Premium tier for production workloads. Service Bus Premium Messaging provides dedicated and
reserved processing resources, and memory capacity to support predictable performance and throughput.
Premium Messaging tier also gives you access to new features that are available only to premium customers at
first. You can decide the number of messaging units based on expected workloads.
Handle duplicate messages . If a publisher fails immediately after sending a message, or experiences network or
system issues, it may erroneously fail to record that the message was delivered, and may send the same message
to the system twice. Service Bus can handle this issue by enabling duplicate detection. For more information, see
Duplicate detection.
Handle exceptions . Messaging APIs generate exceptions when a user error, configuration error, or other error
occurs. The client code (senders and receivers) should handle these exceptions in their code. This is especially
important in batch processing, where exception handling can be used to avoid losing an entire batch of messages.
For more information, see Service Bus messaging exceptions.
Retry policy. Service Bus allows you to pick the best retry policy for your applications. The default policy allows a maximum of 9 retry attempts with a 30-second wait, and this can be adjusted further. For more information, see Retry policy – Service Bus.
Use a dead-letter queue. If a message cannot be processed or delivered to any receiver after multiple retries, it
is moved to a dead letter queue. Implement a process to read messages from the dead letter queue, inspect them,
and remediate the problem. Depending on the scenario, you might retry the message as-is, make changes and
retry, or discard the message. For more information, see Overview of Service Bus dead-letter queues.
Use geo-disaster recovery. Geo-disaster recovery ensures that data processing continues to operate in a
different region or datacenter if an entire Azure region or datacenter becomes unavailable due to a disaster. For
more information, see Azure Service Bus Geo-disaster recovery.
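A rough sketch of reading and remediating dead-lettered messages, assuming the azure-servicebus (v7) Python package; the connection string and queue name are placeholders.

# A rough sketch: inspect a queue's dead-letter subqueue and complete remediated messages.
from azure.servicebus import ServiceBusClient, ServiceBusSubQueue

with ServiceBusClient.from_connection_string("<servicebus-connection-string>") as client:
    dlq = client.get_queue_receiver(queue_name="orders", sub_queue=ServiceBusSubQueue.DEAD_LETTER)
    with dlq:
        for message in dlq.receive_messages(max_message_count=10, max_wait_time=5):
            print("Dead-lettered:", message.dead_letter_reason, str(message))
            # After remediating (resubmit, fix, or discard), remove the message from the subqueue.
            dlq.complete_message(message)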

Storage
For application data, use read-access geo-redundant storage (RA-GRS). RA-GRS storage replicates the
data to a secondary region, and provides read-only access from the secondary region. If there is a storage outage
in the primary region, the application can read the data from the secondary region. For more information, see
Azure Storage replication.
For VM disks, use managed disks. Managed disks provide better reliability for VMs in an availability set,
because the disks are sufficiently isolated from each other to avoid single points of failure. Also, managed disks
aren't subject to the IOPS limits of VHDs created in a storage account. For more information, see Manage the
availability of Windows virtual machines in Azure.
For Queue storage, create a backup queue in another region. For Queue storage, a read-only replica has
limited use, because you can't queue or dequeue items. Instead, create a backup queue in a storage account in
another region. If there is a storage outage, the application can use the backup queue, until the primary region
becomes available again. That way, the application can still process new requests.
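A minimal sketch of the backup-queue pattern, assuming the azure-storage-queue Python package; the connection strings, queue name, and bare exception handling are illustrative placeholders.

# A minimal sketch: fall back to a backup queue in another region during a storage outage.
from azure.storage.queue import QueueClient

primary = QueueClient.from_connection_string("<primary-connection-string>", "orders")
backup = QueueClient.from_connection_string("<secondary-region-connection-string>", "orders")

def enqueue(message):
    try:
        primary.send_message(message)
    except Exception:
        # Primary-region outage: keep accepting new work through the backup queue.
        backup.send_message(message)

enqueue("order-12345")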

SQL Database
Use Standard or Premium tier. These tiers provide a longer point-in-time restore period (35 days). For more
information, see SQL Database options and performance.
Enable SQL Database auditing. Auditing can be used to diagnose malicious attacks or human error. For more
information, see Get started with SQL database auditing.
Use active geo-replication. Create a readable secondary in a different region with active geo-replication. If
your primary database fails, or simply needs to be taken offline, perform a manual failover to the secondary
database. Until you fail over, the secondary database remains read-only. For more information, see SQL Database
Active Geo-Replication.
Use sharding. Consider using sharding to partition the database horizontally. Sharding can provide fault isolation.
For more information, see Scaling out with Azure SQL Database.
Use point-in-time restore to recover from human error. Point-in-time restore returns your database to an
earlier point in time. For more information, see Recover an Azure SQL database using automated database
backups.
Use geo-restore to recover from a ser vice outage. Geo-restore restores a database from a geo-redundant
backup. For more information, see Recover an Azure SQL database using automated database backups.

Azure Synapse Analytics


Do not disable geo-backup. By default, Synapse Analytics takes a full backup of your data every 24 hours for
disaster recovery. It is not recommended to turn this feature off. For more information, see Geo-backups.

SQL Server running in a VM


Replicate the database. Use SQL Server Always On availability groups to replicate the database. This provides high availability if one SQL Server instance fails. For more information, see Run Windows VMs for an N-tier application.
Back up the database. If you are already using Azure Backup to back up your VMs, consider using Azure Backup
for SQL Server workloads using DPM. With this approach, there is one backup administrator role for the
organization and a unified recovery procedure for VMs and SQL Server. Otherwise, use SQL Server Managed
Backup to Microsoft Azure.

Traffic Manager
Perform manual failback. After a Traffic Manager failover, perform manual failback, rather than automatically
failing back. Before failing back, verify that all application subsystems are healthy. Otherwise, you can create a
situation where the application flips back and forth between datacenters. For more information, see Run VMs in
multiple regions for high availability.
Create a health probe endpoint. Create a custom endpoint that reports on the overall health of the application.
This enables Traffic Manager to fail over if any critical path fails, not just the front end. The endpoint should return
an HTTP error code if any critical dependency is unhealthy or unreachable. Don't report errors for non-critical
services, however. Otherwise, the health probe might trigger failover when it's not needed, creating false positives.
For more information, see Traffic Manager endpoint monitoring and failover.

Virtual Machines
Avoid running a production workload on a single VM. A single VM deployment is not resilient to planned or
unplanned maintenance. Instead, put multiple VMs in an availability set or virtual machine scale set, with a load
balancer in front.
Specify an availability set when you provision the VM. Currently, there is no way to add a VM to an
availability set after the VM is provisioned. When you add a new VM to an existing availability set, make sure to
create a NIC for the VM, and add the NIC to the back-end address pool on the load balancer. Otherwise, the load
balancer won't route network traffic to that VM.
Put each application tier into a separate Availability Set. In an N-tier application, don't put VMs from
different tiers into the same availability set. VMs in an availability set are placed across fault domains (FDs) and
update domains (UD). However, to get the redundancy benefit of FDs and UDs, every VM in the availability set must
be able to handle the same client requests.
Replicate VMs using Azure Site Recovery. When you replicate Azure VMs using Site Recovery, all the VM
disks are continuously replicated to the target region asynchronously. The recovery points are created every few
minutes. This gives you a Recovery Point Objective (RPO) in the order of minutes. You can conduct disaster
recovery drills as many times as you want, without affecting the production application or the ongoing replication.
For more information, see Run a disaster recovery drill to Azure.
Choose the right VM size based on performance requirements. When moving an existing workload to
Azure, start with the VM size that's the closest match to your on-premises servers. Then measure the performance
of your actual workload with respect to CPU, memory, and disk IOPS, and adjust the size if needed. This helps to
ensure the application behaves as expected in a cloud environment. Also, if you need multiple NICs, be aware of the
NIC limit for each size.
Use managed disks for VHDs. Managed disks provide better reliability for VMs in an availability set, because
the disks are sufficiently isolated from each other to avoid single points of failure. Also, managed disks aren't
subject to the IOPS limits of VHDs created in a storage account. For more information, see Manage the availability
of Windows virtual machines in Azure.
Install applications on a data disk, not the OS disk. Otherwise, you may reach the disk size limit.
Use Azure Backup to back up VMs. Backups protect against accidental data loss. For more information, see
Protect Azure VMs with a Recovery Services vault.
Enable diagnostic logs. Include basic health metrics, infrastructure logs, and boot diagnostics. Boot diagnostics
can help you diagnose a boot failure if your VM gets into a nonbootable state. For more information, see Overview
of Azure Diagnostic Logs.
Configure Azure Monitor. Collect and analyze monitoring data from Azure virtual machines, including the guest operating system and the workloads that run in it. For more information, see Azure Monitor and Quickstart: Azure Monitor.

Virtual Network
To allow or block public IP addresses, add a network security group to the subnet. Block access from
malicious users, or allow access only from users who have privilege to access the application.
Create a custom health probe. Load Balancer Health Probes can test either HTTP or TCP. If a VM runs an HTTP
server, the HTTP probe is a better indicator of health status than a TCP probe. For an HTTP probe, use a custom
endpoint that reports the overall health of the application, including all critical dependencies. For more information,
see Azure Load Balancer overview.
Don't block the health probe. The Load Balancer Health probe is sent from a known IP address, 168.63.129.16.
Don't block traffic to or from this IP in any firewall policies or network security group rules. Blocking the health
probe would cause the load balancer to remove the VM from rotation.
Enable Load Balancer logging. The logs show how many VMs on the back-end are not receiving network traffic
due to failed probe responses. For more information, see Log analytics for Azure Load Balancer.
Overview of the security pillar

Information Security has always been a complex subject, and it evolves quickly with the creative ideas and
implementations of attackers and security researchers. The origin of security vulnerabilities started with
identifying and exploiting common programming errors and unexpected edge cases. However over time, the
attack surface that an attacker may explore and exploit has expanded well beyond that. Attackers now freely
exploit vulnerabilities in system configurations, operational practices, and the social habits of the systems’ users.
As system complexity, connectedness, and the variety of users increase, attackers have more opportunities to
identify unprotected edge cases and to “hack” systems into doing things they were not designed to do.
Security is one of the most important aspects of any architecture. It provides confidentiality, integrity, and
availability assurances against deliberate attacks and abuse of your valuable data and systems. Losing these
assurances can negatively impact your business operations and revenue, as well as your organization’s reputation
in the marketplace. In the following series of articles, we’ll discuss key architectural considerations and principles
for security and how they apply to Azure.
To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, see the
Microsoft Azure Well-Architected Review.
These are the topics we cover in the security pillar of the Microsoft Azure Well-Architected Framework:

Role of security: Security is one of the most important aspects of any architecture. Security provides confidentiality, integrity, and availability assurances against deliberate attacks and abuse of your valuable data and systems.

Security design principles: These principles support these three key strategies and describe a securely architected system hosted on cloud or on-premises datacenters (or a combination of both).

Types of attacks to resist: An architecture built on good security practices should be resilient to attacks. It should both resist attacks and recover rapidly from disruption to the security assurances of confidentiality, integrity, and availability.

Regulatory compliance: Governments and other organizations frequently publish standards to help define good security practices (due diligence) so that organizations can avoid being negligent in security.

Reduce organizational risk: Much like physical safety, success in information security is defined more as an ongoing task of applying good security practices and principles and hygiene rather than a static absolute state.

Administration: Administration is the practice of monitoring, maintaining, and operating Information Technology (IT) systems to meet service levels that the business requires. Administration introduces some of the highest impact security risks because performing these tasks requires privileged access to a very broad set of these systems and applications.

Applications and services: Applications and the data associated with them ultimately act as the primary store of business value on a cloud platform.

Governance, risk, and compliance: How is the organization's security going to be monitored, audited, and reported? What types of risks does the organization face while trying to protect identifiable information, Intellectual Property (IP), and financial information? Are there specific industry, government, or regulatory requirements that dictate or provide recommendations on criteria that your organization's security controls must meet?

Identity and access management: Identity provides the basis of a large percentage of security assurances.

Info protection and storage: Protecting data at rest is required to maintain confidentiality, integrity, and availability assurances across all workloads.

Network security and containment: Network security has been the traditional linchpin of enterprise security efforts. However, cloud computing has increased the requirement for network perimeters to be more porous, and many attackers have mastered the art of attacks on identity system elements (which nearly always bypass network controls).

Security Operations: Security operations maintain and restore the security assurances of the system as live adversaries attack it. The tasks of security operations are described well by the NIST Cybersecurity Framework functions of Detect, Respond, and Recover.
Security design principles
11/2/2020 • 5 minutes to read

These principles support key security strategies and describe a securely architected system hosted on cloud or
on-premises datacenters (or a combination of both). Application of these principles will dramatically increase the
likelihood your security architecture will maintain assurances of confidentiality, integrity, and availability.
Each recommendation in this document includes a description of why it is recommended, which maps to one or
more of these principles:
Align Security Priorities to Mission – Security resources are almost always limited, so prioritize efforts
and assurances by aligning security strategy and technical controls to the business using classification of
data and systems. Security resources should be focused first on people and assets (systems, data, accounts,
etc.) with intrinsic business value and those with administrative privileges over business critical assets.
Build a Comprehensive Strategy – A security strategy should consider investments in culture, processes,
and security controls across all system components. The strategy should also consider security for the full
lifecycle of system components including the supply chain of software, hardware, and services.
Drive Simplicity – Complexity in systems leads to increased human confusion, errors, automation failures,
and difficulty of recovering from an issue. Favor simple and consistent architectures and implementations.
Design for Attackers – Your security design and prioritization should be focused on the way attackers see
your environment, which is often not the way IT and application teams see it. Inform your security design
and test it with penetration testing to simulate one-time attacks and red teams to simulate long-term
persistent attack groups. Design your enterprise segmentation strategy and other security controls to
contain attacker lateral movement within your environment. Actively measure and reduce the potential
Attack Surface that attackers target for exploitation for resources within the environment.
Leverage Native Controls – Favor native security controls built into cloud services over external controls
from third parties. Native security controls are maintained and supported by the service provider,
eliminating or reducing effort required to integrate external security tooling and update those integrations
over time.
Use Identity as Primary Access Control – Access to resources in cloud architectures is primarily
governed by identity-based authentication and authorization for access controls. Your account control
strategy should rely on identity systems for controlling access rather than relying on network controls or
direct use of cryptographic keys.
Accountability – Designate clear ownership of assets and security responsibilities and ensure actions are
traceable for nonrepudiation. You should also ensure entities have been granted the least privilege required
(to a manageable level of granularity).
Embrace Automation - Automation of tasks decreases the chance of human error that can create risk, so
both IT operations and security best practices should be automated as much as possible to reduce human
errors (while ensuring skilled humans govern and audit the automation).
Focus on Information Protection – Intellectual property is frequently one of the biggest repositories of
organizational value and this data should be protected anywhere it goes including cloud services, mobile
devices, workstations, or collaboration platforms (without impeding collaboration that allows for business
value creation). Your security strategy should be built around classifying information and assets to enable
security prioritization, leveraging strong access control and encryption technology, and meeting business
needs like productivity, usability, and flexibility.
Design for Resilience – Your security strategy should assume that controls will fail and design
accordingly. Making your security posture more resilient requires several approaches working together:
Balanced Investment – across core functions spanning the full NIST Cybersecurity Framework
lifecycle (identify, protect, detect, respond, and recover) to ensure that attackers who successfully
evade preventive controls lose access through detection, response, and recovery capabilities.
Ongoing maintenance – of security controls and assurances to ensure that they don't decay over time
with changes to the environment or neglect.
Ongoing vigilance – to ensure that anomalies and potential threats that could pose risks to the
organization are addressed in a timely manner.
Defense in depth – includes additional controls in the design to mitigate risk to the
organization in the event a primary security control fails. This design should consider how likely the
primary control is to fail, the potential organizational risk if it does, and the effectiveness of the
additional control (especially in the likely cases that would cause the primary control to fail).
Least Privilege – This is a form of defense in depth to limit the damage that can be done by any one
account. Accounts should be granted the least amount of privilege required to accomplish their
assigned tasks, by access permissions and by time. This helps mitigate the damage of an external
attacker who gains access to the account and/or an internal employee who inadvertently or
deliberately (for example, insider attack) compromises security assurances.
Baseline and Benchmark – To ensure your organization considers current thinking from outside sources,
evaluate your strategy and configuration against external references (including compliance requirements).
This helps to validate your approaches, minimize the risk of inadvertent oversight, and reduce the risk of punitive
fines from noncompliance.
Drive Continuous Improvement – Systems and existing practices should be regularly evaluated and
improved to ensure they remain effective against attackers, who continuously improve, and to keep pace with the
continuous digital transformation of the enterprise. This should include processes that proactively integrate
learnings from real world attacks, realistic penetration testing and red team activities, and other sources as
available.
Assume Zero Trust – When evaluating access requests, all requesting users, devices, and applications
should be considered untrusted until their integrity can be sufficiently validated. Access requests should be
granted conditionally based on the requestor's trust level and the target resource's sensitivity. Reasonable
attempts should be made to offer means to increase trust validation (for example, request multi-factor
authentication) and remediate known risks (for example, change a known-leaked password, remediate a malware
infection) to support productivity goals.
Educate and incentivize security – The humans who design and operate the cloud workloads
are part of the whole system. It is critical to ensure that these people are educated, informed, and
incentivized to support the security assurance goals of the system. This is particularly important for people
with accounts granted broad administrative privileges.
Enforce governance to reduce risks
11/2/2020 • 2 minutes to read

As part of overall design, prioritize where to invest the available resources: financial, people, and time. Constraints
on those resources also affect the security implementation across the organization. Set organizational policies for
operations, technologies, and configurations based on internal factors (business requirements, risks, asset
evaluation) and external factors (benchmarks, regulatory standards, threat environment).
Before defining the policies, consider:
How is the organization's security monitored, audited, and reported? Is there mandatory reporting?
Are the existing security practices working?
Are there new requirements?
Are there any requirements specific to industry, government, or regulatory standards?
Designate group(s) (or individual roles) for central functions that affect shared services and applications.
After the policies are set, continuously improve those standards incrementally. Make sure that the security posture
doesn’t degrade over time by having auditing and monitoring compliance. For information about managing
security standards of an organization, see governance, risk, and compliance (GRC).

In this section
ARTICLE | DESCRIPTION
Team roles and responsibilities | Define clear lines of responsibility and separation of duties.
Reference model: Segmentation | Reference model and strategies of how the functions and teams can be segmented.
Management groups and permissions | Strategies using management groups to manage resources across multiple subscriptions consistently and efficiently.
Regulatory compliance | Guidance on standards published by law, authorities, and regulators.
Team roles and responsibilities
11/2/2020 • 2 minutes to read

To avoid confusion that can lead to human and automation errors that create security risk, define clear lines of
responsibility and separation of duties.
Do the teams have a clear view on responsibilities and individual/group access levels?

Designate groups (or individual roles) that will be responsible for central functions, such as network, policy
management and so on.
Document the responsibilities of each group and share contacts to facilitate communication.

CENTRAL FUNCTION | RESPONSIBILITY | TECHNICAL TEAM
Network security | Configuration and maintenance of Azure Firewall, Network Virtual Appliances (and associated routing), Web Application Firewall (WAF), Network Security Groups, Application Security Groups (ASG), and other cross-network traffic controls. | Network security team.
Network operations | Enterprise-wide virtual network and subnet allocation. | Network operations team.
IT operations, Security | Server endpoint security, including monitoring and remediating server security. This includes tasks such as patching, configuration, endpoint security, and so on. | IT operations and/or security team.
Security operations | Incident monitoring and response to investigate and remediate security incidents in Security Information and Event Management (SIEM) or a source console such as Azure Security Center or Azure AD Identity Protection. | Security operations team.
Policy management | Apply governance based on risk analysis and compliance requirements. Set direction for use of Role-Based Access Control (RBAC), Azure Security Center, administrator protection strategy, and Azure Policy to govern Azure resources. | Governance and architecture teams.
Identity Security and Standards | Set direction for Azure AD directories, PIM/PAM usage, MFA, password/synchronization configuration, and Application Identity Standards. | Security and identity teams.

Application teams should include the central functions listed in the preceding table. Application roles and
responsibilities should cover the different access levels of each operational function. For example, publish production
release, access customer data, manipulate database records, and so on.

Next steps
Start with this reference model and adapt it to your organization’s needs. This model shows how functions,
resources, and teams can be segmented.
Segmentation reference model
Segmentation strategies
11/2/2020 • 3 minutes to read

Segmentation refers to the isolation of resources from other parts of the organization. It's an effective way of
detecting and containing adversary movements.
One approach to segmentation is network-only isolation. This approach isn't recommended on its own, because the
different technical teams end up not aligned with the overall business use cases and application workloads. One
outcome is complexity, especially with on-premises networks, which leads to broad network firewall exceptions. While
network control should be considered as one of the strategies, it must be part of a unified segmentation strategy.
An effective segmentation strategy will guide all technical teams (IT, security, applications) to consistently isolate
access using networking, applications, identity, and any other access controls. The strategy should aim to:
Minimize operational friction by aligning to business practices and applications.
Contain risk by adding cost to attackers. This is done by:
Isolating sensitive workloads from compromise of other assets
Isolating high exposure systems from being used as a pivot to other systems
Monitor operations that can lead to potential violations of the integrity of the segments (for example, account
usage, unexpected traffic).
Here are some recommendations for creating a unified strategy.
Ensure alignment of technical teams to a single strategy that is based on assessing business risks.
Establish a modern perimeter based on zero-trust principles focused on identity, device, applications, and other
signals. This overcomes the limitations of network controls in protecting new resources and defending against new attack types.
Reinforce network controls for legacy applications by exploring microsegmentation strategies.

Reference model
Start with this reference model and adapt it to your organization’s needs. This model shows how functions,
resources, and teams can be segmented.

Example segments
Consider isolating shared and individual resources as shown in the reference model.
Core services segment
This segment hosts shared services utilized across the organization. These shared services typically include Active
Directory Domain Services, DNS/DHCP, System Management Tools hosted on Azure Infrastructure as a Service
(IaaS) virtual machines.
Individual segments
There are other segments that can contain resources based on some criteria. For instance, resources used by a
workload application can be contained in a separate segment. Another approach is to segment by lifecycle stage:
development, test, and production. Some resources might intersect; for example, applications can use virtual networks
that are dedicated to lifecycle stages.
Functions and teams
These are the main functions for this reference model. The mapping between central functions, responsibilities, and
teams is described in Team roles and responsibilities.

FUNCTION | SCOPE | RESPONSIBILITY
Policy management (Core and individual segments) | Some or all resources. | Monitor and enforce compliance with external (or internal) regulations, standards, and security policy; assign appropriate permissions to those roles.
Central IT operations (Core) | Across all resources. | Grant permissions to the central IT department (often the infrastructure team) to create, modify, and delete resources like virtual machines and storage.
Central networking group (Core and individual segments) | All network resources. | To ensure consistency and avoid technical conflicts, assign network resource responsibilities to a single central networking organization. These resources should include virtual networks, subnets, Network Security Groups (NSG), and the virtual machines hosting virtual network appliances.
Resource role permissions (Core) | - | For most core services, the administrative privileges required are granted through the application (Active Directory, DNS/DHCP, System Management Tools), so no additional Azure resource permissions are required. If your organizational model requires these teams to manage their own VMs, storage, or other Azure resources, you can assign these permissions to those roles.
Security operations (Core and individual segments) | All resources. | Assess risk factors, identify potential mitigations, and advise organizational stakeholders who accept the risk.
IT operations (individual segments) | All resources. | Grant permission to create, modify, and delete resources. The purpose of the segment (and resulting permissions) will depend on your organization structure. Segments with resources managed by a centralized IT organization can grant the central IT department (often the infrastructure team) permission to modify these resources. Segments managed by independent business units or functions (such as a Human Resources IT team) can grant those teams permission to all resources in the segment. Segments with autonomous DevOps teams don't need to grant permissions across all resources because the resource role grants permissions to application teams. For emergencies, use the service admin account (break-glass account).
Service admin (Core and individual segments) | - | Use the service admin role only for emergencies (and initial setup if required). Do not use this role for daily tasks.

See also
Management groups

Next steps
Start with this reference model and manage resources across multiple subscriptions consistently and efficiently
with management groups.
Management groups
Security management groups
11/2/2020 • 5 minutes to read

Management groups can manage resources across multiple subscriptions consistently and efficiently. However,
due to their flexibility, your design can become complex and compromise security and operations.
The recommendations in this article are based on an example reference model. Start with this model and adapt it
to your organization’s needs. For more information, see Enterprise segmentation strategy.

Support your segmentation strategy with management groups


It's recommended that you align the top level of management groups with a simple enterprise segmentation
strategy and limit the levels to no more than two.
In the example reference, there are enterprise-wide resources used by all segments, a set of core shared
services, and additional segments for each workload.
The recommended approach is to use these management groups, as sketched in the example after this list:
Root management group for enterprise-wide resources.
Use the root management group to include identities that have the requirement to apply policies across
every resource. For example, regulatory requirements, such as restrictions related to data sovereignty. This
group is effective for applying policies, permissions, and tags across all subscriptions.
Management group for each workload segment.
Use a separate management group for teams with limited scope of responsibility. This group is typically
required because of organizational boundaries or regulatory requirements.
Root or segment management group for the core set of services.
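As a minimal sketch of this hierarchy, the following Azure CLI commands create a core-services group and a workload segment group under the tenant root management group; all group names shown here are hypothetical placeholders.

# Create a management group for shared core services (the parent defaults to the tenant root group)
az account management-group create --name contoso-core-services --display-name "Core Services"

# Create a management group for a workload segment
az account management-group create --name contoso-segment-hr --display-name "HR Segment"

# Nest an additional group under a segment only if a second level is genuinely required
az account management-group create --name contoso-segment-hr-prod --display-name "HR Production" --parent contoso-segment-hr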

Azure role-based access control (RBAC) role assignment


Grant roles the appropriate permissions, starting with least privilege, and add more based on your operational needs.
Provide clear guidance to the technical teams that implement permissions. This clarity makes it easier to detect
and correct human errors such as overpermissioning.
Assign permissions at the management group level for the segment rather than to the individual subscriptions. This
will drive consistency and ensure application to future subscriptions.
Consider the built-in roles before creating custom roles to grant the appropriate permissions to VMs and
other objects.
Security managers group membership may be appropriate for smaller teams/organizations where
security teams have extensive operational responsibilities.
When assigning permissions for a segment, consider consistency while allowing flexibility to accommodate
several organizational models. These models can range from a single centralized IT group to mostly independent
IT and DevOps teams.
The reference model has several groups assigned for core services and workload segments. Here are some other
considerations when assigning permissions for the common teams across those segments.
Security team's visibility
Grant read-only access to security attributes for all technical environments. Assign security teams the
Security Reader permission, which provides the access needed to assess risk factors and identify potential
mitigations, without providing access to the data.
You can assign this permission at one of the following scopes (an example follows the list):
Root management group – for teams responsible for assessing and reporting risk on all resources.
Segment management group(s) – for teams with limited scope of responsibility. This group is typically required
because of organizational boundaries or regulatory requirements.
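For example, assuming the security team is represented by an Azure AD group, a read-only assignment at a management group scope might look like the following Azure CLI sketch; the group object ID and management group name are placeholders.

# Grant the security team's Azure AD group read-only access to security state at a management group scope
az role assignment create \
  --assignee "<security-team-group-object-id>" \
  --role "Security Reader" \
  --scope "/providers/Microsoft.Management/managementGroups/<management-group-name>"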

IMPORTANT
Treat security teams as critical impact accounts and apply the same protections as administrators.

Policy management across some or all resources


Assign appropriate permission to roles that monitor and enforce compliance with external (or internal)
regulations, standards, and security policy. The roles and permissions you choose will depend on the
organizational culture and expectations of the policy program.
The reference model assigns Read Contributor permissions to roles related to policy management.
Central IT operations across all resources
Grant permissions to the central IT department (often the infrastructure team) to create, modify, and delete
resources like virtual machines and storage. Contributor or Owner roles are appropriate for this function.
Central networking group across network resources
Assign network resource responsibilities to a single central networking organization. The Network Contributor
role is appropriate for this group.
IT Operations across all resources
Grant permission to create, modify, and delete resources that belong to a segment. The purpose of the segment
(and resulting permissions) will depend on your organization structure.
Segments with resources managed by a centralized IT organization can grant the central IT department (often
the infrastructure team) permission to modify these resources.
Segments managed by independent business units or functions (such as a Human Resources IT Team) can
grant those teams permission to all resources in the segment.
Segments with autonomous DevOps teams don’t need to grant permissions across all resources because the
resource role (below) grants permissions to application teams. For emergencies, use the service admin account
(break-glass account).
Resource role permissions
For most core services, administrative privileges required to manage them are granted through the application
(Active Directory, DNS/DHCP, System Management Tools), so no additional Azure resource permissions are
required. If your organizational model requires these teams to manage their own VMs, storage, or other Azure
resources, you can assign these permissions to those roles.
Workload segments with autonomous DevOps teams will manage the resources associated with each application.
The actual roles and their permissions depend on the application size and complexity, the application team size
and complexity, and the culture of the organization and application team.
Break-glass account
Use the Service Administrator role only for emergencies and initial setup. Do not use this role for daily tasks.
Use root management group with caution
Use the root management group to drive consistency across the enterprise by applying policies, permissions, and
tags across all subscriptions. This group can affect all resources on Azure and potentially cause downtime or
other negative impacts.
Select only enterprise-wide elements that have a clear requirement to be applied across every resource. These
requirements could be for regulatory reasons. Also, select elements that have near-zero negative impact on
operations, for example, policy with audit effect, tag assignment, or RBAC permission assignments that have been
carefully reviewed.
Use a dedicated service principal name (SPN) to execute management group management operations,
subscription management operations, and role assignment. SPN reduces the number of users who have elevated
rights and follows least-privilege guidelines. Assign the User Access Administrator at the root management
group scope (/) to grant the SPN just mentioned access at the root level. After the SPN is granted permissions, the
User Access Administrator role can be safely removed. In this way, only the SPN is part of the User Access
Administrator role.
Assign Contributor permission to the SPN, which allows tenant-level operations. This permission level ensures
that the SPN can be used to deploy and manage resources to any subscription within your organization.
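A hedged Azure CLI sketch of these assignments is shown below. It assumes the signed-in administrator has already elevated access to the tenant root scope, and the SPN's application ID and user principal name are placeholders.

# Grant the dedicated SPN User Access Administrator at the root scope (/)
az role assignment create --assignee "<spn-app-id>" --role "User Access Administrator" --scope "/"

# Grant the SPN Contributor so it can deploy and manage resources in any subscription
az role assignment create --assignee "<spn-app-id>" --role "Contributor" --scope "/"

# After the SPN is in place, remove the interactive user's User Access Administrator assignment at root
az role assignment delete --assignee "<user-principal-name>" --role "User Access Administrator" --scope "/"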
Limit the number of Azure Policy assignments made at the root management group scope (/). This limitation
minimizes debugging inherited policies in lower-level management groups.
Don't create any subscriptions under the root management group. This hierarchy ensures that subscriptions don't
inherit only the small set of Azure policies assigned at the root-level management group, which doesn't represent
the full set necessary for a workload.

IMPORTANT
Test all enterprise-wide changes on the root management group before applying them (policy, tags, RBAC model, and so on). You
can use a test lab, which can be a representative lab tenant or a lab segment in a production tenant. Another option is to use a
production pilot, which can be a segment management group or a designated subset of subscriptions in a management group.
Validate changes to make sure the requirements have the desired effect.
Administrative account security
11/13/2020 • 9 minutes to read

Administration is the practice of monitoring, maintaining, and operating Information Technology (IT) systems to
meet service levels that the business requires. Administration introduces some of the highest impact security risks
because performing these tasks requires privileged access to a very broad set of these systems and applications.
Attackers know that gaining access to an account with administrative privileges can get them access to most or all
of the data they would target, making the security of administration one of the most critical security areas.
As an example, Microsoft makes significant investments in protection and training of administrators for our cloud
systems and IT systems:

Microsoft's recommended core strategy for administrative privileges is to use the available controls to reduce risk.
Reduce risk exposure (scope and time) – The principle of least privilege is best accomplished with modern
controls that provide privileges on demand. This helps limit risk by limiting administrative privilege exposure by:
Scope – Just Enough Access (JEA) provides only the required privileges for the administrative operation
required (vs. having direct and immediate privileges to many or all systems at a time, which is almost never
required).
Time – Just in Time (JIT) approaches provide the required privileges as they are needed.
Mitigate the remaining risks – Use a combination of preventive and detective controls to reduce risks
such as isolating administrator accounts from the most common risks (phishing and general web browsing),
simplifying and optimizing their workflow, increasing assurance of authentication decisions, and identifying
anomalies from normal baseline behavior that can be blocked or investigated.
Microsoft has captured and documented best practices for protecting administrative accounts and published
prioritized roadmaps for protecting privileged access that can be used as references for prioritizing mitigations for
accounts with privileged access.
Securing Privileged Access (SPA) roadmap for administrators of on-premises Active Directory
Guidance for securing administrators of Azure Active Directory

Minimize number of critical impact admins


Grant privileges that can have a critical business impact to the fewest number of accounts.
Each admin account represents potential attack surface that an attacker can target, so minimizing the number of
accounts with that privilege helps limit the overall organizational risk. Experience has taught us that membership of
these privileged groups grows naturally over time as people change roles if membership is not actively limited and
managed.
We recommend an approach that reduces this attack surface risk while ensuring business continuity in case
something happens to an administrator (a command for reviewing membership is sketched after this list):
Assign at least two accounts to the privileged group for business continuity
When two or more accounts are required, provide justification for each member including the original two
Regularly review membership & justification for each group member

Managed accounts for admins


Ensure all critical impact admin accounts are managed by the enterprise directory to follow organizational policy
enforcement.
Consumer accounts, such as Microsoft accounts like @hotmail.com, @live.com, and @outlook.com, don't offer
sufficient security visibility and control to ensure the organization's policies and any regulatory requirements are
being followed. Because Azure deployments often start small and informally before growing into enterprise-
managed tenants, some consumer accounts remain as administrative accounts long afterward (for example, those of
the original Azure project managers), creating blind spots and potential risks.

Separate accounts for admins


Ensure all critical impact admins have a separate account for administrative tasks (vs the account they use for
email, web browsing, and other productivity tasks).
Phishing and web browser attacks represent the most common attack vectors to compromise accounts, including
administrative accounts.
Create a separate administrative account for all users that have a role requiring critical privileges. For these
administrative accounts, block productivity tools like Office 365 email (remove license). If possible, block arbitrary
web browsing (with proxy and/or application controls) while allowing exceptions for browsing to the Azure portal
and other sites required for administrative tasks.

No standing access / Just in Time privileges


Avoid providing permanent "standing" access for any critical impact accounts
Permanent privileges increase business risk by increasing the time an attacker can use the account to do damage.
Temporary privileges force attackers targeting an account to either work within the limited times the admin is
already using the account or to initiate privilege elevation (which increases their chance of being detected and
removed from the environment).
Grant privileges only as required, using one of these methods:
Just in Time - Enable Azure AD Privileged Identity Management (PIM) or a third party solution to require
following an approval workflow to obtain privileges for critical impact accounts
Break glass – For rarely used accounts, follow an emergency access process to gain access to the accounts.
This is preferred for privileges that have little need for regular operational usage like members of global
admin accounts.

Emergency access or 'Break Glass' accounts


Ensure you have a mechanism for obtaining administrative access in case of an emergency
While rare, sometimes extreme circumstances arise where all normal means of administrative access are
unavailable.
We recommend following the instructions at Managing emergency access administrative accounts in Azure AD and
ensure that security operations monitor these accounts carefully.

Admin workstation security


Ensure critical impact admins use a workstation with elevated security protections and monitoring
Attack vectors that use browsing and email like phishing are cheap and common. Isolating critical impact admins
from these risks will significantly lower your risk of a major incident where one of these accounts is compromised
and used to materially damage your business or mission.
Choose the level of admin workstation security based on the options available at https://aka.ms/securedworkstation
Highly Secure Productivity Device (Enhanced Security Workstation or Specialized Workstation)
You can start this security journey for critical impact admins by providing them with a higher security
workstation that still allows for general browsing and productivity tasks. Using this as an interim step helps
ease the transition to fully isolated workstations for both the critical impact admins as well as the IT staff
supporting these users and their workstations.
Privileged Access Workstation (Specialized Workstation or Secured Workstation)
These configurations represent the ideal security state for critical impact admins as they heavily restrict
access to phishing, browser, and productivity application attack vectors. These workstations don't allow
general internet browsing, only allow browser access to Azure portal and other administrative sites.

Critical impact admin dependencies – Account/Workstation


Carefully choose the on-premises security dependencies for critical impact accounts and their workstations
To contain the risk of a major on-premises incident spilling over to become a major compromise of cloud assets,
you must eliminate or minimize the means of control that on-premises resources have over critical impact accounts in
the cloud. As an example, attackers who compromise the on premises Active Directory can access and compromise
cloud-based assets that rely on those accounts like resources in Azure, Amazon Web Services (AWS), ServiceNow,
and so on. Attackers can also use workstations joined to those on premises domains to gain access to accounts and
services managed from them.
Choose the level of isolation from on-premises means of control (also known as security dependencies) for critical
impact accounts:
User Accounts – Choose where to host the critical impact accounts
Native Azure AD accounts – Create native Azure AD accounts that are not synchronized with on-
premises Active Directory.
Synchronize from on-premises Active Directory – Use existing accounts hosted in the on-premises
Active Directory.
Workstations – Choose how you will manage and secure the workstations used by critical admin accounts:
Native cloud management and security (recommended): Join workstations to Azure AD and
manage/patch them with Intune or other cloud services. Protect and monitor with Microsoft
Defender ATP or another cloud service not managed by on-premises accounts.
Manage with existing systems: Join the existing AD domain and use existing management/security tooling.

Passwordless or multi-factor authentication for admins


Require all critical impact admins to use passwordless authentication or multi-factor authentication (MFA).
Attack methods have evolved to the point where passwords alone cannot reliably protect an account. This is well
documented in a Microsoft Ignite Session.
Administrative accounts and all critical accounts should use one of the following methods of authentication. These
capabilities are listed in preference order from highest cost/difficulty to attack (strongest/preferred options) to lowest
cost/difficulty to attack:
Passwordless (such as Windows Hello)
https://aka.ms/HelloForBusiness
Passwordless (Authenticator App)
/azure/active-directory/authentication/howto-authentication-phone-sign-in
Multifactor Authentication
/azure/active-directory/authentication/howto-mfa-userstates
Note that SMS text message-based MFA has become very inexpensive for attackers to bypass, so we recommend
you avoid relying on it. This option is still stronger than passwords alone, but is much weaker than other MFA
options.

Enforce conditional access for admins - Zero Trust


Authentication for all admins and other critical impact accounts should include measurement and enforcement of
key security attributes to support a Zero Trust strategy.
Attackers compromising Azure Admin accounts can cause significant harm. Conditional Access can significantly
reduce that risk by enforcing security hygiene before allowing access to Azure management.
Configure Conditional Access policy for Azure management that meets your organization's risk appetite and
operational needs.
Require Multifactor Authentication and/or connection from designated work network
Require Device integrity with Microsoft Defender ATP (Strong Assurance)

Avoid granular and custom permissions


Avoid permissions that specifically reference individual resources or users
Specific permissions create unneeded complexity and confusion because the intent doesn't carry over to new, similar
resources. This then accumulates into a complex legacy configuration that is difficult to maintain or change without
fear of "breaking something", negatively impacting both security and solution agility.
Instead of assigning resource-specific permissions, use either:
Management groups for enterprise-wide permissions
Resource groups for permissions within subscriptions
Instead of granting permissions to specific users, assign access to groups in Azure AD. If there isn't an appropriate
group, work with the identity team to create one. This allows you to add and remove group members outside of
Azure and ensure permissions are current, while also allowing the group to be used for other purposes such as
mailing lists.
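For example, rather than granting a user Contributor on a single VM, the following Azure CLI sketch assigns the role to an Azure AD group at the resource group scope; the group object ID and resource group name are hypothetical.

# Assign Contributor to an Azure AD group at resource group scope instead of per-user, per-resource permissions
az role assignment create \
  --assignee "<app-team-group-object-id>" \
  --role "Contributor" \
  --resource-group "app1-prod-rg"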

Use built-in roles


Use built-in roles for assigning permissions where possible.
Customization leads to complexity that increases confusion and makes automation more complex, challenging, and
fragile. These factors all negatively impact security.
We recommend that you evaluate the built-in roles designed to cover most normal scenarios. Custom roles are a
powerful and sometimes useful capability, but they should be reserved for cases when built-in roles won't work.
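Before creating a custom role, you can inspect the built-in roles and audit any existing custom roles, for example with the Azure CLI (the role name shown is just one built-in role used for illustration):

# Review the actions granted by a built-in role before deciding a custom role is needed
az role definition list --name "Virtual Machine Contributor" --output json

# List existing custom roles so they can be reviewed and retired where a built-in role now fits
az role definition list --custom-role-only true --output table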

Establish lifecycle management for critical impact accounts


Ensure you have a process for disabling or deleting administrative accounts when admin personnel leave the
organization (or leave administrative positions)
See Regularly review critical access for more details.

Attack simulation for critical impact accounts


Regularly simulate attacks against administrative users with current attack techniques to educate and empower
them.
People are a critical part of your defense, especially your personnel with access to critical impact accounts. Ensuring
these users (and ideally all users) have the knowledge and skills to avoid and resist attacks will reduce your overall
organizational risk.
You can use Office 365 Attack Simulation capabilities or any number of third party offerings.
Identity management
11/2/2020 • 5 minutes to read

Most architectures have shared services that are hosted and accessed across networks. Those services share
common infrastructure and users need to access resources and data from anywhere. For such architectures,
relying on just network security controls isn't enough.
Provide security assurance through identity management: the process of authenticating and authorizing security
principals. Use identity management services to authenticate and grant permission to users, partners, customers,
applications, services, and other entities. In Azure, use Azure Active Directory (AD), Azure AD B2B, Azure AD B2C.
How are you implementing Zero-Trust in your design?

Support a single enterprise directory.


Keep the cloud and on-premises directories synchronized, except for high-privilege accounts.
Enforce and measure key security attributes when authenticating all users, especially for high-privilege
accounts.
Have a separate identity source for non-employees.
Use passwordless and multi-factor authentication.
Block legacy protocols and authentication methods.
The recommendations in this article can be implemented through the use of Azure AD.

Use a single enterprise directory


Have a single enterprise directory for managing identities of full-time employees and enterprise resources. For
example, a single Azure AD directory instance can serve as the authoritative source for corporate and
organizational accounts.
This single-source approach increases clarity and consistency for all roles in IT and Security teams. If changes are
needed, teams only need to change the directory in one place. Another benefit is single sign-on
(SSO): by signing in just once with a single user account, users get access to all the applications and
resources the business allows. Users don't have to manage multiple sets of usernames and passwords, and
you can provision or de-provision application access automatically. For more information, see Single sign-on.
The Integrate on-premises Active Directory domains with Azure Active Directory reference architecture integrates
on-premises AD domains with Azure AD to provide cloud-based identity authentication.

Use role-based access control (RBAC)


Use Azure RBAC to control the level of access for users. Assign built-in roles to users, groups, service principals,
and managed identities.
For more information, see Azure built-in roles.

Synchronize the hybrid identity systems


Keep your cloud identity synchronized with the existing identity systems to ensure consistency and reduce human
errors.
Consider using Azure AD Connect for synchronizing Azure AD with your existing on-premises directory. For
migration projects, complete this task before Azure migration and development projects begin.

IMPORTANT
Don’t synchronize high-privilege accounts to an on-premises directory. If an attacker gets full control of on-premises assets,
they can compromise a cloud account. This strategy will limit the scope of an incident. For more information, see Critical
impact account dependencies.
Synchronization of these accounts is blocked in the default Azure AD Connect configuration. Make sure that you haven’t
customized this configuration. For information about filtering in Azure AD, see Azure AD Connect sync: Configure filtering

For more information, see hybrid identity providers.

Use a separate identity source for third-party accounts


Avoid hosting non-employee accounts in a corporate directory.
Use cloud identity services that are designed to host third-party accounts. Grant the appropriate level of access to
those external entities instead of the default permissions that are given to full-time employees. This differentiation
helps prevent and detect attacks from these vectors. It also reduces the burden of managing these identities
internally.
For example, these capabilities natively integrate into the same Azure AD identity and permission model used by
Azure and Microsoft 365.
Azure AD – Employees and Enterprise Resources
Azure AD B2B – Partners
Azure AD B2C – Customers, citizens

Use modern password protection


Adopt passwordless authentication and multi-factor authentication (MFA) over time. You can also reduce use of
passwords through the use of Managed Identities.
This recommendation should be enforced for high-privilege accounts, such as administrators. For more
information, see Passwordless Or multi-factor authentication for admins. MFA adds a critical second layer of
security through phone calls, text messages, and so on. For more information, see What is Azure Multi-Factor
Authentication.
Password Guidance
NIST guidance
Cloud identity providers manage large volumes of logons and can detect anomalies effectively. They use various
data sources to proactively notify companies if their users’ passwords have been found in other breaches. Also,
they can validate any given sign-on appears legitimate and isn't coming from an unexpected or known-malicious
host.
Synchronizing passwords to the cloud to support these checks also adds resiliency during some attacks. You should
be able to continue business operations when password hashes are synced to Azure AD.
For information, see Configure Azure AD Connect to synchronize password hashes.

Disable legacy authentication methods


Don't use legacy protocols for internet-facing services. For Azure AD-based accounts, configure Conditional Access
to block legacy protocols.
Legacy authentication methods aren't effective in deterring password spraying, dictionary, or brute force attacks.
These protocols also lack other attack countermeasures, such as account lockouts or back-off timers. Services
running on Azure that block legacy protocols have observed a 66% reduction in successful account
compromises.
Disabling legacy authentication can be challenging. Users might not want to move to new client software with
modern authentication methods. Gradually phase out legacy authentication. Start by using metrics and logging
from your authentication provider to determine the number of users who still authenticate with old clients. Next,
disable any down-level protocols that aren't used, and set up conditional access for all users who aren’t using
legacy protocols. Finally, give plenty of notice and guidance to users about upgrading before blocking legacy
authentication for all users and on all services at a protocol level.
For more information, see:
What is Conditional Access?
Manage access to Azure management with Conditional Access

Use cross-platform credential management


Use a common identity provider, such as Azure Active Directory (AD), for authenticating all platforms (Windows,
Linux, and others) and cloud services. Azure AD can authenticate on Windows, Linux, Azure, Microsoft 365,
Amazon Web Services (AWS), Google Services, (remote access to) legacy on-premises applications, and third-
party Software as a Service providers.

Related Links
Five steps to securing your identity infrastructure

Prefer Identity Authentication over Keys


Always authenticate with identity services rather than cryptographic keys when available.
Managing keys securely with application code is difficult and regularly leads to mistakes like accidentally
publishing sensitive access keys to code repositories like GitHub. Identity systems offer secure and usable
experience for access control with built-in sophisticated mechanisms for key rotation, monitoring for anomalies,
and more. Most organizations also have skilled teams dedicated to managing identity systems and few (if any)
people actively managing key security systems.
For services that offer Azure AD authentication, such as Azure Storage, Azure App Service, and Azure Backup, use it for
authentication and authorization. To further simplify using identities for developers, you can also take advantage
of managed identities to assign identities to resources like VMs and App Services so that developers don’t have to
manage identities within the application.
For multitenant best practices, see Manage identity in multitenant applications.
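As an illustrative sketch, the Azure CLI commands below assign a managed identity to a VM, grant it data-plane access to a storage account, and then access blobs with Azure AD credentials instead of account keys; all names and IDs are placeholders.

# Enable a system-assigned managed identity on a VM
az vm identity assign --resource-group "app1-rg" --name "app1-vm"

# Grant that identity read access to blob data instead of sharing storage account keys
az role assignment create \
  --assignee "<vm-managed-identity-principal-id>" \
  --role "Storage Blob Data Reader" \
  --scope "<storage-account-resource-id>"

# Access data with Azure AD authentication rather than an account key
az storage blob list --account-name "app1storage" --container-name "logs" --auth-mode login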
Network security review
11/2/2020 • 4 minutes to read

Assess your workload in areas such as network boundary security, network security, database security, data storage
security, identity management, and operational security.
How do you implement DDoS protection?

Risk : Potential of smaller-scale attack that doesn't trip the platform-level protection.
Azure Virtual Network resources offer Basic and Standard tiers of DDoS protection. Enable DDoS Protection
Standard for all business-critical web applications and services.
The Windows N-tier application on Azure with SQL Server reference architecture uses DDoS Protection Standard
because this option:
Uses adaptive tuning, based on the application's network traffic patterns, to detect threats.
Guarantees 100% SLA.
Can be cost effective. For example, during a DDoS attack, the first set of attacks cause the provisioned resources
to scale out. For a resource such as a virtual machine scale set, 10 machines can grow to 100, increasing overall
costs. With Standard protection, you don't have to worry about the cost of the scaled resources because Azure
will provide the cost credit.
For information about Standard DDoS Protection, see Azure DDoS Protection Service.
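A minimal sketch of enabling DDoS Protection Standard on a virtual network with the Azure CLI; the resource group, plan, and network names are placeholders.

# Create a DDoS protection plan
az network ddos-protection create --resource-group "network-rg" --name "ddos-plan"

# Associate the plan with a virtual network and enable protection
az network vnet update \
  --resource-group "network-rg" \
  --name "hub-vnet" \
  --ddos-protection-plan "ddos-plan" \
  --ddos-protection true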
How do you configure public IPs for which traffic is passed in, and how and where it's translated?

Risk : Inability to provision VMs with private IP addresses for protection.


Use Azure Firewall for built-in high availability and unrestricted cloud scalability.
Utilize Azure public IP addresses to determine which traffic is passed in, and how and where it's translated onto the
virtual network.
How do you isolate network traffic in Azure?

Risk : Inability to ensure VMs and communication between them remains private within a network boundary.
Use Azure Virtual Network to allow VMs to securely communicate with each other, the internet, and on-premises
networks. Otherwise, communication between VMs may not be fully private within a network boundary.
How do you configure traffic flow between multiple application tiers?

Risk : Inability to define different access policies based on the workload types, and to control traffic flows between
them.
You may want to define different access policies based on the workload types and control flow between them. Use
Azure Virtual Network Subnet to allocate separate address spaces for different elements or tiers within the
workload. Then, define different access policies, and control traffic flows between those tiers.
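For example, the following Azure CLI sketch carves a virtual network into separate web-tier and data-tier subnets so that each tier can receive its own access policies; the address ranges and names are hypothetical.

# Create a virtual network with a subnet for the web tier
az network vnet create \
  --resource-group "app1-rg" \
  --name "app1-vnet" \
  --address-prefixes 10.0.0.0/16 \
  --subnet-name "web-tier" \
  --subnet-prefixes 10.0.1.0/24

# Add a separate subnet for the data tier
az network vnet subnet create \
  --resource-group "app1-rg" \
  --vnet-name "app1-vnet" \
  --name "data-tier" \
  --address-prefixes 10.0.2.0/24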
How do you route network traffic through security appliances for security boundary policy
enforcement, auditing, and inspection?

Risk : Inability to define communication paths between different tiers within a network boundary.
Use Azure Virtual Network User Defined Routes (UDR) to control next hop for traffic between Azure, on-premises,
and Internet resources through virtual appliance, virtual network gateway, virtual network, or internet.
Do you use firewalls, load balancers, and Network Intrusion Detection/Network Intrusion Prevention
(NIDS/NIPS)?

Risk : Possibility of not being able to select comprehensive solutions for secure network boundaries.
Use Network Appliances from Azure Marketplace to deploy a variety of preconfigured network virtual appliances.
Utilize Application Gateway WAF to detect and protect against common web attacks.
How do you segment the larger address space into subnets?

Risk : Inability to allow or deny inbound network traffic to, or outbound network traffic from, within larger network
space.
Use network security groups (NSGs) to allow or deny traffic to and from single IP address, to and from multiple IP
addresses, or even to and from entire subnets.
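As a sketch, the following Azure CLI commands create an NSG that allows only web-tier traffic to reach the data-tier subnet on SQL port 1433 and denies other inbound traffic. The names and address ranges are placeholders and assume the subnets from the earlier example.

# Create an NSG for the data tier
az network nsg create --resource-group "app1-rg" --name "data-tier-nsg"

# Allow SQL traffic only from the web-tier subnet
az network nsg rule create \
  --resource-group "app1-rg" --nsg-name "data-tier-nsg" \
  --name "AllowWebToSql" --priority 100 --direction Inbound --access Allow \
  --protocol Tcp --source-address-prefixes 10.0.1.0/24 --destination-port-ranges 1433

# Explicitly deny all other inbound traffic at the lowest custom priority
az network nsg rule create \
  --resource-group "app1-rg" --nsg-name "data-tier-nsg" \
  --name "DenyAllInbound" --priority 4096 --direction Inbound --access Deny \
  --protocol "*" --source-address-prefixes "*" --destination-port-ranges "*"

# Attach the NSG to the data-tier subnet
az network vnet subnet update \
  --resource-group "app1-rg" --vnet-name "app1-vnet" --name "data-tier" \
  --network-security-group "data-tier-nsg"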
How do you control routing behavior for VM connectivity?

Risk : Inability to customize the routing configuration.


Employ Azure Virtual Network User Defined Routes (UDR) to customize the routing configuration for deployments.
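A hedged Azure CLI sketch of a user-defined route that sends all outbound traffic from a subnet through a network virtual appliance; the appliance IP address and resource names are placeholders.

# Create a route table and a default route that points at a firewall or other network virtual appliance
az network route-table create --resource-group "network-rg" --name "fw-route-table"

az network route-table route create \
  --resource-group "network-rg" --route-table-name "fw-route-table" \
  --name "default-via-firewall" \
  --address-prefix 0.0.0.0/0 \
  --next-hop-type VirtualAppliance \
  --next-hop-ip-address 10.0.100.4

# Associate the route table with the subnet whose traffic should be inspected
az network vnet subnet update \
  --resource-group "app1-rg" --vnet-name "app1-vnet" --name "web-tier" \
  --route-table "fw-route-table"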
How do you implement forced tunneling?

Risk : Potential of outbound connections from any VM increasing attack surface area leveraged by attackers.
Utilize forced tunneling to ensure that connections to the internet go through corporate network security devices.
How do you implement enhanced levels of security - firewall, IPS/IDS, antivirus, vulnerability
management, botnet protection on top of network-level controls?

Risk : Inability to enable security at OSI model layers other than the network and transport layers.
Use Azure Marketplace to provision devices for higher levels of network security than with network-level access
controls.
How do you establish cross premises connectivity?

Risk : Potential of access to company’s information assets on-premises.


Use Azure site-to-site VPN or ExpressRoute to set up cross premises connectivity to on-premises networks.
How do you implement global load balancing?

Risk : Inability to make services available even when datacenters might become unavailable.
Utilize Azure Traffic Manager to load balance connections to services based on the location of the user and/or other
criteria
How do you disable RDP/SSH access to virtual machines?

Risk : Potential for attackers to use brute force techniques to gain access and launch other attacks.
Disable RDP/SSH access to Azure Virtual Machines and use VPN/ExpressRoute to access these virtual machines for
remote management.
How do you prevent, detect, and respond to threats?
Risk : Inability to have a single pane of visibility to prevent, detect, and respond to threats.
Employ Azure Security Center for increased visibility into, and control over, the security of Azure resources,
integrated security monitoring, and policy management across Azure subscriptions.
How do you monitor and diagnose conditions of the network?

Risk : Inability to understand, diagnose, and gain insights to the network in Azure.
Use Network Watcher to understand, diagnose, and gain insights to the network in Azure.
How do you gain access to real time performance information at the packet level?

Risk : Inability to investigate an issue in detail for better diagnoses.


Utilize packet capture to set alerts and gain access to real time performance information at the packet level.
How do you gather data for compliance, auditing, and monitoring the network security profile?

Risk : Inability to build a deeper understanding of the network traffic pattern.


Use network security group flow logs to gather data for compliance, auditing, and monitoring of your network
security profile.
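For example, NSG flow logs can be enabled through Network Watcher with the Azure CLI. The sketch below assumes Network Watcher is enabled in the region and that a storage account already exists for the logs; the names are placeholders.

# Enable NSG flow logs, writing the data to a storage account for auditing and analysis
az network watcher flow-log create \
  --location "eastus" \
  --resource-group "network-rg" \
  --name "data-tier-flowlog" \
  --nsg "data-tier-nsg" \
  --storage-account "<storage-account-name-or-id>" \
  --enabled true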
How do you diagnose VPN connectivity issues?

Risk : Inability to identify the issue and use the detailed logs for further investigation.
Use Network Watcher troubleshooter to diagnose the most common VPN gateway and connection issues.
Network security
11/2/2020 • 8 minutes to read

Protect assets by placing controls on network traffic originating in Azure, between on-premises and Azure hosted
resources, and traffic to and from Azure. If security measures aren't in place, attackers can gain access, for instance,
by scanning across public IP ranges. Proper network security controls can provide defense-in-depth elements that
help detect, contain, and stop attackers who gain entry into your cloud deployments.
Here are some recommendations for network security and containment:
Align network segmentation with overall strategy
Centralize network management and security
Build a network containment strategy
Define an internet edge strategy

Centralize network management and security


Centralize the management and security of core networking functions such as cross-premises links, virtual
networking, subnetting, and IP address schemes. This also applies to security elements such as virtual network
appliances, encryption of cloud virtual network activity and cross-premises traffic, network-based access controls,
and other traditional network security components.
Centralizing can reduce the potential for inconsistent strategies that a potential attacker can misuse. Also, the entire
organization benefits from the centralized network team’s expertise in network management, security knowledge,
and tooling.
Azure Security Center can be used to help centralize the management of network security.

Build a network segmentation strategy


At an organizational level, align your network segmentation strategy with the enterprise segmentation strategy to
have a unified strategy. This approach helps reduce human errors, and increases reliability through automation.
Here's an example of network architecture for a hybrid cloud infrastructure.
Evolve security beyond network controls
In modern architectures, mobile and other devices outside the network access various services for applications
over the internet or on cloud provider networks. Traditional network controls that are based on a trusted intranet
approach can’t effectively provide security assurances for those applications.
Combine network controls with application, identity, and other technical control types. This approach is effective in
preventing, detecting, and responding to threats outside the networks you control.
Ensure that resource grouping and administrative privileges align to the segmentation model.
Design security controls that identify and allow expected traffic, access requests, and application
communication between segments. Monitor the communication between segments. Use data to identify
anomalies, set alerts, or block traffic to mitigate the risk of attackers crossing segmentation boundaries.
For information, see Jericho Forum and ‘Zero Trust’ approaches.

Build a risk containment strategy


Containment of attack vectors within an environment is critical. Here are some considerations for creating a
containment strategy.
Use native capabilities provided by a cloud service provider, dynamic just-in-time (JIT) methods, and integrated
identity and password controls, such as those recommended by zero trust and continuous validation
approaches.
Make sure that there are network access controls between network constructs. These constructs can represent
virtual networks, or subnets within those virtual networks. This works to protect and contain East-West traffic
within your cloud network infrastructure.
Determine whether to use host-based firewalls. If host-based firewalls have been effective in helping you
protect and discover threats in the past, consider using them for your cloud-based resources. Otherwise,
explore native solutions on your cloud service provider’s platform.
Adopt a zero-trust strategy based on user, device, and application identities. Network access controls are based
on elements such as source and destination IP address, protocols, and port numbers. Adding zero trust enforces
and validates access control at access time. This strategy avoids the need to maintain complex combinations of
open ports, network routes, and allowed protocols for every type of change. Only the destination
resource needs to provide the necessary access controls.
These technology options can be used to set up network controls:
Azure Network Security Groups. Use for basic layer 3 and 4 access controls between Azure Virtual Networks,
their subnets, and the internet. A minimal example of creating such a rule is sketched after this list.
Azure Web Application Firewall and Azure Firewall. Use for more advanced network access controls that
require application layer support.
Local Administrator Password Solution (LAPS) or a third-party Privileged Access Management solution. Use to set strong local
admin passwords and provide just-in-time access to them.
Third parties offer microsegmentation approaches that may enhance network controls by applying zero-trust
principles to networks you control that contain legacy assets.
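As an illustration of the first option, the sketch below adds a least-privilege layer 3/4 rule to an existing Network Security Group with the Azure SDK for Python. The subscription, resource group, NSG name, and address prefixes are hypothetical placeholders, and exact SDK shapes can vary by package version, so treat this as a sketch rather than a definitive implementation.

```python
# Sketch: add a least-privilege inbound rule to an existing NSG.
# Assumes the azure-identity and azure-mgmt-network packages are installed;
# all names and address ranges below are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import SecurityRule

subscription_id = "<subscription-id>"
resource_group = "rg-app-network"      # hypothetical resource group
nsg_name = "nsg-web-tier"              # hypothetical NSG

client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)

# Allow only HTTPS from a known front-end subnet instead of an all-open rule.
rule = SecurityRule(
    protocol="Tcp",
    direction="Inbound",
    access="Allow",
    priority=200,
    source_address_prefix="10.0.1.0/24",        # hypothetical front-end subnet
    source_port_range="*",
    destination_address_prefix="10.0.2.0/24",   # hypothetical web-tier subnet
    destination_port_range="443",
)

poller = client.security_rules.begin_create_or_update(
    resource_group, nsg_name, "allow-https-from-frontend", rule
)
print(poller.result().provisioning_state)
```

Whatever tooling you use, the principle is the same: allow only the specific protocols, ports, and address ranges that each workload needs.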

Define an internet edge strategy


Decide whether to use native cloud service provider controls or virtual network appliances for internet edge
security.
Internet edge traffic (also known as North-South traffic) represents network connectivity between cloud resources
and the internet. Legacy workloads require protection from internet endpoints because they were built with the
assumption that an internet firewall would protect them. An internet edge strategy is intended to mitigate attacks from the internet and
detect or block threats.
There are two primary choices that can provide internet edge security controls and monitoring:
Cloud service provider native controls (Azure Firewall and Web Application Firewall (WAF)).
Partner virtual network appliances (firewall and WAF vendors available in Azure Marketplace). Partner
solutions typically cost more than native controls.
Native controls offer basic security and are likely sufficient for common attacks, such as the OWASP Top 10. Partner
capabilities can provide advanced features to protect against sophisticated (but typically uncommon) attacks.
Configuration of partner solutions can be complex and more fragile because they don't integrate with the cloud's
fabric controllers. From a cost perspective, native controls are cheaper than partner solutions.
The decision to use native as opposed to partner controls should be based on your organization’s experience and
requirements. If the features of the advanced firewall solutions don’t provide sufficient return on investment,
consider using the native capabilities that are easy to configure and scale.

Evaluate the use of legacy network security technology


Carefully plan your use of signature-based Network Intrusion Detection/Network Intrusion Prevention
(NIDS/NIPS) systems and network Data Leakage/Loss Prevention (DLP) as you adopt cloud applications and services.
IDS/IPS can generate a high number of false positive alerts. A well-tuned IDS/IPS system can be effective for classic
application architectures but doesn't work well for SaaS and PaaS architectures.
Network-based DLP is ineffective at identifying both inadvertent and deliberate data loss because most modern
protocols and attackers use network-level encryption for inbound and outbound communications. SSL-bridging
can provide an authorized man-in-the-middle entity that terminates and then reestablishes encrypted network
connections. However, this can also introduce privacy, security, and reliability challenges.
Keep reviewing and updating your network security strategy on these considerations as you migrate existing
workloads to Azure:
Cloud service providers filter malformed packets and common network layer attacks.
Many traditional NIDS/NIPS solutions use signature-based approaches per packet. These approaches can be easily evaded by
attackers and typically produce a high rate of false positives.
Ensure your IDS/IPS system(s) are providing positive value from alerts.
Measure alert quality by the percentage of real attack detections versus false alarms in the alerts raised
by the system.
Provide high-quality alerts to security analysts. Ideally, alerts should have a 90% true positive rate for
creating incidents in the primary queue to which triage investigation teams must respond. Lower-quality
alerts should go to proactive hunting exercises instead, to reduce analyst fatigue and burnout.
Adopt modern zero-trust identity approaches for protecting modern SaaS and PaaS applications. See Zero Trust
for more information.
For IaaS workloads, focus on network security solutions that provide per-network context rather than
per-packet/session context. Software-defined networks in the cloud are naturally instrumented and can achieve this
much more easily than on-premises equipment.
Favor solutions that use machine learning techniques across these large volumes of traffic. ML technology is far
superior to static/manual human analysis at rapidly identifying anomalies that could be attacker activity within
otherwise normal traffic patterns.

Design virtual network subnet security


Design virtual networks and subnets for growth.
Typically, you'll add more network resources as the design matures. This requires refactoring of IP addressing
and subnetting schemes to accommodate the extra resources. There is limited security value in creating many
small subnets and then trying to map network access controls (such as security groups) to each of them.
Plan your subnets based on roles and functions that use the same protocols. That way, you can add resources to the
subnet without making changes to the security groups that enforce network-level access controls.
Do not use all-open rules that allow inbound and outbound traffic to and from 0.0.0.0-255.255.255.255. Use a least
privilege approach and only allow relevant protocols. It will decrease your overall network attack surface on the
subnet. All-open rules provide a false sense of security because such rules enforce no security.
The exception is when you want to use security groups only for network logging purposes. We don't recommend
this option, but it is available if you have another network access control solution in place.
Azure Virtual Network subnets can be designed in this way, as sketched below.
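The following sketch, using the Azure SDK for Python, shows one way to plan role-based subnets in a virtual network sized for growth. The names and address ranges are hypothetical placeholders.

```python
# Sketch: create a virtual network with role-based subnets sized for growth.
# Assumes azure-identity and azure-mgmt-network are installed; names and
# address ranges are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import AddressSpace, Subnet, VirtualNetwork

subscription_id = "<subscription-id>"
resource_group = "rg-app-network"      # hypothetical resource group

client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)

vnet = VirtualNetwork(
    location="westus2",
    address_space=AddressSpace(address_prefixes=["10.0.0.0/16"]),  # room to grow
    subnets=[
        # One subnet per role so NSG rules stay stable as resources are added.
        Subnet(name="snet-web", address_prefix="10.0.1.0/24"),
        Subnet(name="snet-app", address_prefix="10.0.2.0/24"),
        Subnet(name="snet-data", address_prefix="10.0.3.0/24"),
    ],
)

result = client.virtual_networks.begin_create_or_update(
    resource_group, "vnet-workload", vnet
).result()
print([subnet.name for subnet in result.subnets])
```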

Mitigate DDoS attacks


Enable Distributed Denial of Service (DDoS) mitigations for all business-critical web applications and services. DDoS
attacks are common and are easily conducted by unsophisticated attackers. DDoS attacks can be debilitating. An
attack can completely block access to your services and even take down the services. Responsible cloud service
providers offer DDoS protection of services with varying effectiveness and capacity. They mostly provide two
options:
DDoS protection at the cloud network fabric level – all customers of the cloud service provider benefit from
these protections. The protection usually focuses on the network (layer 3) level.
DDoS protection at higher levels that profiles your services – this option establishes a baseline for your
deployments and then uses machine learning techniques to detect anomalous traffic and proactively protect
based on the protection level set, before service degradation occurs. Adopt the advanced protection for any services where
downtime will negatively impact the business.
For an example of advanced DDoS protection, see Azure DDoS Protection Service.
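As a hedged illustration of the second option on Azure, the sketch below creates a DDoS protection plan and associates it with an existing virtual network using the Azure SDK for Python. The resource names are placeholders and the property names may vary across SDK versions, so verify against the version you use.

```python
# Sketch: enable advanced DDoS protection on an existing virtual network.
# Assumes azure-identity and azure-mgmt-network are installed; names below
# are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import DdosProtectionPlan, SubResource

subscription_id = "<subscription-id>"
resource_group = "rg-app-network"

client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)

# 1. Create (or update) a DDoS protection plan.
plan = client.ddos_protection_plans.begin_create_or_update(
    resource_group, "ddos-plan", DdosProtectionPlan(location="westus2")
).result()

# 2. Associate the plan with an existing virtual network and enable protection.
vnet = client.virtual_networks.get(resource_group, "vnet-workload")
vnet.enable_ddos_protection = True
vnet.ddos_protection_plan = SubResource(id=plan.id)
client.virtual_networks.begin_create_or_update(
    resource_group, "vnet-workload", vnet
).result()
```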
Decide on an internet ingress/egress policy
Decide whether to route traffic for the internet through on-premises security devices (also called forced tunneling)
or allow internet connectivity through cloud-based network security devices.
For production enterprises, allow cloud resources to initiate and respond to internet requests directly through the cloud
network security devices defined by your internet edge strategy. This approach fits the Nth datacenter paradigm,
that is, Azure datacenters are part of your enterprise. It scales better for an enterprise deployment because it
removes hops that add load, latency, and cost.
Forced tunneling is achieved through a cross-premises WAN link. The goal is to provide the network security teams
with greater security and visibility into internet traffic. Even when your resources in the cloud try to respond to
incoming requests from the internet, the responses are force tunneled. Forced tunneling fits a
datacenter expansion paradigm and can work well for a quick proof of concept, but it scales poorly because of the
increased traffic load, latency, and cost. For those reasons, we recommend that you avoid forced tunneling.
Storage

Protecting data at rest is required to maintain confidentiality, integrity, and availability assurances across all
workloads. Storage in a cloud service like Azure is architected and implemented quite differently than on-premises
solutions to enable massive scaling, modern access through REST APIs, and isolation between tenants.
Granting access to Azure Storage is possible through Azure Active Directory (Azure AD) as well as key-based
authentication mechanisms (Symmetric Shared Key authentication, or Shared Access Signature (SAS)).
Storage in Azure includes a number of native security design attributes:
All data is encrypted by the service.
Data in the storage system cannot be read by a tenant if it has not been written by that tenant (to mitigate
the risk of cross tenant data leakage).
Data will remain only in the region you choose.
The system maintains three synchronous copies of data in the region you choose.
Detailed activity logging is available on an opt-in basis.
Additional security features can be configured such as a storage firewall to provide an additional layer of access
control as well as storage threat protection to detect anomalous access and activities.
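To make the access options concrete, here is a minimal sketch that reads a blob using Azure AD authentication through the Azure SDK for Python, which keeps account keys out of application code. The account, container, and blob names are hypothetical placeholders, and the caller is assumed to hold an appropriate data-plane role such as Storage Blob Data Reader.

```python
# Sketch: access Azure Blob Storage with Azure AD instead of shared account keys.
# Assumes azure-identity and azure-storage-blob are installed.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

account_url = "https://<storage-account>.blob.core.windows.net"   # placeholder
service = BlobServiceClient(account_url, credential=DefaultAzureCredential())

# Hypothetical container and blob names.
blob = service.get_blob_client(container="reports", blob="2020-11-summary.csv")
data = blob.download_blob().readall()
print(f"downloaded {len(data)} bytes")
```

Shared Access Signatures remain useful for delegating narrowly scoped, time-limited access, but prefer Azure AD wherever the client can hold its own identity.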
Applications and services

Applications and the data associated with them act as the primary store of business value on a cloud platform.
Applications can play a role in risks to the business because:
Business processes are encapsulated and executed by applications and services, which need to be available and
provided with high integrity.
Business data is stored and processed by application workloads and requires high assurances of
confidentiality, integrity, and availability.

In this section
Assessment: What aspects of the application do you need to protect?
Description: Understanding the hosting models and the security responsibility.

Assessment: Does the organization identify the highest severity threats to this workload through threat modeling?
Description: Identify risks to the application and risks it may pose to your enterprise through threat modeling.

Assessment: Do you have any regulatory or governance requirements?
Description: Guidance on standards published by law, authorities, and regulators.

Assessment: Are you exposing information through exception handling or HTTP headers?
Description: Consider the way you store secrets and handle exceptions. Here are some considerations.

Assessment: Are the frameworks and libraries used by the application secure?
Description: Evaluate frameworks and libraries used by the application and the resulting vulnerabilities.

Next steps
See these best practices related to PaaS applications.
Securing PaaS deployments
Secure communication paths between applications and services. Make sure that there's a distinction between
the endpoints exposed to the public internet and private ones. Also make sure that the public endpoints are protected with a web
application firewall.
Network security
Application classification for security

Azure can host both legacy and modern applications through Infrastructure as a Service (IaaS) virtual machines
and Platform as a Service (PaaS). With legacy applications, you have the responsibility of securing all
dependencies including OS, middleware, and other components. For PaaS applications, you don't need to manage
and secure the underlying server OS. You are responsible for the application configuration.
This article describes the considerations for understanding the hosting models and the security responsibility of
each, and for identifying critical applications.

Understand your responsibility as an owner

Securing an application requires security assurances for three aspects:


Application code. The logic that defines the custom application that you write. Securing that code requires
identifying and mitigating risks from the design and implementation of the application and assessing
supply chain risk of included components.
Application services. The cloud services that the application uses such as databases, identity providers,
event hubs, IoT device management, and so on. Security for cloud services is a shared responsibility. The
cloud provider ensures the security of the underlying service. The application owner is responsible for
security implications of the configuration and operation of the service instance(s) used by the application
including any data stored and processed on the service.
Application hosting platform. The computing environment where the application runs. This could take
many forms with significant variations on who is responsible for security:
Legacy applications typically require a full operating system (and any middleware) hosted on
physical or virtualized hardware. This operating system and installed middleware/other components
are operated and secured by the application owner or their infrastructure team(s). The security
responsibility for the physical hardware and OS virtualization components (virtualization hosts,
operating systems, and management services) varies:
On-premises - The application owner is responsible for maintenance and security.
IaaS – The cloud provider is responsible for the underlying infrastructure and the application
owner’s organization is responsible for the VM configuration, operating system, and any
components installed on it.
Modern applications are hosted on PaaS environments such as Azure App Service. The
underlying operating system is secured by the cloud provider. Application owners are responsible for
the security of the application service configurations.
Containers are an application packaging mechanism in which applications are abstracted from the
environment in which they run. The containerized applications can run on a container service by the
cloud provider (modern applications) or on a server managed on premises or in IaaS.

Identify and classify applications


Identify applications that have a high potential impact and/or a high potential exposure to risk.
Business critical data. Applications that process or store business-critical information must have assurance of confidentiality,
integrity, and availability.
Regulated data. Applications that handle monetary instruments and sensitive personal information regulated
by standards such as the Payment Card Industry Data Security Standard (PCI DSS), the General Data Protection Regulation (GDPR), and the Health
Insurance Portability and Accountability Act (HIPAA).
Business critical availability. Applications whose functionality is critical to the business mission, such as
production lines generating revenue, devices or services critical to life and safety, and other critical functions.
Significant access. Applications that have access to systems with a high impact through technical means
such as:
Stored credentials or keys/certificates that grant access to the data/service.
Permissions granted through access control lists or other methods.
High exposure to attacks. Applications that are easily accessible to attackers, such as web applications on the
public internet. Legacy applications can also have higher exposure because attackers (and penetration testers)
frequently target them, knowing these legacy applications often have vulnerabilities that are difficult
to fix.

Use Azure services for fundamental components


Use Azure services for fundamental components such as databases, encryption, identity directory, and authentication instead of developing custom
solutions.
Identity – User directories and other authentication functions are complex to develop and critically
important to security assurances. Avoid custom authentication solutions. Instead choose native capabilities
like Azure Active Directory (Azure AD), Azure AD B2B, Azure AD B2C, or third-party solutions to
authenticate and grant permission to users, partners, customers, applications, services, and other entities.
For more information, see Security with identity and access management (IAM) in Azure.
Data protection – Use established capabilities from cloud providers, such as native encryption in cloud
services, to encrypt and protect data. If direct use of cryptography is required, use well-established
cryptographic algorithms and do not attempt to invent your own.
Key management – Always authenticate with identity services rather than handling cryptographic keys. For
situations where you need to use keys, use a managed key store such as Azure Key Vault. This will make sure
keys are handled safely in application code. You can use CredScan to discover potentially exposed keys in
your application code.
Application configurations – Inconsistent configurations for applications can create security risks. Azure
App Configuration provides a service to centrally manage application settings and feature flags, which
helps mitigate this risk.

Use native capabilities


Use native security capabilities built into cloud services instead of adding external security components, such as
data encryption, network traffic filtering, threat detection, and other functions.
Azure controls are maintained and supported by Microsoft. You don't have to invest in additional security tooling.
List of Azure Services
Native security capabilities of each service

Next steps
Applications and services
Application classification
Application threat analysis
Regulatory compliance
Application threat analysis

Do a comprehensive analysis to identify threats, attacks, vulnerabilities, and countermeasures. Having this
information helps you protect the application and understand the threats it might pose to the system. Start with simple questions to gain
insight into potential risks. Then, progress to advanced techniques using threat modeling.

1- Gather information about the basic security controls


Start by gathering information about each component of the application. The answers to these questions will
identify gaps in basic protection and help you understand the attack vectors.

Ask this question: Are connections authenticated using Azure AD, TLS (with mutual authentication), or another modern security protocol approved by the security team? This applies between users and the application, and between different application components and services.
To determine controls that: Prevent unauthorized access to the application component and data.

Ask this question: Are you limiting access to only those accounts that have the need to write or modify data in the application?
To determine controls that: Prevent unauthorized data tampering or alteration.

Ask this question: Is the application activity logged and fed into a Security Information and Event Management (SIEM) solution through Azure Monitor or a similar solution?
To determine controls that: Detect and investigate attacks quickly.

Ask this question: Is critical data protected with encryption that has been approved by the security team?
To determine controls that: Prevent unauthorized copying of data at rest.

Ask this question: Is inbound and outbound network traffic encrypted using TLS?
To determine controls that: Prevent unauthorized copying of data in transit.

Ask this question: Is the application protected against Distributed Denial of Service (DDoS) attacks using services such as Azure DDoS Protection?
To determine controls that: Detect attacks designed to overload the application so it can't be used.

Ask this question: Does the application store any logon credentials or keys to access other applications, databases, or services?
To determine controls that: Identify whether an attack can use your application to attack other systems.

Ask this question: Do the application controls allow you to fulfill regulatory requirements?
To determine controls that: Protect the user's private data and avoid compliance fines.

2- Evaluate the application design progressively


Analyze application components and connections and their relationships. Threat modeling is a crucial engineering
exercise that includes defining security requirements, identifying and mitigating threats, and validating those
mitigations. This technique can be used at any stage of application development or production, but it's most
effective during the design stages of a new functionality.
Popular methodologies include:
STRIDE:
Spoofing
Tampering
Repudiation
Information Disclosure
Denial of Service
Elevation of Privilege
Microsoft Security Development Lifecycle uses STRIDE and provides a tool to assist with this process. This tool is
available at no additional cost. For more information, see Microsoft Threat Modeling Tool.
Open Web Application Security Project (OWASP) has documented a threat modeling approach for applications.

Integrate threat modeling through automation using secure operations. Here are some resources:
Toolkit for Secure DevOps on Azure.
Guidance on DevOps pipeline security by OWASP.

3- Mitigate the identified threats


The threat modeling tool produces a report of all the threats identified. After a potential threat is identified,
determine how it can be detected and how you will respond to that attack.
Use the Defense-in-Depth approach. This can help identify controls needed in the design to mitigate risk if a
primary security control fails. Evaluate how likely it is for the primary control to fail. If it does, what is the extent of
the potential organizational risk? Also, what is the effectiveness of the additional control (especially in cases that
would cause the primary control to fail)? Based on the evaluation, apply Defense-in-Depth measures to address
potential failures of security controls.
The principle of least privilege is one way of implementing Defense-in-Depth. It limits the damage that can be
done by a single account. Grant accounts only the least privileges they need to accomplish their required tasks, and
only for the time period in which they need them. This helps mitigate the damage of an attacker who gains access to the
account to compromise security assurances.
How are threats addressed once found?

Here are some best practices:


Make sure the results are communicated to the interested teams.
Upload the threat modeling report to a tracking tool. Create work items that can be validated and addressed by
the developers. Cyber security teams can also use the report to determine attack vectors during a penetration
test.
As new features are added to the application, update the threat model report and integrate it into the code
management process. Triage security issues into the next release cycle or a faster release, depending on the
severity.
For information about mitigation strategies, see RapidAttack.
How long does it typically take to deploy a security fix into production?

If a security vulnerability is discovered, update the software with the fix as soon as possible. Have processes, tools,
and approvals in place to roll out the fix quickly.
Next steps
Applications and services
Application classification
Regulatory compliance
Regulatory compliance

Cloud architectures can have challenges when strictly applying standards to workloads. This article provides
some security recommendations for adhering to standards.

Gather regulatory requirements


What do the law, authorities, and regulators require?

Governments and regulatory organizations frequently publish standards to help define good security practices so
that organizations can avoid negligence. The purpose and scope of these standards and regulations vary, but the
security requirements can influence the design for data protection and retention, data privacy, and system
security.
Noncompliance can lead to fines or other business impact. Work with your regulators and carefully review the
standards to understand both the intent and the literal wording. Start by answering these questions:
What are the regulatory requirements for the solution?
How is compliance measured?
Who approves that the solution meets the requirements?
Are there processes for obtaining attestations?

Use the Microsoft Trust Center


Outdated security practices won’t sufficiently protect cloud workloads. Keep checking the Microsoft Trust Center
for the latest information, news, and best practices in security, privacy, and compliance.
Data governance. Focus on protecting information in cloud services, mobile devices, workstations, or
collaboration platforms. Build the security strategy by classifying and labeling information. Use strong
access control and encryption technology.
Compliance offerings. Microsoft offers a comprehensive set of compliance offerings to help your
organization comply with national, regional, and industry-specific requirements governing the collection
and use of data. For information, see Compliance offerings.
Compliance score. Use Microsoft Compliance Score to assess your data protection controls on an
ongoing basis. Act on the recommendations to make progress toward compliance.
Audit reports. Use audit reports to stay current on the latest privacy, security, and compliance-related
information for Microsoft’s cloud services. See Audit Reports.
Shared responsibility. The workload can be hosted on Software as a Service (SaaS), Platform as a
Service (PaaS), Infrastructure as a Service (IaaS), or in an on-premises datacenter. Have a clear
understanding about the portions of the architecture for which you are responsible versus Azure.
Whatever the hosting model, the following responsibilities are always retained by you:
Data
Endpoints
Account
Access management
For more information, see Shared responsibility in the cloud.

Next steps
Applications and services
Application classification
Application threat analysis
Secure application configuration and dependencies

Security of an application that is hosted in Azure is a shared responsibility between you as the application owner
and Azure. For IaaS, you are responsible for configurations related to the VM, the operating system, and the components
installed on it. For PaaS, you are responsible for the security of the application service configuration and for making
sure that the dependencies used by the application are also secure.

Key points
Do not store secrets in source code or configuration files. Instead keep them in a secure store, such as Azure
Key Vault.
Do not expose detailed error information when handling application exceptions.
Do not expose platform-specific information.
Restrict access to Azure resources that do not meet the security requirements.
Validate the security of any open-source code added to your application.
Update frameworks and libraries as part of the application lifecycle.

Configuration security
During the design phase, consider the way you store secrets and handle exceptions. Here are some considerations.
Are application keys and secrets stored securely?

Configuration within the application can include secrets like database connection strings, certificate keys, and so
on. Do not store secrets in source code or configuration files. Instead keep them in a secure store, such as Azure
Key Vault. If you need to store secrets in code, identify them with static code scanning tools. Add the scanning
process to your continuous integration (CI) pipeline.
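As a minimal sketch, the following code retrieves a connection string from Azure Key Vault at runtime instead of reading it from a configuration file. The vault URL and secret name are hypothetical placeholders, and the application's identity is assumed to have permission to read secrets.

```python
# Sketch: read a secret from Azure Key Vault instead of storing it in config files.
# Assumes azure-identity and azure-keyvault-secrets are installed.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

vault_url = "https://<your-vault>.vault.azure.net"   # placeholder
client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())

# Hypothetical secret name; the value never appears in source control.
connection_string = client.get_secret("sql-connection-string").value
```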
Are errors and exceptions handled properly without exposing that information to users?

When handling application exceptions, make the application fail gracefully and log the error. Do not provide
detailed information related to the failure, such as call stack, SQL queries, or out of range errors. This information
can provide attackers with valuable information about the internals of the application.
Is platform-specific information removed from server-client communication?

Do not reveal information about the application platform. Such information (for example, "X-Powered-By" or "X-ASPNET-VERSION"
headers) can be exposed through HTTP banners, HTTP headers, error messages, and website footers.
Malicious actors can use this information when mapping attack vectors of the application.
Consider using Azure CDN to separate the hosting platform from end users. Azure API Management offers
transformation policies that allow you to modify HTTP headers and remove sensitive information.
Are Azure policies used to control the configuration of the solution resources?

Use Azure Policy to deploy desired settings where applicable. Block resources that do not meet the proper security
requirements defined during service enablement.
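For example, the sketch below assigns an existing policy definition at a resource group scope using the Azure SDK for Python. The scope, assignment name, and policy definition ID are placeholders; look up the actual definition ID of the built-in or custom policy you want to enforce.

```python
# Sketch: assign a policy definition to a scope so non-compliant resources
# are audited or denied. Assumes azure-identity and azure-mgmt-resource are
# installed; IDs and names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import PolicyClient
from azure.mgmt.resource.policy.models import PolicyAssignment

subscription_id = "<subscription-id>"
scope = f"/subscriptions/{subscription_id}/resourceGroups/rg-app"   # placeholder scope

policy_client = PolicyClient(DefaultAzureCredential(), subscription_id)

assignment = policy_client.policy_assignments.create(
    scope,
    "require-secure-transfer",    # hypothetical assignment name
    PolicyAssignment(
        policy_definition_id=(
            "/providers/Microsoft.Authorization/policyDefinitions/<definition-guid>"  # placeholder
        ),
        display_name="Secure transfer to storage accounts should be enabled",
    ),
)
print(assignment.id)
```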

Dependencies, frameworks, and libraries


What are the frameworks and libraries used by the application?

It's important to be aware of the implications of using third-party frameworks and libraries in your application
code. These components can result in vulnerabilities. Here are some best practices:
Validate the security of any open-source code added to your application. Tools that can help this
assessment:
OWASP Dependency-Check
NPM audit
Maintain a list of frameworks and libraries as part of the application inventory. Also, keep track of versions
in use.
Update frameworks and libraries as part of the application lifecycle. Prioritize critical security patches.

Referenced Azure services


Azure Key Vault
Azure CDN
Azure Policy

Next steps
Applications and services
Application classification
Application threat analysis
Regulatory compliance
Security operations

Security operations maintain and restore the security assurances of the system as live adversaries attack it. The
tasks of security operations are described well by the NIST Cybersecurity Framework functions of Detect, Respond,
and Recover.
Detect - Security operations must detect the presence of adversaries in the system, who in most cases are incentivized to stay
hidden because this allows them to achieve their objectives unimpeded. This can take the form of
reacting to an alert of suspicious activity or proactively hunting for anomalous events in the enterprise
activity logs.
Respond – Upon detection of potential adversary action or campaign, security operations must rapidly
investigate to identify whether it is an actual attack (true positive) or a false alarm (false positive) and then
enumerate the scope and goal of the adversary operation.
Recover – The ultimate goal of security operations is to preserve or restore the security assurances
(confidentiality, integrity, availability) of business services during and after an attack.
The most significant security risk most organizations face is from human attack operators (of varying skill levels).
This is because the risk from automated/repeated attacks has been mitigated significantly for most organizations by
signature and machine learning based approaches built into anti-malware (though there are notable exceptions
like Wannacrypt and NotPetya, which moved faster than these defenses).
While human attack operators are challenging to face because of their adaptability (vs. automated/repeated logic),
they are operating at the same “human speed” as defenders, which helps level the playing field.
Security Operations (sometimes referred to as a Security Operations Center (SOC)) has a critical role to play in
limiting the time and access an attacker can get to valuable systems and data. Each minute that an attacker has in
the environment allows them to continue to conduct attack operations and access sensitive/valuable systems.

Objective and metrics


The metrics you measure will have a significant effect on the behaviors and outcomes of security operations.
Focusing on the right measurements will help drive continuous improvement in the right areas that meaningfully
reduce risk.
To ensure that security operations are effectively containing attackers' access, the objectives should focus on:
Reducing time to acknowledge an alert to ensure that detected adversaries are not ignored while
defenders spend time investigating false positives.
Reducing time to remediate a discovered adversary to reduce their opportunity to conduct an
attack and reach sensitive systems.
Prioritizing security investments into systems that have high intrinsic value (likely targets / high business
impact) and access to many systems or sensitive systems (administrator accounts and sensitive users).
Increasing focus on proactively hunting for adversaries as your program matures and reactive incidents get
under control. This is focused on reducing the time that a higher-skilled adversary can operate in the
environment (for example, skilled enough to evade reactive alerts).
For more information on how Microsoft’s SOC uses these metrics, see https://aka.ms/ITSOC.
Hybrid enterprise view
Security operations should ensure their tooling, processes, and analyst skillsets enable visibility across the full span
of their hybrid estate.
Attackers don’t restrict their actions to a particular environment when targeting an organization; they will attack
resources on any platform using any method available. Enterprise organizations adopting cloud services like Azure
and AWS are effectively operating a hybrid of cloud and on-premises assets.
Security operations tooling and processes should be designed for attacks on cloud and on-premises assets as well
as attackers pivoting between cloud and on-premises resources using identity or other means. This enterprise-wide
view will enable security operations teams to rapidly detect, respond, and recover from attacks, reducing
organizational risk.

Leverage native detections and controls


You should favor the use of security detections and controls built into the cloud platform before creating custom
detections using event logs from the cloud.
Cloud platforms evolve rapidly with new features, which can make maintaining detections challenging. Native
controls are maintained by the cloud provider and are typically high quality (low false positive rate).
Because many organizations may use multiple cloud platforms and need a unified view across the enterprise, you
should ensure these native detections and controls feed a centralized SIEM or other tool. We don’t recommend
trying to substitute generalized log analysis tools and queries for native detections and controls. These tools
can offer value for proactive hunting activities, but getting to a high-quality alert with these tools
requires application of deep expertise and time that could be better spent on hunting and other activities.
To complement the broad visibility of a centralized SIEM (like Azure Sentinel, Splunk, or QRadar), you should
leverage native detections and controls such as:
Organizations using Azure should leverage capabilities like Azure Security Center for alert generation on the
Azure platform.
Organizations should leverage native logging capabilities like Azure Monitor and AWS CloudTrail for pulling
logs into a central view.
Organizations using Azure should leverage Network Security Group (NSG) capabilities for visibility into
network activities on the Azure platform.
Investigation practices should leverage native tools with deep knowledge of the asset type such as an
Endpoint Detection and Response (EDR) solution, identity tools, and Azure Sentinel.

Prioritize alert and log integration


Ensure that you are integrating critical security alerts and logs into SIEMs without introducing a high volume of low
value data.
Introducing too much low value data can increase SIEM cost, increase noise and false positives, and lower
performance.
The data you collect should be focused on supporting one or more of these operations activities:
Alerts (detections from existing tools or data required for generating custom alerts)
Investigation of an incident (for example, data required for common queries)
Proactive hunting activities
Integrating more data can allow you to enrich alerts with additional context that enables rapid response and
remediation (filter false positives, elevate true positives, and so on), but collection is not detection. If you don’t have a
reasonable expectation that the data will provide value (for example, a high volume of firewall deny events), you
may deprioritize integration of these events.
Governance, risk, and compliance

Organizations of all sizes are constrained by their available resources: financial, people, and time. To achieve an
effective return on investment (ROI), organizations must prioritize where they will invest. Implementation of security
across the organization is also constrained by this, so to achieve an appropriate ROI on security the organization
needs to first understand and define its security priorities.
Governance: How is the organization’s security going to be monitored, audited, and reported? Design and
implementation of security controls within an organization is only the beginning of the story. How does the
organization know that things are actually working? Are they improving? Are there new requirements? Is there
mandatory reporting? Similar to compliance, there may be external industry, government, or regulatory standards
that need to be considered.
Risk: What types of risks does the organization face while trying to protect identifiable information, intellectual
property (IP), and financial information? Who may be interested in or could use this information if stolen, considering
external and internal threats as well as unintentional and malicious actors? A commonly forgotten but extremely important
consideration within risk is addressing disaster recovery and business continuity.
Compliance: Are there specific industry, government, or regulatory requirements that dictate or provide
recommendations on criteria that your organization’s security controls must meet? Examples of such standards,
organizations, controls, and legislation are ISO 27001, NIST, and PCI DSS.
The collective role of organization(s) is to manage the security standards of the organization through their lifecycle:
Define - Set organizational standards and policies for practices, technologies, and configurations based on
internal factors (organizational culture, risk appetite, asset valuation, business initiatives, etc.) and external
factors (benchmarks, regulatory standards, threat environment, and more)
Improve – Continually push these standards incrementally forward towards the ideal state to ensure
continual risk reduction.
Sustain – Ensure the security posture doesn’t degrade naturally over time by instituting auditing and
monitoring compliance with organizational standards.

Prioritize security best practices investments


Security best practices are ideally applied proactively and completely to all systems as you build your cloud
program, but this isn’t reality for most enterprise organizations. Business goals, project constraints, and other
factors often cause organizations to balance security risk against other risks and apply a subset of best practices at
any given point.
We recommend applying as many of the best practices as early as possible, and then working to retrofit any
gaps over time as you mature your security program. We recommend evaluating the following considerations
when prioritizing which to follow first:
High business impact and highly exposed systems: These include systems with direct intrinsic value as
well as the systems that provide attackers a path to them. For more information, see Identify and classify
business critical applications.
Easiest to implement mitigations: Identify quick wins by prioritizing the best practices that your
organization can execute quickly because you already have the required skills, tools, and knowledge to do it
(for example, implementing a Web Application Firewall (WAF) to protect a legacy application). Be careful not to
exclusively use (or overuse) this short-term prioritization method. Doing so can increase your risk by
preventing your program from growing and leaving critical risks exposed for extended periods.
Microsoft has provided some prioritized lists of security initiatives to help organizations start with these decisions
based on our experience with threats and mitigation initiatives in our own environments and across our customers.
See Module 4a of the Microsoft CISO Workshop.

Operationalize Azure Secure Score


Assign stakeholders to use Secure Score in Azure Security Center to monitor risk profile and continuously improve
security posture. This will help you continuously improve your security posture by rapidly identifying and
remediating common security hygiene risks, which can significantly reduce your organization's risk. Set up a
regular cadence (typically monthly) to review the Azure Secure Score and plan initiatives with specific improvement
goals. You should also look into gamifying the activity if possible to increase engagement and focus from the
responsible teams (for example, creating a friendly competition over which team increased its score more).
NOTE: Security posture management is an emerging function. This may be a single dedicated team, or the
responsibility may be taken on by one or more existing teams described below.
Monitor Secure Score
Score area: Monitor Secure Score and vulnerability management; follow up with technical teams.
Responsible team: Security Posture Management Team (or Governance/Risk/Compliance team), Architecture Team, and the responsible technical teams (listed below).

Improve Secure Score


Compute and apps resources:
App Services – Application Development/Security Team(s)
Containers – Application Development and/or Infrastructure/IT Operations
VMs/Scale sets/compute – IT/Infrastructure Operations

Data and storage resources:
SQL/Redis/Data Lake Analytics/Data Lake Store – Database Team
Storage Accounts – Storage/Infrastructure Team

Identity and access resources:
Subscriptions – Identity Team(s)
Key Vault – Information/Data Security Team

Networking resources:
Networking Resources – Networking Team and Network Security Team

IoT security:
IoT Resources – IoT Operations Team

NOTE: In the DevOps model, some application teams may be responsible for their own application resources.

The Azure Secure Score sample shows how to get your Azure Secure Score for a subscription by calling the
Azure Security Center REST API. The API methods provide the flexibility to query the data and build your own
reporting mechanism of your secure scores over time.
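As a hedged sketch of that approach, the following Python code requests the secure scores for a subscription from the Azure Resource Manager endpoint. The api-version shown is an assumption based on the Security Center secureScores API; adjust it to a version supported in your environment.

```python
# Sketch: query Azure Secure Score for a subscription via the REST API.
# Assumes azure-identity and requests are installed; the api-version is an assumption.
import requests
from azure.identity import DefaultAzureCredential

subscription_id = "<subscription-id>"   # placeholder
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    "/providers/Microsoft.Security/secureScores"
)
response = requests.get(
    url,
    params={"api-version": "2020-01-01"},           # assumed api-version
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()

for score in response.json().get("value", []):
    props = score.get("properties", {})
    print(score.get("name"), props.get("score"))
```

Results retrieved this way can be pushed into the reporting tool of your choice to track trends over time.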

Manage connected tenants


Ensure your security organization is aware of all enrollments and associated subscriptions connected to your
existing environment (via ExpressRoute or site-to-site VPN) and that they are monitored as part of the overall enterprise.
These Azure resources are part of your enterprise environment and security organizations require visibility into
them. Security organizations need this access to assess risk and to identify whether organizational policies and
applicable regulatory requirements are being followed.
Ensure all Azure environments that connect to your production environment/network apply your organization’s
policy and IT governance controls for security. You can discover existing connected tenants using a tool provided by
Microsoft. Guidance on permissions you may assign to security is in the Assign privileges for managing the
environment section.

Clear lines of responsibility


Designate the parties responsible for specific functions in Azure.
Clearly documenting and sharing the contacts responsible for each of these functions will create consistency and
facilitate communication. Based on our experience with many cloud adoption projects, this will avoid confusion that
can lead to human and automation errors that create security risk.
Designate groups (or individual roles) that will be responsible for these key functions:

Network Security – Typically the existing network security team. Configuration and maintenance of Azure Firewall, Network Virtual Appliances (and associated routing), WAFs, NSGs, ASGs, etc.
Network Management – Typically the existing network operations team. Enterprise-wide virtual network and subnet allocation.
Server Endpoint Security – Typically IT operations, security, or jointly. Monitor and remediate server security (patching, configuration, endpoint security, etc.).
Incident Monitoring and Response – Typically the security operations team. Investigate and remediate security incidents in a Security Information and Event Management (SIEM) or source console.
Policy Management – Typically the GRC team + Architecture. Set direction for use of Role-Based Access Control (RBAC), Azure Security Center, the administrator protection strategy, and Azure Policy to govern Azure resources.
Identity Security and Standards – Typically the Security Team + Identity Team jointly. Set direction for Azure AD directories, PIM/PAM usage, MFA, password/synchronization configuration, and Application Identity Standards.

Enterprise segmentation strategy


Identify groups of resources that can be isolated from other parts of the enterprise to contain (and detect)
adversary movement within your enterprise. This unified enterprise segmentation strategy will guide all technical
teams to consistently segment access using networking, applications, identity, and any other access controls.
A clear and simple segmentation strategy helps contain risk while enabling productivity and business operations.
An enterprise segmentation strategy is defined at a higher level than a traditional “network segmentation” security strategy.
Traditional segmentation approaches for on-premises environments frequently failed to achieve their goals because
they were developed “bottom-up” by different technical teams and were not aligned well with business use cases
and application workloads. This resulted in overwhelming complexity that generates support issues and often
undermines the original purpose with broad network firewall exceptions.
Creating a unified enterprise segmentation strategy that is built around business risks and needs guides all technical
teams and stakeholders (IT, security, applications, business units) to understand and support the security containment
promises. This clarity and alignment also reduces the risk of human errors and automation failures that can lead to
security vulnerabilities, operational downtime, or both.
While network micro-segmentation also offers promise to reduce risk (discussed more in the Network Security and
Containment section), it doesn’t eliminate the need to align technical teams. Micro-segmentation should be
considered only after the technical teams are aligned, so you can avoid a recurrence of the internal conflicts and
confusion that plagued the on-premises network segmentation strategies of earlier generations.
Here are Microsoft's recommendations for prioritizing initiatives on containment and segmentation (based on Zero
Trust principles). These recommendations are listed in priority order by highest importance.
Ensure alignment of technical teams to a single enterprise segmentation strategy.
Invest in broadening containment by establishing a modern perimeter based on zero-trust principles focused
on identity, device, applications, and other signals (to overcome the limitations of network controls in protecting new
resources and attack types).
Bolster network controls for legacy applications by exploring micro-segmentation strategies.
A good enterprise segmentation strategy meets these criteria:
Enables Operations – Minimizes operational friction by aligning to business practices and applications
Contains Risk - Adds cost and friction to attackers by
Isolating sensitive workloads from compromise of other assets
Isolating high exposure systems from being used as a pivot to other systems
Monitored – Security Operations should monitor for potential violations of the integrity of the segments
(account usage, unexpected traffic, etc.)

Security team visibility


Provide security teams read-only access to the security aspects of all technical resources in their purview.
Security organizations require visibility into the technical environment to perform their duties of assessing and
reporting on organizational risk. Without this visibility, security has to rely on information provided by the
groups operating the environment, who have a potential conflict of interest (and other priorities).
Note that security teams may separately be granted additional privileges if they have operational responsibilities or
a requirement to enforce compliance on Azure resources.
For example, in Azure, assign security teams the Security Reader permission, which provides access to measure
security risk (without providing access to the data itself). A sketch of this assignment appears at the end of this section.
For enterprise security groups with broad responsibility for security of Azure, you can assign this permission using:
Root management group – for teams responsible for assessing and reporting risk on all resources
Segment management group(s) – for teams with limited scope of responsibility (typically required because
of organizational boundaries or regulatory requirements)
Because security will have broad access to the environment (and visibility into potentially exploitable
vulnerabilities), you should consider them critical impact accounts and apply the same protections as
administrators. The Administration section details these controls for Azure.
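The following sketch grants a security team's Azure AD group the built-in Security Reader role at a management group scope by calling the role assignments REST API from Python. The group object ID, management group name, role definition GUID, and api-version are placeholders or assumptions; look up the Security Reader definition ID in your tenant before using this.

```python
# Sketch: assign the Security Reader role to a security team's group at
# management group scope. IDs and api-version are placeholders/assumptions.
import uuid
import requests
from azure.identity import DefaultAzureCredential

management_group = "<management-group-name>"        # placeholder
scope = f"/providers/Microsoft.Management/managementGroups/{management_group}"
role_definition_id = (
    "/providers/Microsoft.Authorization/roleDefinitions/<security-reader-guid>"   # placeholder
)
principal_id = "<security-team-group-object-id>"    # placeholder Azure AD group

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
assignment_name = str(uuid.uuid4())                 # role assignment names are GUIDs

response = requests.put(
    f"https://management.azure.com{scope}"
    f"/providers/Microsoft.Authorization/roleAssignments/{assignment_name}",
    params={"api-version": "2020-04-01-preview"},   # assumed api-version
    headers={"Authorization": f"Bearer {token}"},
    json={
        "properties": {
            "roleDefinitionId": role_definition_id,
            "principalId": principal_id,
        }
    },
)
response.raise_for_status()
print(response.json()["id"])
```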

Assign privileges for managing the environment


Grant roles with operational responsibilities in Azure the appropriate permissions based on a clearly documented
strategy built from the principle of least privilege and your operational needs.
Providing clear guidance that follows a reference model reduces risk because it provides clarity for
your technical teams implementing these permissions. This clarity makes it easier to detect and correct human
errors like overpermissioning, reducing your overall risk.
Microsoft recommends starting from these Microsoft reference models and adapting to your organization.

Core Services Reference Permissions


This segment hosts shared services utilized across the organization. These shared services typically include Active
Directory Domain Services, DNS/DHCP, System Management Tools hosted on Azure Infrastructure as a Service
(IaaS) virtual machines.
Security Visibility across all resources – For security teams, grant read-only access to security attributes for all
technical environments. This access level is needed to assess risk factors, identify potential mitigations, and advise
organizational stakeholders who accept the risk. See Security Team Visibility for more details.
Policy management across some or all resources – To monitor and enforce compliance with external (or
internal) regulations, standards, and security policy, assign appropriate permission to those roles. The roles and
permissions you choose will depend on the organizational culture and expectations of the policy program. See
Microsoft Cloud Adoption Framework for Azure.
Central IT operations across all resources – Grant permissions to the central IT department (often the
infrastructure team) to create, modify, and delete resources like virtual machines and storage.
Central networking group across network resources – To ensure consistency and avoid technical conflicts,
assign network resource responsibilities to a single central networking organization. These resources should
include virtual networks, subnets, Network Security Groups (NSG), and the virtual machines hosting virtual
network appliances. See Centralize Network Management And Security for more details.
Resource Role Permissions – For most core services, administrative privileges required to manage them are
granted via the application itself (Active Directory, DNS/DHCP, System Management Tools, etc.), so no additional
Azure resource permissions are required. If your organizational model requires these teams to manage their own
VMs, storage, or other Azure resources, you can assign these permissions to those roles.
Service admin (Break Glass Account) – Use the service admin role only for emergencies (and initial setup if
required). Do not use this role for daily tasks. See Emergency Access (‘Break Glass’ Accounts) for more details.
Segment reference permissions
This segment permission design provides consistency while allowing flexibility to accommodate the range of
organizational models from a single centralized IT group to mostly independent IT and DevOps teams.
Security visibility across all resources – For security teams, grant read-only access to security attributes for all
technical environments. This access level is needed to assess risk factors, identify potential mitigations, and advise
organizational stakeholders who accept the risk. See Security Team Visibility.
Policy management across some or all resources – To monitor and enforce compliance with external (or
internal) regulations, standards, and security policy assign appropriate permission to those roles. The roles and
permissions you choose will depend on the organizational culture and expectations of the policy program. See
Microsoft Cloud Adoption Framework for Azure.
IT Operations across all resources – Grant permission to create, modify, and delete resources. The purpose of
the segment (and resulting permissions) will depend on your organization structure.
Segments with resources managed by a centralized IT organization can grant the central IT department
(often the infrastructure team) permission to modify these resources.
Segments managed by independent business units or functions (such as a Human Resources IT Team) can
grant those teams permission to all resources in the segment.
Segments with autonomous DevOps teams don’t need to grant permissions across all resources because the
resource role (below) grants permissions to application teams. For emergencies, use the service admin
account (break-glass account).
Central networking group across network resources – To ensure consistency and avoid technical conflicts,
assign network resource responsibilities to a single central networking organization. These resources should
include virtual networks, subnets, Network Security Groups (NSG), and the virtual machines hosting virtual
network appliances. See Centralize Network Management And Security.
Resource Role Permissions – Segments with autonomous DevOps teams will manage the resources associated
with each application. The actual roles and their permissions depend on the application size and complexity, the
application team size and complexity, and the culture of the organization and application team.
Service Admin (Break Glass Account) – Use the service admin role only for emergencies (and initial setup if
required). Do not use this role for daily tasks. See Emergency Access (‘Break Glass’ Accounts) for more details.
Permission Guidance and Tips
To drive consistency and ensure application to future subscriptions, permissions should be assigned at the
management group for the segment rather than at individual subscriptions. See Avoid Granular and
Custom Permissions for more details.
You should first review the built-in roles to see if one is applicable before creating a custom role to grant the
appropriate permissions to VMs and other objects. See Use Built in Roles for more details.
Security managers group membership may be appropriate for smaller teams/organizations where
security teams have extensive operational responsibilities.

Establish segmentation with management groups


Structure management groups into a simple design that guides the enterprise segmentation model.
Management groups offer the ability to consistently and efficiently manage resources (including multiple
subscriptions as needed). However, because of their flexibility, it's possible to create an overly complex design.
Complexity creates confusion and negatively impacts both operations and security (as illustrated by overly complex
Organizational Unit (OU) and Group Policy Object (GPO) designs for Active Directory).
Microsoft recommends aligning the top level of management groups (MGs) into a simple enterprise segmentation
strategy limited to 1 or 2 levels.
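A hedged sketch of that shape, calling the management groups REST API from Python, is shown below. The group names and api-version are assumptions; substitute the names from your own segmentation model.

```python
# Sketch: create a simple two-level management group hierarchy via the
# Azure Management Groups REST API. Names and api-version are assumptions.
import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
headers = {"Authorization": f"Bearer {token}"}
api_version = "2020-05-01"   # assumed api-version

def put_management_group(group_id, display_name, parent_id=None):
    """Create or update a management group, optionally under a parent."""
    body = {"properties": {"displayName": display_name}}
    if parent_id:
        body["properties"]["details"] = {"parent": {"id": parent_id}}
    response = requests.put(
        f"https://management.azure.com/providers/Microsoft.Management/managementGroups/{group_id}",
        params={"api-version": api_version},
        headers=headers,
        json=body,
    )
    response.raise_for_status()   # creation may return 202 (accepted, async)
    return response.status_code

# Top level mirrors the enterprise segmentation model (hypothetical names).
put_management_group("mg-enterprise", "Enterprise Segmentation Root")
put_management_group(
    "mg-segment-payments", "Payments Segment",
    parent_id="/providers/Microsoft.Management/managementGroups/mg-enterprise",
)
```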

Use root management group carefully


Use the Root Management Group (MG) for enterprise consistency, but test changes carefully to minimize risk of
operational disruption.
The root management group enables you to ensure consistency across the enterprise by applying policies,
permissions, and tags across all subscriptions. Care must be taken when planning and implementing assignments
to the root management group because this can affect every resource on Azure and potentially cause downtime or
other negative impacts on productivity in the event of errors or unanticipated effects.
Root management group guidance:
Plan Carefully - Select enterprise-wide elements for the root management group that have a clear
requirement to be applied across every resource and/or a low potential impact.
Good candidates include:
Regulatory requirements with clear business risk/impact (for example, restrictions related to data
sovereignty).
Near-zero potential negative impact on operations, such as policy with audit effect, tag
assignment, or RBAC permission assignments that have been carefully reviewed.
Test First - Carefully test all enterprise-wide changes on the root management group before applying
them (policy, tags, RBAC model, etc.), using a:
Test Lab - Representative lab tenant or lab segment in the production tenant.
Production Pilot - Segment MG or designated subset of subscription(s)/MG (an audit-only pilot
assignment is sketched after this list).
Validate Changes – to ensure they have the desired effect.
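One way to apply the "test first" advice for policy is to assign the policy in audit-only mode before enforcing it. The sketch below is an assumption-laden illustration, not an official sample: the root management group ID, the built-in policy definition GUID, and the api-version shown are placeholders you would verify for your tenant.

```python
# Minimal sketch: pilot a policy at the root management group with enforcementMode
# "DoNotEnforce" so it reports compliance without blocking deployments while you
# validate its impact. Definition ID and MG ID are placeholders.
import requests
from azure.identity import DefaultAzureCredential

ROOT_MG = "<tenant-root-group-id>"          # placeholder: often the Azure AD tenant ID
POLICY_DEFINITION_ID = (                    # placeholder built-in policy definition ID
    "/providers/Microsoft.Authorization/policyDefinitions/<definition-guid>"
)

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
scope = f"/providers/Microsoft.Management/managementGroups/{ROOT_MG}"
url = (
    f"https://management.azure.com{scope}"
    "/providers/Microsoft.Authorization/policyAssignments/audit-pilot"
    "?api-version=2019-09-01"               # assumed api-version; newer versions also exist
)
body = {
    "properties": {
        "displayName": "Root MG pilot (audit only)",
        "policyDefinitionId": POLICY_DEFINITION_ID,
        "enforcementMode": "DoNotEnforce",  # evaluate and report only; do not enforce
    }
}
resp = requests.put(url, json=body, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
print(resp.json()["id"])
```

Once the compliance results from the pilot look as expected, the same assignment can be switched to the default enforcement mode.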

Virtual Machine (VM) security updates and strong passwords


Ensure policy and processes enable (and require) rapid application of security updates to virtual machines.
Attackers constantly scan public cloud IP ranges for open management ports and attempt “easy” attacks like
common passwords and unpatched vulnerabilities.
Enable Azure Security Center to identify missing security updates and apply them.
Local Administrator Password Solution (LAPS) or a third-party privileged access management solution can set
strong local admin passwords and provide just-in-time access to them.

Remove Virtual Machine (VM) direct internet connectivity


Make sure policies and processes require restricting and monitoring direct internet connectivity by virtual
machines.
For Azure, you can enforce these policies in the following ways:
Enterprise-wide prevention - Prevent inadvertent exposure by following the permissions and roles
described in the reference model.
Ensures that network traffic is routed through approved egress points by default.
Exceptions (such as adding a public IP address to a resource) must go through a centralized group
that evaluates exception requests and makes sure appropriate controls are applied.
Identify and remediate exposed virtual machines by using the Azure Security Center network visualization
to quickly identify internet exposed resources.
Restrict management ports (RDP, SSH) using just-in-time access in Azure Security Center.
One way of managing VMs in the virtual network is by using Azure Bastion. This service allows you to log into VMs
in the virtual network through SSH or remote desktop protocol (RDP) without exposing the VMs directly to the
internet. To see a reference architecture that uses Bastion, see Network DMZ between Azure and an on-premises
datacenter.

Assign incident notification contact


Ensure a security contact receives Azure incident notifications from Microsoft, typically a notification that your
resource is compromised and/or attacking another customer.
This enables your security operations team to rapidly respond to potential security risks and remediate them.
Ensure the administrator contact information in the Azure enrollment portal includes contact information that will
notify security operations (directly or rapidly via an internal process).

Regularly review critical access


Regularly review roles that are assigned privileges with a business-critical impact.
Set up a recurring review pattern to ensure that accounts are removed from permissions as roles change. You can
conduct the review manually or through an automated process by using tools such as Azure AD access reviews.
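If you automate this with Azure AD access reviews, you can also confirm programmatically that the recurring reviews exist. The following sketch assumes the caller has been granted a suitable Microsoft Graph permission (for example AccessReview.Read.All); treat the endpoint and field names as assumptions to verify against the current Graph documentation.

```python
# Minimal sketch: list Azure AD access review definitions through Microsoft Graph
# so you can verify that a recurring review of business-critical roles is in place.
import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token("https://graph.microsoft.com/.default").token
url = "https://graph.microsoft.com/v1.0/identityGovernance/accessReviews/definitions"

resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
for review in resp.json().get("value", []):
    # displayName and status are standard properties of an access review schedule definition
    print(review.get("displayName"), "-", review.get("status"))
```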

Discover and remediate common risks


Identify well-known risks for your Azure tenants, remediate those risks, and track your progress using Secure Score.
Identifying and remediating common security hygiene risks significantly reduces overall risk to your organization
by increasing cost to attackers. When you remove cheap and well-established attack vectors, attackers are forced to
acquire and use advanced or untested attack methods.
Azure Secure Score in Azure Security Center monitors the security posture of machines, networks, storage and data
services, and applications to discover potential security issues (internet-connected VMs, missing security updates,
missing endpoint protection or encryption, deviations from baseline security configurations, missing Web
Application Firewall (WAF), and more). You should enable this capability (no additional cost), review the findings,
and follow the included recommendations to plan and execute technical remediations starting with the highest
priority items.
As you address risks, track progress and prioritize ongoing investments in your governance and risk reduction
programs.

Increase automation with Azure Blueprints


Use Azure’s native automation capabilities to increase consistency, compliance, and deployment speed for
workloads.
Automation of deployment and maintenance tasks reduces security and compliance risk by limiting opportunity to
introduce human errors during manual tasks. This will also allow both IT Operations teams and security teams to
shift their focus from repeated manual tasks to higher value tasks like enabling developers and business initiatives,
protecting information, and so on.
Utilize the Azure Blueprints service to rapidly and consistently deploy application environments that are compliant
with your organization’s policies and external regulations. Azure Blueprints automates deployment of
environments including RBAC roles, policies, resources (VM/Net/Storage/etc.), and more. Azure Blueprints builds on
Microsoft’s significant investment in Azure Resource Manager to standardize resource deployment in Azure
and enable resource deployment and governance based on a desired-state approach. You can use built-in
configurations in Azure Blueprints, make your own, or just use Resource Manager templates for smaller scope.
Several Security and Compliance Blueprints samples are available to use as a starting template.

Evaluate security using benchmarks


Use an industry standard benchmark to evaluate your organization’s current security posture.
Benchmarking allows you to improve your security program by learning from external organizations.
Benchmarking lets you know how your current security state compares to that of other organizations, providing
both external validation for successful elements of your current system as well as identifying gaps that serve as
opportunities to enrich your team’s overall security strategy. Even if your security program isn’t tied to a specific
benchmark or regulatory standard, you will benefit from understanding the documented ideal states by those
outside and inside of your industry.
As an example, the Center for Internet Security (CIS) has created security benchmarks for Azure that map to the
CIS Control Framework. Another reference example is the MITRE ATT&CK™ framework that defines the various
adversary tactics and techniques based on real-world observations. These external reference control mappings
help you understand any gaps between your current strategy and what other experts in the industry
recommend.

Audit and enforce policy compliance


Ensure that the security team is auditing the environment to report on compliance with the security policy of the
organization. Security teams may also enforce compliance with these policies.
Organizations of all sizes will have security compliance requirements. Industry, government, and internal corporate
security policies all need to be audited and enforced. Policy monitoring is critical to check that initial configurations
are correct and that the environment remains compliant over time.
In Azure, you can take advantage of Azure Policy to create and manage policies that enforce compliance. Like Azure
Blueprints, Azure Policies are built on the underlying Azure Resource Manager capabilities in the Azure platform
(and Azure Policy can also be assigned via Azure Blueprints).
For more information on how to do this in Azure, please review Tutorial: Create and manage policies to
enforce compliance.
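For reporting, the security team can also pull the latest compliance records directly from the Policy Insights API. The sketch below is illustrative only: the subscription ID is a placeholder, the api-version is one published version of the operation, and the response field names should be treated as assumptions to confirm against the API reference.

```python
# Minimal sketch: query the latest policy compliance states for a subscription and
# print the non-compliant resources, as a basis for a simple compliance report.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
    "/providers/Microsoft.PolicyInsights/policyStates/latest/queryResults"
    "?api-version=2019-10-01"           # assumed api-version
)
resp = requests.post(url, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()

non_compliant = [
    r for r in resp.json().get("value", [])
    if r.get("complianceState") == "NonCompliant"
]
for record in non_compliant:
    print(record.get("policyDefinitionName"), "->", record.get("resourceId"))
print(f"{len(non_compliant)} non-compliant policy state records")
```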

Monitor identity risk


Monitor identity-related risk events for warning on potentially compromised identities and remediate those risks.
Most security incidents take place after an attacker initially gains access using a stolen identity. These identities can
often start with low privileges, but the attackers then use that identity to traverse laterally and gain access to more
privileged identities. This repeats as needed until the attacker controls access to the ultimate target data or systems.
Azure Active Directory uses adaptive machine learning algorithms, heuristics, and known compromised credentials
(username/password pairs) to detect suspicious actions that are related to your user accounts. These
username/password pairs come from monitoring public and dark web sites (where attackers often dump
compromised passwords) and by working with security researchers, law enforcement, Security teams at Microsoft,
and others.
There are two places where you review reported risk events:
Azure AD reporting - Risk events are part of Azure AD's security reports. For more information, see
the users at risk security report and the risky sign-ins security report.
Azure AD Identity Protection - Risk events are also part of the reporting capabilities of Azure Active
Directory Identity Protection.
In addition, you can use the Identity Protection risk events API to gain programmatic access to security detections
using Microsoft Graph.
Remediate these risks by manually addressing each reported account or by setting up a user risk policy to require a
password change for these high risk events.
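For programmatic triage, risk detections can be read through Microsoft Graph. The following sketch assumes an identity that holds a suitable Graph permission (for example IdentityRiskEvent.Read.All) and an Azure AD Premium P2 license in the tenant; the property names printed are assumptions to verify against the Graph schema.

```python
# Minimal sketch: pull recent Identity Protection risk detections from Microsoft
# Graph so they can be reviewed or forwarded to a SIEM.
import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token("https://graph.microsoft.com/.default").token
url = "https://graph.microsoft.com/v1.0/identityProtection/riskDetections?$top=50"

resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
for detection in resp.json().get("value", []):
    print(
        detection.get("detectedDateTime"),
        detection.get("riskEventType"),
        detection.get("riskLevel"),
        detection.get("userPrincipalName"),
    )
```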

Penetration testing
Use Penetration Testing to validate security defenses.
Real-world validation of security defenses is critical to validate your defense strategy and implementation. This can
be accomplished by a penetration test (which simulates a one-time attack) or a red team program (which simulates a
persistent threat actor targeting your environment).
Follow the guidance published by Microsoft for planning and executing simulated attacks.

Discover & replace insecure protocols


Discover and disable the use of legacy insecure protocols: SMBv1, LM/NTLMv1, wDigest, unsigned LDAP binds, and
weak ciphers in Kerberos.
Authentication protocols are a critical foundation of nearly all security assurances. These older versions can be
exploited by attackers with access to your network and are often used extensively on legacy systems on
Infrastructure as a Service (IaaS).
Here are ways to reduce your risk:
Discover protocol usage by reviewing logs with Azure Sentinel’s Insecure Protocol Dashboard or third-party
tools.
Restrict or disable the use of these protocols by following guidance for SMB, NTLM, and WDigest.
We recommend implementing changes using a pilot or other testing method to mitigate the risk of operational
interruption.
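If your Windows security events are already collected into a Log Analytics workspace (as Azure Sentinel does), a query like the one below can surface NTLMv1 logons as one discovery signal. This is a sketch under assumptions: the workspace ID is a placeholder, and the SecurityEvent table and LmPackageName column depend on the data your environment collects, so verify them before relying on the results.

```python
# Minimal sketch: run a KQL query against the Log Analytics query API to find
# computers and accounts still authenticating with NTLMv1.
import requests
from azure.identity import DefaultAzureCredential

WORKSPACE_ID = "<log-analytics-workspace-id>"   # placeholder

KQL = """
SecurityEvent
| where EventID == 4624 and LmPackageName == "NTLM V1"
| summarize Logons = count() by Computer, Account
| order by Logons desc
"""

token = DefaultAzureCredential().get_token("https://api.loganalytics.io/.default").token
resp = requests.post(
    f"https://api.loganalytics.io/v1/workspaces/{WORKSPACE_ID}/query",
    json={"query": KQL},
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
table = resp.json()["tables"][0]
for row in table["rows"]:
    print(dict(zip([c["name"] for c in table["columns"]], row)))
```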

Elevated security capabilities


Consider whether to utilize specialized security capabilities in your enterprise architecture.
These measures have the potential to enhance security and meet regulatory requirements, but can introduce
complexity that may negatively impact your operations and efficiency.
We recommend careful consideration and judicious use of these security measures as required:
Dedicated Hardware Security Modules (HSMs) – may help meet regulatory or security requirements.
Confidential Computing – may help meet regulatory or security requirements.
Monitor tools

Security provides confidentiality, integrity, and availability assurances against deliberate attacks and abuse of your
valuable data and systems. Losing these assurances can negatively impact your business operations and revenue,
as well as your organization’s reputation in the marketplace.
Use this article to learn how to strengthen your security posture using the following monitor tools:
Azure Security Center - Prevent, detect, and respond to threats with increased visibility and control over the
security of your resources.
Azure Sentinel - Receive intelligent security analytics and threat intelligence across the enterprise.
Azure DDoS Protection - Defend against DDoS attacks.
Azure Rights Management (RMS) - Protect files and emails across multiple devices.
Azure Information Protection - Secure email, documents, and sensitive data that you share outside your
company.

Prevent, detect, and respond to threats


If your business has hybrid workloads, you may have a weak security posture. For example, if you are
deploying new resources across workloads, they may not be configured according to security best practices. You
may also need to continuously monitor the security status of the network in order to block unwanted connections
that could potentially make it easier for an attacker to creep along your network.
Azure Security Center strengthens the security posture of your data centers, and provides advanced threat
protection across your hybrid workloads in the cloud (whether they're in Azure or not) as well as on-premises.
For information on the Azure Security Center tools, see Strengthen security posture.
For frequently asked questions on Azure Security Center, see FAQ - General Questions.

Detect threats early


Your business may be experiencing increasingly sophisticated attacks, increasing volumes of alerts, and long
resolution timeframes.
To combat these issues, Azure Sentinel uses intelligent security analytics and threat intelligence to provide a single
solution for alert detection, threat visibility, proactive hunting, and threat response.
Azure Sentinel's "bird's-eye view" across the enterprise allows you to:
Collect data at cloud scale.
Detect previously undetected threats and minimize false positives.
Investigate threats with artificial intelligence and hunt for suspicious activities at scale.
Respond to incidents rapidly.
For information on the Azure Sentinel tools that will help to meet these requirements, see What is Azure Sentinel?

Protect resources against DDoS attacks


A Distributed Denial of Service (DDoS) attack attempts to exhaust an application's resources, making the
application unavailable to legitimate users. DDoS attacks can be targeted at any endpoint that is publicly reachable
through the internet.
Azure DDoS Protection, combined with application design best practices, provides defense against DDoS attacks.
The service tier that is used (Basic or Standard) determines the available features.
DDoS attack protection features include:
Always-on traffic monitoring which provides near real-time detection of a DDoS attack.
Automatic configuration and tuning of your DDoS Protection settings using intelligent traffic-profiling.
Detailed reports generated in five-minute increments during an attack, and a complete summary after the
attack ends (Standard service tier only).
For types of DDoS attacks that DDoS Protection Standard mitigates as well as more features, see Azure DDoS
Protection Standard overview.

Protect files and emails across multiple devices


Your business may encounter challenges with protecting documents and emails. For example, file protection,
collaboration, and sharing may be issues. You also might be experiencing problems regarding platform support or
infrastructure.
Azure Rights Management (RMS) is a cloud-based protection service that uses encryption, identity, and
authorization policies to help secure files and emails across multiple devices, including phones, tablets, and PCs.
To learn more about how RMS can address these issues, see Business problems solved by Azure Rights
Management.

Classify and protect documents and emails


Organizations that are weak on data classification and file protection might be more susceptible to data leakage or
data misuse.
Azure Information Protection (AIP) is a cloud-based solution that enables organizations to classify and protect
documents and emails by applying labels.
The data classification process categorizes data by sensitivity and business impact in order to identify risks. When
data is classified, you can manage it in ways that protect sensitive or important data from theft or loss.
With proper file protection, you can analyze data flows to gain insight into your business, detect risky behaviors
and take corrective measures, track access to documents, and more. The protection technology in AIP uses
encryption, identity, and authorization policies. Protection stays with the documents and emails, independently of
the location, regardless of whether they are inside or outside your organization, networks, file servers, and
applications.
Network monitoring

Monitor all communications between segments to detect potential security threats flowing over the wire. One way
is by integrating logs and analyzing the data to identify anomalies. Based on those insights, you can choose to set
alerts or block traffic crossing segmentation boundaries.

Enable network visibility


How do you monitor and diagnose conditions of the network?

Enable logs (including raw traffic) from your network devices.


Integrate network logs into a security information and event management (SIEM) service, such as Azure Sentinel.
Other popular choices include Splunk, QRadar, or ArcSight ESM.
Use machine learning analytics platforms that support ingestion of large amounts of information and can analyze
large datasets quickly. Also, these solutions can be tuned to significantly reduce the false positive alerts.
Here are some ways to integrate network logs:
Security group logs – flow logs and diagnostic logs (a parsing sketch follows this list)
Web application firewall logs
Virtual network taps
Azure Network Watcher
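As a starting point for the first item, NSG flow logs land in a storage account as JSON blobs that you can parse for anomalies such as denied cross-segment traffic. The sketch below assumes a flow log blob (for example a PT1H.json file) has already been downloaded locally; the tuple field positions follow the published flow log schema but should be verified against the log version you actually collect.

```python
# Minimal sketch: print denied flows from a downloaded NSG flow log blob so they
# can be reviewed against your segmentation boundaries.
import json

DENIED = "D"   # the decision field in a flow tuple is "A" (allow) or "D" (deny)

with open("PT1H.json") as f:
    data = json.load(f)

for record in data.get("records", []):
    for rule in record.get("properties", {}).get("flows", []):
        for flow in rule.get("flows", []):
            for tuple_str in flow.get("flowTuples", []):
                fields = tuple_str.split(",")
                ts, src_ip, dst_ip, src_port, dst_port = fields[0:5]
                protocol, direction, decision = fields[5:8]
                if decision == DENIED:
                    print(f"{ts}: {src_ip}:{src_port} -> {dst_ip}:{dst_port} "
                          f"({protocol}/{direction}) denied by {rule.get('rule')}")
```

In practice you would feed the same records into your SIEM or analytics platform rather than a script, but the structure of the data is the same.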

Proactive monitoring
How do you gain access to real-time performance information at the packet level?

Take advantage of packet capture to set alerts and gain access to real-time performance information at the packet
level.
Packet capture tracks traffic in and out of virtual machines. It gives you the capability to run proactive captures
based on defined network anomalies including information about network intrusions.
For an example, see Scenario: Get alerts when VM is sending you more TCP segments than usual.
Monitor identity risk

Monitor identity-related risk events on potentially compromised identities and remediate those risks.
Suppose an attacker gains access using a stolen identity. Even though the identity has low privileges, the attacker
can use it to traverse laterally and gain access to more privileged identities. This way the attacker can control access
to the target data or systems.
How do you actively monitor for suspicious activities?

Azure Active Directory (Azure AD) uses adaptive machine learning algorithms, heuristics, and known compromised
credentials (username/password pairs) to detect suspicious actions that are related to your user accounts. These
username/password pairs come from monitoring public and dark web sites and by working with security researchers,
law enforcement, security teams at Microsoft, and others.
Review the reported risk events by using:
Azure AD reporting. For information, see users at risk security report and the risky sign-ins security report.
Azure Active Directory Identity Protection.
Also, you can use the Identity Protection risk events API to gain programmatic access to security detections by
using Microsoft Graph.
Remediate risks by manually addressing each reported account or by setting up a user risk policy to require a
password change for high risk events.
Governance, risk, and compliance

Organizations of all sizes are constrained by their available resources: financial, people, and time. To achieve an
effective return on investment (ROI), organizations must prioritize where they will invest. Implementation of security
across the organization is also constrained by this, so to achieve an appropriate ROI on security the organization
needs to first understand and define its security priorities.
Governance: How is the organization’s security going to be monitored, audited, and reported? Design and
implementation of security controls within an organization is only the beginning of the story. How does the
organization know that things are actually working? Are they improving? Are there new requirements? Is there
mandatory reporting? Similar to compliance, there may be external industry, government, or regulatory standards
that need to be considered.
Risk: What types of risks does the organization face while trying to protect identifiable information, intellectual
property (IP), and financial information? Who may be interested in or could use this information if stolen, including
external and internal threats, whether unintentional or malicious? A commonly forgotten but extremely important
consideration within risk is addressing disaster recovery and business continuity.
Compliance: Are there specific industry, government, or regulatory requirements that dictate or provide
recommendations on the criteria that your organization’s security controls must meet? Examples of such standards,
organizations, controls, and legislation are ISO 27001, NIST, and PCI-DSS.
The collective role of organization(s) is to manage the security standards of the organization through their lifecycle:
Define - Set organizational standards and policies for practices, technologies, and configurations based on
internal factors (organizational culture, risk appetite, asset valuation, business initiatives, etc.) and external
factors (benchmarks, regulatory standards, threat environment, and more)
Improve – Continually push these standards incrementally forward towards the ideal state to ensure
continual risk reduction.
Sustain – Ensure the security posture doesn’t degrade naturally over time by instituting auditing and
monitoring compliance with organizational standards.

Prioritize security best practices investments


Security best practices are ideally applied proactively and completely to all systems as you build your cloud
program, but this isn’t reality for most enterprise organizations. Business goals, project constraints, and other
factors often cause organizations to balance security risk against other risks and apply a subset of best practices at
any given point.
We recommend applying as many of the best practices as early as possible, and then working to retrofit any
gaps over time as you mature your security program. We recommend evaluating the following considerations
when prioritizing which to follow first:
High business impact and highly exposed systems: These include systems with direct intrinsic value as
well as the systems that provide attackers a path to them. For more information, see Identify and classify
business critical applications.
Easiest to implement mitigations: Identify quick wins by prioritizing the best practices that your
organization can execute quickly because you already have the required skills, tools, and knowledge to do so
(for example, implementing a Web App Firewall (WAF) to protect a legacy application). Be careful not to
exclusively use (or overuse) this short-term prioritization method. Doing so can increase your risk by
preventing your program from growing and leaving critical risks exposed for extended periods.
Microsoft has provided some prioritized lists of security initiatives to help organizations start with these decisions
based on our experience with threats and mitigation initiatives in our own environments and across our customers.
See Module 4a of the Microsoft CISO Workshop.

Operationalize Azure Secure Score


Assign stakeholders to use Secure Score in Azure Security Center to monitor risk profile and continuously improve
security posture. This will help you continuously improve your security posture by rapidly identifying and
remediating common security hygiene risks, which can significantly reduce your organization's risk. Set up a
regular cadence (typically monthly) to review Azure Secure Score and plan initiatives with specific improvement
goals. You should also look into gamifying the activity, if possible, to increase engagement and focus from the
responsible teams (for example, by creating a friendly competition on who increased their score more).
NOTE: Security posture management is an emerging function. It may be handled by a single dedicated team, or
the responsibility may be taken on by one or more of the existing teams described below.
Monitor Secure Score
SCORE AREA – RESPONSIBLE TEAM
Monitor Secure Score and Follow up with technical teams: Security Posture Management Team / Vulnerability
Management (or Governance/Risk/Compliance team); Architecture Team; Responsible Technical Team (listed below)

Improve Secure Score


COMPUTE AND APPS RESOURCES – RESPONSIBLE TECHNICAL TEAM
App Services: Application Development/Security Team(s)
Containers: Application Development and/or Infrastructure/IT Operations
VMs/Scale sets/compute: IT/Infrastructure Operations

DATA & STORAGE RESOURCES – RESPONSIBLE TECHNICAL TEAM
SQL/Redis/Data Lake Analytics/Data Lake Store: Database Team
Storage Accounts: Storage/Infrastructure Team

IDENTITY AND ACCESS RESOURCES – RESPONSIBLE TECHNICAL TEAM
Subscriptions: Identity Team(s)
Key Vault: Information/Data Security Team

NETWORKING RESOURCES – RESPONSIBLE TECHNICAL TEAM
Networking Resources: Networking Team, Network Security Team

IOT SECURITY – RESPONSIBLE TECHNICAL TEAM
IoT Resources: IoT Operations Team

NOTE: In the DevOps model, some application teams may be responsible for their own application
resources.

The Azure Secure Score sample shows how to get your Azure Secure Score for a subscription by calling the
Azure Security Center REST API. The API methods provide the flexibility to query the data and build your own
reporting mechanism of your secure scores over time.
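As a minimal sketch (assuming the azure-identity and requests Python packages and the Microsoft.Security/secureScores endpoint with the 2020-01-01 API version; verify both against the current REST reference), retrieving the score might look like this:

```python
# Minimal sketch: read a subscription's secure score from the Azure Security
# Center REST API and print it. Assumes azure-identity and requests are
# installed and the caller has at least Security Reader on the subscription.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
API_VERSION = "2020-01-01"              # assumed API version; verify in the REST docs

credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token

url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
    f"/providers/Microsoft.Security/secureScores?api-version={API_VERSION}"
)
response = requests.get(url, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()

for score in response.json().get("value", []):
    props = score["properties"]
    print(props["displayName"], props["score"]["current"], "of", props["score"]["max"])
```

Persisting the values from each run gives you the trend data needed for score reporting over time.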

Manage connected tenants


Ensure your security organization is aware of all enrollments and associated subscriptions connected to your
existing environment (via ExpressRoute or Site-to-Site VPN), and that it monitors them as part of the overall enterprise.
These Azure resources are part of your enterprise environment and security organizations require visibility into
them. Security organizations need this access to assess risk and to identify whether organizational policies and
applicable regulatory requirements are being followed.
Ensure all Azure environments that connect to your production environment/network apply your organization’s
policy and IT governance controls for security. You can discover existing connected tenants using a tool provided by
Microsoft. Guidance on permissions you may assign to security is in the Assign privileges for managing the
environment section.
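As a starting point for that inventory, the following hedged sketch (it assumes the azure-identity and requests packages and the Azure Resource Manager "Tenants - List" endpoint) enumerates the tenants visible to a given credential so they can be reconciled against the tenants your security organization already monitors:

```python
# Minimal sketch: enumerate the Azure AD tenants visible to this credential via
# the Azure Resource Manager "Tenants - List" endpoint, so the result can be
# compared against the tenants your security organization already monitors.
import requests
from azure.identity import DefaultAzureCredential

API_VERSION = "2020-01-01"   # assumed ARM API version for /tenants

credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token
headers = {"Authorization": f"Bearer {token}"}

url = f"https://management.azure.com/tenants?api-version={API_VERSION}"
while url:
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    body = response.json()
    for tenant in body.get("value", []):
        print(tenant["tenantId"], tenant.get("displayName", ""))
    url = body.get("nextLink")   # follow paging if more results exist
```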

Clear lines of responsibility


Designate the parties responsible for specific functions in Azure
Clearly documenting and sharing the contacts responsible for each of these functions will create consistency and
facilitate communication. Based on our experience with many cloud adoption projects, this will avoid confusion that
can lead to human and automation errors that create security risk.
Designate groups (or individual roles) that will be responsible for these key functions:

GROUP OR INDIVIDUAL ROLE – RESPONSIBILITY

Network Security: Typically existing network security team. Configuration and maintenance of Azure Firewall,
Network Virtual Appliances (and associated routing), WAFs, NSGs, ASGs, etc.

Network Management: Typically existing network operations team. Enterprise-wide virtual network and subnet
allocation.

Server Endpoint Security: Typically IT operations, security, or jointly. Monitor and remediate server security
(patching, configuration, endpoint security, etc.).

Incident Monitoring and Response: Typically security operations team. Investigate and remediate security
incidents in Security Information and Event Management (SIEM) or source console.

Policy Management: Typically GRC team + Architecture. Set direction for use of Role-Based Access Control (RBAC),
Azure Security Center, administrator protection strategy, and Azure Policy to govern Azure resources.

Identity Security and Standards: Typically Security Team + Identity Team jointly. Set direction for Azure AD
directories, PIM/PAM usage, MFA, password/synchronization configuration, and Application Identity Standards.

Enterprise segmentation strategy


Identify groups of resources that can be isolated from other parts of the enterprise to contain (and detect)
adversary movement within your enterprise. This unified enterprise segmentation strategy will guide all technical
teams to consistently segment access using networking, applications, identity, and any other access controls.
A clear and simple segmentation strategy helps contain risk while enabling productivity and business operations.
An enterprise segmentation strategy is defined at a higher level than a traditional “network segmentation” security
strategy. Traditional segmentation approaches for on-premises environments frequently failed to achieve their goals
because they were developed “bottom-up” by different technical teams and were not aligned well with business use
cases and application workloads. This resulted in overwhelming complexity that generates support issues and often
undermines the original purpose with broad network firewall exceptions.
Creating a unified enterprise segmentation strategy that is built around business risks and needs guides all
technical team stakeholders (IT, Security, Applications, etc.) and business units, and increases alignment with,
understanding of, and support for the security containment promises. This clarity and alignment also reduces the
risk of human errors and automation failures that can lead to security vulnerabilities, operational downtime, or both.
While network micro-segmentation also offers promise to reduce risk (discussed more in the Network Security and
Containment section), it doesn’t eliminate the need to align technical teams. Micro-segmentation should be
considered only after technical teams are aligned, so you can avoid a recurrence of the internal conflicts and
confusion that plagued previous generations of on-premises network segmentation strategies.
Here are Microsoft's recommendations for prioritizing initiatives on containment and segmentation (based on Zero
Trust principles). These recommendations are listed in priority order, starting with the most important.
Ensure alignment of technical teams to a single enterprise segmentation strategy.
Invest in broadening containment by establishing a modern perimeter based on zero trust principles focused
on identity, device, applications, and other signals (to overcome limitation of network controls to protect new
resources and attack types).
Bolster network controls for legacy applications by exploring micro-segmentation strategies.
A good enterprise segmentation strategy meets these criteria:
Enables Operations – Minimizes operation friction by aligning to business practices and applications
Contains Risk - Adds cost and friction to attackers by
Isolating sensitive workloads from compromise of other assets
Isolating high exposure systems from being used as a pivot to other systems
Monitored – Security Operations should monitor for potential violations of the integrity of the segments
(account usage, unexpected traffic, etc.)

Security team visibility


Provide security teams read-only access to the security aspects of all technical resources in their purview.
Security organizations require visibility into the technical environment to perform their duties of assessing and
reporting on organizational risk. Without this visibility, security will have to rely on information provided by the
groups operating the environment, who have a potential conflict of interest (and other priorities).
Note that security teams may separately be granted additional privileges if they have operational responsibilities or
a requirement to enforce compliance on Azure resources.
For example, in Azure, assign security teams the Security Reader permission, which provides access to measure
security risk (without providing access to the data itself).
For enterprise security groups with broad responsibility for security of Azure, you can assign this permission using:
Root management group – for teams responsible for assessing and reporting risk on all resources
Segment management group(s) – for teams with limited scope of responsibility (typically required because
of organizational boundaries or regulatory requirements)
Because security will have broad access to the environment (and visibility into potentially exploitable
vulnerabilities), you should consider them critical impact accounts and apply the same protections as
administrators. The Administration section details these controls for Azure.
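The following is a hedged sketch of granting a security team's Azure AD group the built-in Security Reader role at a management group scope through the role assignments REST API; the management group ID, group object ID, role definition GUID, and API version shown are assumptions to verify against your environment:

```python
# Minimal sketch: assign the built-in "Security Reader" role to a security
# team's Azure AD group at a management group scope.
# Placeholders/assumptions: management group ID, group object ID, the role
# definition GUID, and the API version; verify these for your environment.
import uuid
import requests
from azure.identity import DefaultAzureCredential

MGMT_GROUP_ID = "<root-or-segment-mg-id>"            # placeholder
SECURITY_GROUP_OBJECT_ID = "<aad-group-object-id>"   # placeholder
SECURITY_READER_ROLE_ID = "39bc4728-0917-49c7-9d2c-d95423bc2eb4"  # confirm for your cloud
API_VERSION = "2015-07-01"                           # assumed role assignments API version

credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token

scope = f"/providers/Microsoft.Management/managementGroups/{MGMT_GROUP_ID}"
url = (
    f"https://management.azure.com{scope}"
    f"/providers/Microsoft.Authorization/roleAssignments/{uuid.uuid4()}"
    f"?api-version={API_VERSION}"
)
body = {
    "properties": {
        "roleDefinitionId": f"/providers/Microsoft.Authorization/roleDefinitions/{SECURITY_READER_ROLE_ID}",
        "principalId": SECURITY_GROUP_OBJECT_ID,
    }
}
response = requests.put(url, json=body, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()
print("Created role assignment:", response.json()["name"])
```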

Assign privileges for managing the environment


Grant roles with operational responsibilities in Azure the appropriate permissions based on a clearly documented
strategy built from the principle of least privilege and your operational needs.
Providing clear guidance that follows a reference model will reduce risk because it provides clarity for the
technical teams implementing these permissions. This clarity makes it easier to detect and correct human
errors like overpermissioning, reducing your overall risk.
Microsoft recommends starting from these Microsoft reference models and adapting to your organization.

Core Services Reference Permissions


This segment hosts shared services utilized across the organization. These shared services typically include Active
Directory Domain Services, DNS/DHCP, and System Management Tools hosted on Azure Infrastructure as a Service
(IaaS) virtual machines.
Security Visibility across all resources – For security teams, grant read-only access to security attributes for all
technical environments. This access level is needed to assess risk factors, identify potential mitigations, and advise
organizational stakeholders who accept the risk. See Security Team Visibility for more details.
Policy management across some or all resources – To monitor and enforce compliance with external (or
internal) regulations, standards, and security policy, assign appropriate permission to those roles. The roles and
permissions you choose will depend on the organizational culture and expectations of the policy program. See
Microsoft Cloud Adoption Framework for Azure.
Central IT operations across all resources – Grant permissions to the central IT department (often the
infrastructure team) to create, modify, and delete resources like virtual machines and storage.
Central networking group across network resources – To ensure consistency and avoid technical conflicts,
assign network resource responsibilities to a single central networking organization. These resources should
include virtual networks, subnets, Network Security Groups (NSG), and the virtual machines hosting virtual
network appliances. See Centralize Network Management And Security for more details
Resource Role Permissions – For most core services, administrative privileges required to manage them are
granted via the application itself (Active Directory, DNS/DHCP, System Management Tools, etc.), so no additional
Azure resource permissions are required. If your organizational model requires these teams to manage their own
VMs, storage, or other Azure resources, you can assign these permissions to those roles.
Service admin (Break Glass Account) – Use the service admin role only for emergencies (and initial setup if
required). Do not use this role for daily tasks. See Emergency Access (‘Break Glass’ Accounts) for more details.
Segment reference permissions
This segment permission design provides consistency while allowing flexibility to accommodate the range of
organizational models from a single centralized IT group to mostly independent IT and DevOps teams.
Security visibility across all resources – For security teams, grant read-only access to security attributes for all
technical environments. This access level is needed to assess risk factors, identify potential mitigations, and advise
organizational stakeholders who accept the risk. See Security Team Visibility.
Policy management across some or all resources – To monitor and enforce compliance with external (or
internal) regulations, standards, and security policy assign appropriate permission to those roles. The roles and
permissions you choose will depend on the organizational culture and expectations of the policy program. See
Microsoft Cloud Adoption Framework for Azure.
IT Operations across all resources – Grant permission to create, modify, and delete resources. The purpose of
the segment (and resulting permissions) will depend on your organization structure.
Segments with resources managed by a centralized IT organization can grant the central IT department
(often the infrastructure team) permission to modify these resources.
Segments managed by independent business units or functions (such as a Human Resources IT Team) can
grant those teams permission to all resources in the segment.
Segments with autonomous DevOps teams don’t need to grant permissions across all resources because the
resource role (below) grants permissions to application teams. For emergencies, use the service admin
account (break-glass account).
Central networking group across network resources – To ensure consistency and avoid technical conflicts,
assign network resource responsibilities to a single central networking organization. These resources should
include virtual networks, subnets, Network Security Groups (NSG), and the virtual machines hosting virtual
network appliances. See Centralize Network Management And Security.
Resource Role Permissions – Segments with autonomous DevOps teams will manage the resources associated
with each application. The actual roles and their permissions depend on the application size and complexity, the
application team size and complexity, and the culture of the organization and application team.
Service Admin (Break Glass Account) – Use the service admin role only for emergencies (and initial setup if
required). Do not use this role for daily tasks. See Emergency Access (‘Break Glass’ Accounts) for more details.
Permission Guidance and Tips
To drive consistency and ensure application to future subscriptions, permissions should be assigned at the
management group for the segment rather than at individual subscriptions. See Avoid Granular and
Custom Permissions for more details.
You should first review the built-in roles to see if one is applicable before creating a custom role to grant the
appropriate permissions to VMs and other objects. See Use Built-in Roles for more details.
Security managers group membership may be appropriate for smaller teams/organizations where
security teams have extensive operational responsibilities.

Establish segmentation with management groups


Structure management groups into a simple design that guides the enterprise segmentation model.
Management groups offer the ability to consistently and efficiently manage resources (including multiple
subscriptions as needed). However, because of their flexibility, it's possible to create an overly complex design.
Complexity creates confusion and negatively impacts both operations and security (as illustrated by overly complex
Organizational Unit (OU) and Group Policy Object (GPO) designs for Active Directory).
Microsoft recommends aligning the top level of management groups (MGs) into a simple enterprise segmentation
strategy limited to 1 or 2 levels.
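As an illustration only (the Microsoft.Management REST endpoint, API version, and group names below are assumptions), a simple two-level hierarchy can be scripted so it stays consistent with the segmentation strategy:

```python
# Minimal sketch: create a flat, two-level management group hierarchy that
# mirrors a simple enterprise segmentation model. Group IDs, display names,
# and the API version are assumptions.
import requests
from azure.identity import DefaultAzureCredential

API_VERSION = "2020-05-01"   # assumed management groups API version
TOP_LEVEL = "contoso-platform"                                      # placeholder
SEGMENTS = ["core-services", "segment-finance", "segment-retail"]   # placeholders

credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token
headers = {"Authorization": f"Bearer {token}"}

def put_management_group(group_id, display_name, parent_id=None):
    """Create or update a management group, optionally under a parent group."""
    url = (
        f"https://management.azure.com/providers/Microsoft.Management"
        f"/managementGroups/{group_id}?api-version={API_VERSION}"
    )
    properties = {"displayName": display_name}
    if parent_id:
        properties["details"] = {
            "parent": {"id": f"/providers/Microsoft.Management/managementGroups/{parent_id}"}
        }
    requests.put(url, json={"properties": properties}, headers=headers).raise_for_status()

put_management_group(TOP_LEVEL, "Contoso Platform")          # single top level
for segment in SEGMENTS:                                     # one level of segments below it
    put_management_group(segment, segment.replace("-", " ").title(), TOP_LEVEL)
```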

Use root management group carefully


Use the Root Management Group (MG) for enterprise consistency, but test changes carefully to minimize risk of
operational disruption.
The root management group enables you to ensure consistency across the enterprise by applying policies,
permissions, and tags across all subscriptions. Care must be taken when planning and implementing assignments
to the root management group because this can affect every resource on Azure and potentially cause downtime or
other negative impacts on productivity in the event of errors or unanticipated effects.
Root management group guidance:
Plan Carefully - Select enterprise-wide elements for the root management group that have a clear
requirement to be applied across every resource and/or a low impact.
Good candidates include:
Regulator y requirements with clear business risk/impact (for example, restrictions related to data
sovereignty).
Near-zero potential negative impact on operations, such as policy with audit effect, tag
assignment, or RBAC permission assignments that have been carefully reviewed.
Test First - Carefully test all enterprise-wide changes on the root management group before applying them
(policy, tags, RBAC model, etc.), using a:
Test Lab - Representative lab tenant, or lab segment in a production tenant.
Production Pilot - Segment MG or designated subset of subscription(s) / MG.
Validate Changes – Confirm that the changes have the desired effect.

Virtual Machine (VM) security updates and strong passwords


Ensure policy and processes enable (and require) rapid application of security updates to virtual machines.
Attackers constantly scan public cloud IP ranges for open management ports and attempt “easy” attacks like
common passwords and unpatched vulnerabilities.
Enable Azure Security Center to identify missing security updates and apply them.
Use Local Admin Password Solution (LAPS) or a third-party Privileged Access Management solution to set strong
local admin passwords and provide just-in-time access to them.

Remove Virtual Machine (VM) direct internet connectivity


Make sure policies and processes require restricting and monitoring direct internet connectivity by virtual
machines.
For Azure, you can enforce these policies by:
Enterprise-wide prevention - Prevent inadvertent exposure by following the permissions and roles
described in the reference model.
Ensures that network traffic is routed through approved egress points by default.
Exceptions (such as adding a public IP address to a resource) must go through a centralized group
that evaluates exception requests and makes sure appropriate controls are applied.
Identify and remediate exposed virtual machines by using the Azure Security Center network visualization
to quickly identify internet exposed resources.
Restrict management ports (RDP, SSH) using Just in Time access in Azure Security Center.
One way of managing VMs in the virtual network is by using Azure Bastion. This service allows you to log into VMs
in the virtual network through SSH or remote desktop protocol (RDP) without exposing the VMs directly to the
internet. To see a reference architecture that uses Bastion, see Network DMZ between Azure and an on-premises
datacenter.
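One way to back this up with enforcement is an Azure Policy that denies creation of public IP addresses outside the centralized exception process. The sketch below is illustrative only; the policy name, scope, and API version are assumptions to validate against the Azure Policy documentation:

```python
# Minimal sketch: define and assign a custom Azure Policy that denies creation
# of public IP addresses at a management group scope, so direct internet
# exposure must go through the centralized exception process.
# The scope, policy names, and API version are assumptions.
import requests
from azure.identity import DefaultAzureCredential

API_VERSION = "2021-06-01"   # assumed Azure Policy API version
SCOPE = "/providers/Microsoft.Management/managementGroups/<segment-mg-id>"  # placeholder

credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token
headers = {"Authorization": f"Bearer {token}"}

# 1. Custom policy definition: deny Microsoft.Network/publicIPAddresses.
definition_url = (
    f"https://management.azure.com{SCOPE}/providers/Microsoft.Authorization"
    f"/policyDefinitions/deny-public-ip?api-version={API_VERSION}"
)
definition_body = {
    "properties": {
        "displayName": "Deny public IP addresses",
        "policyType": "Custom",
        "mode": "All",
        "policyRule": {
            "if": {"field": "type", "equals": "Microsoft.Network/publicIPAddresses"},
            "then": {"effect": "deny"},
        },
    }
}
requests.put(definition_url, json=definition_body, headers=headers).raise_for_status()

# 2. Assign the definition at the same scope.
assignment_url = (
    f"https://management.azure.com{SCOPE}/providers/Microsoft.Authorization"
    f"/policyAssignments/deny-public-ip?api-version={API_VERSION}"
)
assignment_body = {
    "properties": {
        "displayName": "Deny public IP addresses",
        "policyDefinitionId": f"{SCOPE}/providers/Microsoft.Authorization/policyDefinitions/deny-public-ip",
    }
}
requests.put(assignment_url, json=assignment_body, headers=headers).raise_for_status()
```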

Assign incident notification contact


Ensure a security contact receives Azure incident notifications from Microsoft, typically a notification that your
resource is compromised and/or attacking another customer.
This enables your security operations team to rapidly respond to potential security risks and remediate them.
Ensure administrator contact information in the Azure enrollment portal includes contact information that will
notify security operations (directly, or rapidly via an internal process).

Regularly review critical access


Regularly review roles that are assigned privileges with a business-critical impact.
Set up a recurring review pattern to ensure that accounts are removed from permissions as roles change. You can
conduct the review manually or through an automated process by using tools such as Azure AD access reviews.
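If you automate this check, a hedged sketch such as the following (assuming the Microsoft Graph v1.0 access reviews endpoint and a credential granted the appropriate access review read permission) can confirm that recurring review definitions actually exist:

```python
# Minimal sketch: list Azure AD access review schedule definitions via
# Microsoft Graph to confirm that recurring reviews exist for critical roles.
# The endpoint and required permissions are assumptions to verify in the Graph docs.
import requests
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
token = credential.get_token("https://graph.microsoft.com/.default").token

url = "https://graph.microsoft.com/v1.0/identityGovernance/accessReviews/definitions"
response = requests.get(url, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()

for definition in response.json().get("value", []):
    print(definition.get("displayName"), ":", definition.get("status"))
```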

Discover and remediate common risks


Identify well-known risks for your Azure tenants, remediate those risks, and track your progress using Secure Score.
Identifying and remediating common security hygiene risks significantly reduces overall risk to your organization
by increasing cost to attackers. When you remove cheap and well-established attack vectors, attackers are forced to
acquire and use advanced or untested attack methods.
Azure Secure Score in Azure Security Center monitors the security posture of machines, networks, storage and data
services, and applications to discover potential security issues (internet-connected VMs, missing security updates,
missing endpoint protection or encryption, deviations from baseline security configurations, missing Web
Application Firewall (WAF), and more). You should enable this capability (at no additional cost), review the findings,
and follow the included recommendations to plan and execute technical remediations starting with the highest
priority items.
As you address risks, track progress and prioritize ongoing investments in your governance and risk reduction
programs.
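To feed that tracking, the hedged sketch below (assuming the Microsoft.Security/assessments endpoint and its 2020-01-01 API version) lists the assessments that are currently unhealthy so they can be prioritized and routed to the responsible teams:

```python
# Minimal sketch: list Azure Security Center assessments that are currently
# "Unhealthy" for a subscription, as raw input for remediation planning.
# The endpoint and API version are assumptions to verify against the REST docs.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
API_VERSION = "2020-01-01"              # assumed API version

credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token
headers = {"Authorization": f"Bearer {token}"}

url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
    f"/providers/Microsoft.Security/assessments?api-version={API_VERSION}"
)
while url:
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    body = response.json()
    for assessment in body.get("value", []):
        props = assessment["properties"]
        if props.get("status", {}).get("code") == "Unhealthy":
            print(props.get("displayName"))
    url = body.get("nextLink")   # follow paging if more results exist
```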

Increase automation with Azure Blueprints


Use Azure’s native automation capabilities to increase consistency, compliance, and deployment speed for
workloads.
Automation of deployment and maintenance tasks reduces security and compliance risk by limiting opportunity to
introduce human errors during manual tasks. This will also allow both IT Operations teams and security teams to
shift their focus from repeated manual tasks to higher value tasks like enabling developers and business initiatives,
protecting information, and so on.
Utilize the Azure Blueprints service to rapidly and consistently deploy application environments that are compliant
with your organization’s policies and external regulations. Azure Blueprints automates deployment of
environments including RBAC roles, policies, resources (VM/Net/Storage/etc.), and more. Azure Blueprints builds on
Microsoft’s significant investment in Azure Resource Manager to standardize resource deployment in Azure
and enable resource deployment and governance based on a desired-state approach. You can use built-in
configurations in Azure Blueprints, make your own, or just use Resource Manager scripts for a smaller scope.
Several Security and Compliance Blueprints samples are available to use as a starting template.
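As a rough, hedged illustration (the Blueprints REST API shown was in preview at the time of writing, and the scope, blueprint name, and API version are assumptions), a blueprint definition can itself be created from code so that environment templates stay in source control:

```python
# Minimal sketch: create an (empty) blueprint definition at a management group
# scope as the starting point for adding policy, RBAC, and ARM template artifacts.
# The preview API version and the scope are assumptions.
import requests
from azure.identity import DefaultAzureCredential

API_VERSION = "2018-11-01-preview"   # Blueprints REST API was preview at the time of writing
SCOPE = "/providers/Microsoft.Management/managementGroups/<segment-mg-id>"  # placeholder

credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token

url = (
    f"https://management.azure.com{SCOPE}/providers/Microsoft.Blueprint"
    f"/blueprints/baseline-environment?api-version={API_VERSION}"
)
body = {
    "properties": {
        "targetScope": "subscription",
        "description": "Baseline environment: RBAC roles, policies, and core resources.",
    }
}
response = requests.put(url, json=body, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()
print("Blueprint definition created:", response.json()["name"])
```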

Evaluate security using benchmarks


Use an industry-standard benchmark to evaluate your organization’s current security posture.
Benchmarking allows you to improve your security program by learning from external organizations.
Benchmarking lets you know how your current security state compares to that of other organizations, providing
both external validation for successful elements of your current system as well as identifying gaps that serve as
opportunities to enrich your team’s overall security strategy. Even if your security program isn’t tied to a specific
benchmark or regulatory standard, you will benefit from understanding the documented ideal states by those
outside and inside of your industry.
As an example, the Center for Internet Security (CIS) has created security benchmarks for Azure that map to the
CIS Control Framework. Another reference example is the MITRE ATT&CK™ framework that defines the various
adversary tactics and techniques based on real-world observations. These external references and control mappings
help you to understand any gaps between your current strategy and what other experts inside and outside your
industry are doing.

Audit and enforce policy compliance


Ensure that the security team is auditing the environment to report on compliance with the security policy of the
organization. Security teams may also enforce compliance with these policies.
Organizations of all sizes will have security compliance requirements. Industry, government, and internal corporate
security policies all need to be audited and enforced. Policy monitoring is critical to check that initial configurations
are correct and that the environment continues to be compliant over time.
In Azure, you can take advantage of Azure Policy to create and manage policies that enforce compliance. Like Azure
Blueprints, Azure Policy is built on the underlying Azure Resource Manager capabilities in the Azure platform
(and Azure Policy can also be assigned via Azure Blueprints).
For more information on how to do this in Azure, review Tutorial: Create and manage policies to
enforce compliance.
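To support the audit side programmatically, the following hedged sketch (assuming the Microsoft.PolicyInsights policyStates endpoint and the 2019-10-01 API version) summarizes the non-compliant resources for a subscription:

```python
# Minimal sketch: query the latest Azure Policy compliance states for a
# subscription and print the non-compliant resources.
# The PolicyInsights endpoint and API version are assumptions to verify in the docs.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
API_VERSION = "2019-10-01"              # assumed PolicyInsights API version

credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token

url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
    f"/providers/Microsoft.PolicyInsights/policyStates/latest/queryResults"
)
params = {"api-version": API_VERSION, "$filter": "complianceState eq 'NonCompliant'"}
response = requests.post(url, params=params, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()

for state in response.json().get("value", []):
    print(state.get("resourceId"), state.get("policyAssignmentName"))
```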

Monitor identity Risk


Monitor identity-related risk events to get warnings about potentially compromised identities, and remediate those risks.
Most security incidents take place after an attacker initially gains access using a stolen identity. These identities can
often start with low privileges, but the attackers then use that identity to traverse laterally and gain access to more
privileged identities. This repeats as needed until the attacker controls access to the ultimate target data or systems.
Azure Active Directory uses adaptive machine learning algorithms, heuristics, and known compromised credentials
(username/password pairs) to detect suspicious actions that are related to your user accounts. These
username/password pairs come from monitoring public and dark web sites (where attackers often dump
compromised passwords) and by working with security researchers, law enforcement, Security teams at Microsoft,
and others.
There are two places where you review reported risk events:
Azure AD reporting - Risk events are part of Azure AD's security reports. For more information, see
the users at risk security report and the risky sign-ins security report.
Azure AD Identity Protection - Risk events are also part of the reporting capabilities of Azure Active
Directory Identity Protection.
In addition, you can use the Identity Protection risk events API to gain programmatic access to security detections
using Microsoft Graph.
Remediate these risks by manually addressing each reported account or by setting up a user risk policy to require a
password change for these high risk events.
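For programmatic review, here is a minimal sketch (assuming the Microsoft Graph v1.0 identityProtection/riskyUsers endpoint and a credential granted the corresponding Identity Protection read permission) that lists users currently flagged as at risk:

```python
# Minimal sketch: list users currently flagged as risky via Microsoft Graph.
# Assumes the v1.0 identityProtection/riskyUsers endpoint and a credential
# granted the corresponding Identity Protection read permission.
import requests
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
token = credential.get_token("https://graph.microsoft.com/.default").token
headers = {"Authorization": f"Bearer {token}"}

url = "https://graph.microsoft.com/v1.0/identityProtection/riskyUsers"
while url:
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    body = response.json()
    for user in body.get("value", []):
        print(user.get("userPrincipalName"), user.get("riskLevel"), user.get("riskState"))
    url = body.get("@odata.nextLink")   # follow paging if more results exist
```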

Penetration testing
Use Penetration Testing to validate security defenses.
Real-world validation of security defenses is critical to confirm your defense strategy and implementation. This can
be accomplished with a penetration test (which simulates a one-time attack) or a red team program (which simulates
a persistent threat actor targeting your environment).
Follow the guidance published by Microsoft for planning and executing simulated attacks.

Discover & replace insecure protocols


Discover and disable the use of legacy insecure protocols: SMBv1, LM/NTLMv1, WDigest, unsigned LDAP binds, and
weak ciphers in Kerberos.
Authentication protocols are a critical foundation of nearly all security assurances. These older versions can be
exploited by attackers with access to your network and are often used extensively on legacy systems on
Infrastructure as a Service (IaaS).
Here are ways to reduce your risk:
Discover protocol usage by reviewing logs with Azure Sentinel’s Insecure Protocol Dashboard or third-party
tools.
Restrict or Disable use of these protocols by following guidance for SMB, NTLM, and WDigest.
We recommend implementing changes using a pilot or other testing method to mitigate the risk of operational
interruption.
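For the discovery step, a hedged example follows; it assumes the azure-monitor-query package, a Log Analytics workspace that receives Windows SecurityEvent data, and a commonly used NTLMv1 detection query, so adjust the table and field names to your environment:

```python
# Minimal sketch: look for NTLMv1 logons in a Log Analytics workspace as one
# indicator of insecure protocol usage. The table name, field names, and query
# are assumptions based on the common Windows SecurityEvent schema.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-id>"   # placeholder

QUERY = """
SecurityEvent
| where EventID == 4624 and LmPackageName == "NTLM V1"
| summarize Logons = count() by Computer, Account
| order by Logons desc
"""

client = LogsQueryClient(DefaultAzureCredential())
result = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(days=7))

for table in result.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))
```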

Elevated security capabilities


Consider whether to utilize specialized security capabilities in your enterprise architecture.
These measures have the potential to enhance security and meet regulatory requirements, but can introduce
complexity that may negatively impact your operations and efficiency.
We recommend careful consideration and judicious use of these security measures as required:
Dedicated Hardware Security Modules (HSMs)
Dedicated Hardware Security Modules (HSMs) may help meet regulatory or security requirements.
Confidential Computing
Confidential Computing may help meet regulatory or security requirements.
Governance, risk, and compliance
11/13/2020 • 22 minutes to read • Edit Online

Organizations of all sizes are constrained by their available resources; financial, people, and time. To achieve an
effective return on investment (ROI) organizations must prioritize where they will invest. Implementation of
security across the organization is also constrained by this, so to achieve an appropriate ROI on security the
organization needs to first understand and define its security priorities.
Governance: How is the organization’s security going to be monitored, audited, and reported? Design and
implementation of security controls within an organization is only the beginning of the story. How does the
organization know that things are actually working? Are they improving? Are there new requirements? Is there
mandatory reporting? Similar to compliance there may be external industry, government or regulatory standards
that need to be considered.
Risk : What types of risks does the organization face while trying to protect identifiable information, Intellectual
Property (IP), financial information? Who may be interested or could use this information if stolen, including
external and internal threats as well as unintentional or malicious? A commonly forgotten but extremely important
consideration within risk is addressing Disaster Recovery and Business Continuity.
Compliance: Is there a specific industry, government, or regulatory requirements that dictate or provide
recommendation on criteria that your organization’s security controls must meet? Examples of such standards,
organizations, controls, and legislation are ISO27001, NIST, PCI-DSS.
The collective role of organization(s) is to manage the security standards of the organization through their
lifecycle:
Define - Set organizational standards and policies for practices, technologies, and configurations based on
internal factors (organizational culture, risk appetite, asset valuation, business initiatives, etc.) and external
factors (benchmarks, regulatory standards, threat environment, and more)
Improve – Continually push these standards incrementally forward towards the ideal state to ensure
continual risk reduction.
Sustain – Ensure the security posture doesn’t degrade naturally over time by instituting auditing and
monitoring compliance with organizational standards.

Prioritize security best practices investments


Security best practices are ideally applied proactively and completely to all systems as you build your cloud
program, but this isn’t reality for most enterprise organizations. Business goals, project constraints, and other
factors often cause organizations to balance security risk against other risks and apply a subset of best practices at
any given point.
We recommend applying as many as of the best practices as early as possible, and then working to retrofit any
gaps over time as you mature your security program. We recommend evaluating the following considerations
when prioritizing which to follow first:
High business impact and highly exposed systems: These include systems with direct intrinsic value
as well as the systems that provide attackers a path to them. For more information, see Identify and classify
business critical applications.
Easiest to implement mitigations: Identify quick wins by prioritizing the best practices, which your
organization can execute quickly because you already have the required skills, tools, and knowledge to do it
(for example, implementing a Web App Firewall (WAF) to protect a legacy application). Be careful not to
exclusively use (or overuse) this short-term prioritization method. Doing so can increase your risk by
preventing your program from growing and leaving critical risks exposed for extended periods.
Microsoft has provided some prioritized lists of security initiatives to help organizations start with these decisions
based on our experience with threats and mitigation initiatives in our own environments and across our
customers. See Module 4a of the Microsoft CISO Workshop.

Operationalize Azure Secure Score


Assign stakeholders to use Secure Score in Azure Security Center to monitor risk profile and continuously improve
security posture. This will help you continuously improve your security posture by rapidly identifying and
remediating common security hygiene risks, which can significantly reduce your organization's risk. Set up a
regular cadence (typically monthly) to review Azure secure score and plan initiatives with specific improvement
goals. You should also look into gamifying the activity if possible to increase engagement and focus from the
responsible teams (e.g. creating a friendly competition on who increased their score more).
NOTE: Security posture management is an emerging function, this may be a single dedicated team or the
responsibility may be taken on by one or more existing teams described below.
Monitor Secure Score
SC O RE A REA RESP O N SIB L E T EA M

Monitor Secure Score and Follow up with technical teams Security Posture Management Team
Vulnerability Management (or Governance/Risk/Compliance
team)
Architecture Team
Responsible Technical Team (listed below)

Improve Secure Score


C O M P UT E A N D A P P S RESO URC ES RESP O N SIB L E T EC H N IC A L T EA M

App Services Application Development/Security Team(s)

Containers Application Development and/or Infrastructure/IT Operations

VMs/Scale sets/compute IT/Infrastructure Operations

DATA & STO RA GE RESO URC ES RESP O N SIB L E T EC H N IC A L T EA M

SQL/Redis/Data Lake Analytics/Data Lake Store Database Team

Storage Accounts Storage/Infrastructure Team

IDEN T IT Y A N D A C C ESS RESO URC ES RESP O N SIB L E T EC H N IC A L T EA M

Subscriptions Identity Team(s)

Key Vault Information/Data Security Team


N ET W O RK IN G RESO URC ES RESP O N SIB L E T EC H N IC A L T EA M

Networking Resources Networking Team


Network Security Team

IOT SEC URIT Y RESP O N SIB L E T EC H N IC A L T EA M

IoT Resources IoT Operations Team

NOTE: Note that in the DevOps model, some application teams may be responsible for their own application
resources.

The Azure Secure Score sample shows how to get your Azure Secure Score for a subscription by calling the
Azure Security Center REST API. The API methods provide the flexibility to query the data and build your own
reporting mechanism of your secure scores over time.

Manage connected tenants


Ensure your security organization is aware of all enrollments and associated subscriptions connected to your
existing environment (via ExpressRoute or Site-Site VPN) and monitoring as part of the overall enterprise.
These Azure resources are part of your enterprise environment and security organizations require visibility into
them. Security organizations need this access to assess risk and to identify whether organizational policies and
applicable regulatory requirements are being followed.
Ensure all Azure environments that connect to your production environment/network apply your organization’s
policy and IT governance controls for security. You can discover existing connected tenants using a tool provided
by Microsoft. Guidance on permissions you may assign to security is in the Assign privileges for managing the
environment section.

Clear lines of responsibility


Designate the parties responsible for specific functions in Azure
Clearly documenting and sharing the contacts responsible for each of these functions will create consistency and
facilitate communication. Based on our experience with many cloud adoption projects, this will avoid confusion
that can lead to human and automation errors that create security risk.
Designate groups (or individual roles) that will be responsible for these key functions:

GRO UP O R IN DIVIDUA L RO L E RESP O N SIB IL IT Y

Network Security Typically existing network security team. Configuration and


maintenance of Azure Firewall, Network Virtual Appliances
(and associated routing), WAFs, NSGs, ASGs, etc.

Network Management Typically existing network operations team. Enterprise-wide


virtual network and subnet allocation.

Ser ver Endpoint Security Typically IT operations, security, or jointly. Monitor and
remediate server security (patching, configuration, endpoint
security, etc.).
GRO UP O R IN DIVIDUA L RO L E RESP O N SIB IL IT Y

Incident Monitoring and Response Typically security operations team. Investigate and remediate
security incidents in Security Information and Event
Management (SIEM) or source console.

Policy Management Typically GRC team + Architecture. Set Direction for use of
Role-Based Access Control (RBAC), Azure Security Center,
Administrator protection strategy, and Azure Policy to govern
Azure resources.

Identity Security and Standards Typically Security Team + Identity Team jointly. Set direction
for Azure AD directories, PIM/PAM usage, MFA,
password/synchronization configuration, Application Identity
Standards.

Enterprise segmentation strategy


Identify groups of resources that can be isolated from other parts of the enterprise to contain (and detect)
adversary movement within your enterprise. This unified enterprise segmentation strategy will guide all technical
teams to consistently segment access using networking, applications, identity, and any other access controls.
A clear and simple segmentation strategy helps contain risk while enabling productivity and business operations.
An enterprise segmentation strategy is defined higher than a traditional “network segmentation” security strategy.
Traditional segmentation approaches for on premises environments frequently failed to achieve their goals
because they were developed “bottom-up” by different technical teams and were not aligned well with business
use cases and application workloads. This resulted in overwhelming complexity that generates support issues and
often undermines the original purpose with broad network firewall exceptions.
Creating a unified enterprise segmentation strategy enables to guide all technical teams stakeholders (IT, Security,
Applications, etc.) Business Units that is built around the business risks and needs will increase alignment to and
understand and support sustainability of the security containment promises. This clarity and alignment will also
reduce s the risk of human errors and automation failures that can lead to security vulnerabilities, operational
downtime, or both.
While network micro-segmentation also offers promise to reduce risk (discussed more in Network Security and
Containment section), it doesn’t eliminate the need to align technical teams. Micro segmentation should be
considered after to and plans to ensure the ensuring technical teams are aligned so you can avoid a recurrence of
the internal conflicts that plagued and confusion of the on-premises network generation segmentation strategies.
Here are Microsoft's recommendations for prioritizing initiatives on containment and segmentation (based on
Zero Trust principles). These recommendations are listed in priority order by highest importance.
Ensure alignment of technical teams to a single enterprise segmentation strategy.
Invest in broadening containment by establishing a modern perimeter based on zero trust principles
focused on identity, device, applications, and other signals (to overcome limitation of network controls to
protect new resources and attack types).
Bolster network controls for legacy applications by exploring micro segmentation strategies.
A good enterprise segmentation strategy meets these criteria:
Enables Operations – Minimizes operation friction by aligning to business practices and applications
Contains Risk - Adds cost and friction to attackers by
Isolating sensitive workloads from compromise of other assets
Isolating high exposure systems from being used as a pivot to other systems
Monitored – Security Operations should monitor for potential violations of the integrity of the segments
(account usage, unexpected traffic, etc.)

Security team visibility


Provide security teams read-only access to the security aspects of all technical resources in their purview
Security organizations require visibility into the technical environment to perform their duties of assessing and
reporting on organizational risk. Without this visibility, security will have to rely on information provided from
groups, operating the environment, who have a potential conflict of interest (and other priorities).
Note that security teams may separately be granted additional privileges if they have operational responsibilities
or a requirement to enforce compliance on Azure resources.
For example in Azure, assign security teams to the Security Readers permission that provides access to measure
security risk (without providing access to the data itself)
For enterprise security groups with broad responsibility for security of Azure, you can assign this permission
using:
Root management group – for teams responsible for assessing and reporting risk on all resources
Segment management group(s) – for teams with limited scope of responsibility (typically required because
of organizational boundaries or regulatory requirements)
Because security will have broad access to the environment (and visibility into potentially exploitable
vulnerabilities), you should consider them critical impact accounts and apply the same protections as
administrators. The Administration section details these controls for Azure.

Assign privileges for managing the environment


Grant roles with operational responsibilities in Azure the appropriate permissions based on a clearly documented
strategy built from the principle of least privilege and your operational needs.
Providing clear guidance that follows a reference model will reduce risk because by increasing it provides clarity
for your technical teams implementing these permissions. This clarity makes it easier to detect and correct human
errors like overpermissioning, reducing your overall risk.
Microsoft recommends starting from these Microsoft reference models and adapting to your organization.

Core Services Reference Permissions


This segment hosts shared services utilized across the organization. These shared services typically include Active
Directory Domain Services, DNS/DHCP, System Management Tools hosted on Azure Infrastructure as a Service
(IaaS) virtual machines.
Security Visibility across all resources – For security teams, grant read-only access to security attributes for
all technical environments. This access level is needed to assess risk factors, identify potential mitigations, and
advise organizational stakeholders who accept the risk. See Security Team Visibility for more details.
Policy management across some or all resources – To monitor and enforce compliance with external (or
internal) regulations, standards, and security policy, assign appropriate permission to those roles. The roles and
permissions you choose will depend on the organizational culture and expectations of the policy program. See
Microsoft Cloud Adoption Framework for Azure.
Central IT operations across all resources – Grant permissions to the central IT department (often the
infrastructure team) to create, modify, and delete resources like virtual machines and storage.
Central networking group across network resources – To ensure consistency and avoid technical conflicts,
assign network resource responsibilities to a single central networking organization. These resources should
include virtual networks, subnets, Network Security Groups (NSG), and the virtual machines hosting virtual
network appliances. See Centralize Network Management And Security for more details
Resource Role Permissions – For most core services, administrative privileges required to manage them are
granted via the application itself (Active Directory, DNS/DHCP, System Management Tools, etc.), so no additional
Azure resource permissions are required. If your organizational model requires these teams to manage their own
VMs, storage, or other Azure resources, you can assign these permissions to those roles.
Ser vice admin (Break Glass Account) – Use the service admin role only for emergencies (and initial setup if
required). Do not use this role for daily tasks. See Emergency Access (‘Break Glass’ Accounts) for more details.
Segment reference permissions
This segment permission design provides consistency while allowing flexibility to accommodate the range of
organizational models from a single centralized IT group to mostly independent IT and DevOps teams.
Security visibility across all resources – For security teams, grant read-only access to security attributes for
all technical environments. This access level is needed to assess risk factors, identify potential mitigations, and
advise organizational stakeholders who accept the risk. See Security Team Visibility.
Policy management across some or all resources – To monitor and enforce compliance with external (or
internal) regulations, standards, and security policy assign appropriate permission to those roles. The roles and
permissions you choose will depend on the organizational culture and expectations of the policy program. See
Microsoft Cloud Adoption Framework for Azure.
IT Operations across all resources – Grant permission to create, modify, and delete resources. The purpose of
the segment (and resulting permissions) will depend on your organization structure.
Segments with resources managed by a centralized IT organization can grant the central IT department
(often the infrastructure team) permission to modify these resources.
Segments managed by independent business units or functions (such as a Human Resources IT Team) can
grant those teams permission to all resources in the segment.
Segments with autonomous DevOps teams don’t need to grant permissions across all resources because
the resource role (below) grants permissions to application teams. For emergencies, use the service admin
account (break-glass account).
Central networking group across network resources – To ensure consistency and avoid technical conflicts,
assign network resource responsibilities to a single central networking organization. These resources should
include virtual networks, subnets, Network Security Groups (NSG), and the virtual machines hosting virtual
network appliances. See Centralize Network Management And Security.
Resource Role Permissions – Segments with autonomous DevOps teams will manage the resources associated
with each application. The actual roles and their permissions depend on the application size and complexity, the
application team size and complexity, and the culture of the organization and application team.
Ser vice Admin (Break Glass Account) – Use the service admin role only for emergencies (and initial setup if
required). Do not use this role for daily tasks. See Emergency Access (‘Break Glass’ Accounts) for more details.
Permission Guidance and Tips
To drive consistency and ensure application to future subscriptions, permissions should be assigned at
management group for the segment rather than the individual subscriptions. See Avoid Granular and
Custom Permissions for more details.
You should first review the built-in roles to see if one is applicable before creating a custom role to grant the
appropriate permissions to VMs and other objects. See Use Built in Roles for more details
Security managers group membership may be appropriate for smaller teams/organizations where
security teams have extensive operational responsibilities.

Establish segmentation with management groups


Structure management groups into a simple design that guides the enterprise segmentation model.
Management groups offer the ability to consistently and efficiently manage resources (including multiple
subscriptions as needed). However, because of their flexibility, it's possible to create an overly complex design.
Complexity creates confusion and negatively impacts both operations and security (as illustrated by overly
complex Organizational Unit (OU) and Group Policy Object (GPO) designs for Active Directory).
Microsoft recommends aligning the top level of management groups (MGs) into a simple enterprise segmentation
strategy limited to 1 or 2 levels.

Use root management group carefully


Use the Root Management Group (MG) for enterprise consistency, but test changes carefully to minimize risk of
operational disruption.
The root management group enables you to ensure consistency across the enterprise by applying policies,
permissions, and tags across all subscriptions. Care must be taken when planning and implementing assignments
to the root management group because this can affect every resource on Azure and potentially cause downtime or
other negative impacts on productivity in the event of errors or unanticipated effects.
Root management group guidance:
Plan Carefully - Select enterprise-wide elements to the root management group that have a clear
requirement to be applied across every resource and/or low impact.
Good candidates include:
Regulator y requirements with clear business risk/impact (for example, restrictions related to data
sovereignty).
Near-zero potential negative impact on operations such as policy with audit effect, Tag
assignment, RBAC permissions assignments that have been carefully reviewed.
Test First - Carefully test all enterprise-wide changes on the root management group before applying
(policy, tags, RBAC model, etc.) using a
Test Lab - Representative lab tenant or lab segment in production tenant.
Production Pilot - Segment MG or Designated subset in subscription(s) / MG.
Validate Changes – to ensure they have the desired effect.

Virtual Machine (VM) security updates and strong passwords


Ensure policy and processes enable (and require) rapid application of security updates to virtual machines.
Attackers constantly scan public cloud IP ranges for open management ports and attempt “easy” attacks like
common passwords and unpatched vulnerabilities.
Enable Azure Security Center to identify missing security updates & apply them.
Local Admin Password Solution (LAPS) or a third-party Privileged Access Management can set strong local admin
passwords and just in time access to them.

Remove Virtual Machine (VM) direct internet connectivity


Make sure policies and processes require restricting and monitoring direct internet connectivity by virtual
machines.
For Azure, you can enforce policies by,
Enterprise-wide prevention - Prevent inadvertent exposure by following the permissions and roles
described in the reference model.
Ensures that network traffic is routed through approved egress points by default.
Exceptions (such as adding a public IP address to a resource) must go through a centralized group
that evaluates exception requests and makes sure appropriate controls are applied.
Identify and remediate exposed virtual machines by using the Azure Security Center network
visualization to quickly identify internet exposed resources.
Restrict management por ts (RDP, SSH) using Just in Time access in Azure Security Center.
One way of managing VMs in the virtual network is by using Azure Bastion. This service allows you to log into
VMs in the virtual network through SSH or remote desktop protocol (RDP) without exposing the VMs directly to
the internet. To see a reference architecture that uses Bastion, see Network DMZ between Azure and an on-
premises datacenter.

Assign incident notification contact


Ensure a security contact receives Azure incident notifications from Microsoft typically a notification that your
resource is compromised and/or attacking another customer.
This enables your security operations team to rapidly respond to potential security risks and remediate them.
Ensure administrator contact information in the Azure enrollment portal includes contact information that will
notify security operations (directly or rapidly via an internal process)

Regularly review critical access


Regularly review roles that are assigned privileges with a business-critical impact.
Set up a recurring review pattern to ensure that accounts are removed from permissions as roles change. You can
conduct the review manually or through an automated process by using tools such as Azure AD access reviews.

Discover and remediate common risks


Identity well-known risks for your Azure tenants, remediate those risks, and track your progress using Secure
Score.
Identifying and remediating common security hygiene risks significantly reduces overall risk to your organization
by increasing cost to attackers. When you remove cheap and well-established attack vectors, attackers are forced
to acquire and use advanced or untested attack methods.
Azure Secure Score in Azure Security Center monitors the security posture of machines, networks, storage and
data services, and applications to discover potential security issues (internet connected VMs, or missing security
updates, missing endpoint protection or encryption, deviations from baseline security configurations, missing
Web Application Firewall (WAF), and more). You should enable this capability (no additional cost), review the
findings, and follow the included recommendations to plan and execute technical remediations starting with the
highest priority items.
As you address risks, track progress and prioritize ongoing investments in your governance and risk reduction
programs.

Increase automation with Azure Blueprints


Use Azure’s native automation capabilities to increase consistency, compliance, and deployment speed for
workloads.
Automation of deployment and maintenance tasks reduces security and compliance risk by limiting opportunity to
introduce human errors during manual tasks. This will also allow both IT Operations teams and security teams to
shift their focus from repeated manual tasks to higher value tasks like enabling developers and business
initiatives, protecting information, and so on.
Utilize the Azure Blueprint service to rapidly and consistently deploy application environments that are compliant
with your organization’s policies and external regulations. Azure Blueprint Service automates deployment of
environments including RBAC roles, policies, resources (VM/Net/Storage/etc.), and more. Azure Blueprints builds
on Microsoft’s significant investment into the Azure Resource Manager to standardize resource deployment in
Azure and enable resource deployment and governance based on a desired-state approach. You can use built in
configurations in Azure Blueprint, make your own, or just use Resource Manager scripts for smaller scope.
Several Security and Compliance Blueprints samples are available to use as a starting template.

Evaluate security using benchmarks


Use an industry standard benchmark to evaluate your organizations current security posture.
Benchmarking allows you to improve your security program by learning from external organizations.
Benchmarking lets you know how your current security state compares to that of other organizations, providing
both external validation for successful elements of your current system as well as identifying gaps that serve as
opportunities to enrich your team’s overall security strategy. Even if your security program isn’t tied to a specific
benchmark or regulatory standard, you will benefit from understanding the documented ideal states by those
outside and inside of your industry.
As an example, the Center for Internet Security (CIS) has created security benchmarks for Azure that map to the
CIS Control Framework. Another reference example is the MITRE ATT&CK™ framework that defines the
various adversary tactics and techniques based on real-world observations. These external references control
mappings help you to understand any gaps between your current strategy what you have and what other
experts in the industry.

Audit and enforce policy compliance


Ensure that the security team is auditing the environment to report on compliance with the security policy of the
organization. Security teams may also enforce compliance with these policies.
Organizations of all sizes will have security compliance requirements. Industry, government, and internal
corporate security policies all need to be audited and enforced. Policy monitoring is critical to check that initial
configurations are correct and that it continues to be compliant over time.
In Azure, you can take advantage of Azure Policy to create and manage policies that enforce compliance. Like
Azure Blueprints, Azure Policies are built on the underlying Azure Resource Manager capabilities in the Azure
platform (and Azure Policy can also be assigned via Azure Blueprints).
For more information on how to do this in Azure, please review Tutorial: Create and manage policies to
enforce compliance .

Monitor identity Risk


Monitor identity-related risk events for warning on potentially compromised identities and remediate those risks.
Most security incidents take place after an attacker initially gains access using a stolen identity. These identities can
often start with low privileges, but the attackers then use that identity to traverse laterally and gain access to more
privileged identities. This repeats as needed until the attacker controls access to the ultimate target data or
systems.
Azure Active Directory uses adaptive machine learning algorithms, heuristics, and known compromised
credentials (username/password pairs) to detect suspicious actions that are related to your user accounts. These
username/password pairs come from monitoring public and dark web sites (where attackers often dump
compromised passwords) and by working with security researchers, law enforcement, Security teams at Microsoft,
and others.
There are two places where you review reported risk events:
Azure AD reporting - Risk events are part of Azure AD's security reports. For more information, see
the users at risk security report and the risky sign-ins security report.
Azure AD Identity Protection - Risk events are also part of the reporting capabilities of Azure Active
Directory Identity Protection.
In addition, you can use the Identity Protection risk events API to gain programmatic access to security detections
using Microsoft Graph.
Remediate these risks by manually addressing each reported account or by setting up a user risk policy to require
a password change for these high-risk events.
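As a hedged sketch of that programmatic access (the endpoint path, permission name, and helper class below are assumptions to be validated against the current Microsoft Graph documentation), a monitoring job could pull recent risk detections like this:

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class IdentityRiskMonitor
{
    private static readonly HttpClient http = new HttpClient();

    // accessToken must carry an appropriate Microsoft Graph permission
    // (for example, IdentityRiskEvent.Read.All); acquiring it is out of scope here.
    public static async Task<string> GetRecentRiskDetectionsAsync(string accessToken)
    {
        // Assumed endpoint: the Identity Protection risk detections API in Microsoft Graph.
        var request = new HttpRequestMessage(
            HttpMethod.Get,
            "https://graph.microsoft.com/v1.0/identityProtection/riskDetections?$top=25");
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", accessToken);

        var response = await http.SendAsync(request);
        response.EnsureSuccessStatusCode();

        // Returns the raw JSON payload; a real implementation would deserialize
        // the detections and feed them into alerting or remediation workflows.
        return await response.Content.ReadAsStringAsync();
    }
}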

Penetration testing
Use penetration testing to validate security defenses.
Real-world validation of security defenses is critical to validating your defense strategy and implementation. This can
be accomplished by a penetration test (which simulates a one-time attack) or a red team program (which simulates a persistent
threat actor targeting your environment).
Follow the guidance published by Microsoft for planning and executing simulated attacks.

Discover & replace insecure protocols


Discover and disable the use of legacy insecure protocols such as SMBv1, LM/NTLMv1, WDigest, unsigned LDAP binds,
and weak ciphers in Kerberos.
Authentication protocols are a critical foundation of nearly all security assurances. These older versions can be
exploited by attackers with access to your network and are often used extensively on legacy systems on
Infrastructure as a Service (IaaS).
Here are ways to reduce your risk:
Discover protocol usage by reviewing logs with Azure Sentinel's Insecure Protocol Dashboard or third-
party tools.
Restrict or disable the use of these protocols by following the guidance for SMB, NTLM, and WDigest.
We recommend implementing changes using a pilot or other testing method to mitigate the risk of operational
interruption.

Elevated security capabilities


Consider whether to use specialized security capabilities in your enterprise architecture.
These measures can enhance security and help meet regulatory requirements, but they can also introduce
complexity that may negatively impact your operations and efficiency.
We recommend careful consideration and judicious use of these security measures as required:
Dedicated Hardware Security Modules (HSMs)
Dedicated Hardware Security Modules (HSMs) may help meet regulatory or security requirements.
Confidential Computing
Confidential Computing may help meet regulatory or security requirements.
Cloud Design Patterns

These design patterns are useful for building reliable, scalable, secure applications in the cloud.
Each pattern describes the problem that the pattern addresses, considerations for applying the pattern, and an
example based on Microsoft Azure. Most of the patterns include code samples or snippets that show how to
implement the pattern on Azure. However, most of the patterns are relevant to any distributed system, whether
hosted on Azure or on other cloud platforms.

Challenges in cloud development


Availability
Availability is the proportion of time that the system is
functional and working, usually measured as a percentage
of uptime. It can be affected by system errors,
infrastructure problems, malicious attacks, and system
load. Cloud applications typically provide users with a
service level agreement (SLA), so applications must be
designed to maximize availability.

Data Management
Data management is the key element of cloud
applications, and influences most of the quality attributes.
Data is typically hosted in different locations and across
multiple servers for reasons such as performance,
scalability or availability, and this can present a range of
challenges. For example, data consistency must be
maintained, and data will typically need to be
synchronized across different locations.

Design and Implementation


Good design encompasses factors such as consistency
and coherence in component design and deployment,
maintainability to simplify administration and
development, and reusability to allow components and
subsystems to be used in other applications and in other
scenarios. Decisions made during the design and
implementation phase have a huge impact on the quality
and the total cost of ownership of cloud hosted
applications and services.
Messaging
The distributed nature of cloud applications requires a
messaging infrastructure that connects the components
and services, ideally in a loosely coupled manner in order
to maximize scalability. Asynchronous messaging is widely
used, and provides many benefits, but also brings
challenges such as the ordering of messages, poison
message management, idempotency, and more.

Management and Monitoring


Cloud applications run in a remote datacenter where you
do not have full control of the infrastructure or, in some
cases, the operating system. This can make management
and monitoring more difficult than an on-premises
deployment. Applications must expose runtime
information that administrators and operators can use to
manage and monitor the system, as well as supporting
changing business requirements and customization
without requiring the application to be stopped or
redeployed.

Performance and Scalability


Performance is an indication of the responsiveness of a
system to execute any action within a given time interval,
while scalability is the ability of a system either to handle
increases in load without impact on performance or for
the available resources to be readily increased. Cloud
applications typically encounter variable workloads and
peaks in activity. Predicting these, especially in a
multitenant scenario, is almost impossible. Instead,
applications should be able to scale out within limits to
meet peaks in demand, and scale in when demand
decreases. Scalability concerns not just compute instances,
but other elements such as data storage, messaging
infrastructure, and more.

Resiliency
Resiliency is the ability of a system to gracefully handle
and recover from failures. The nature of cloud hosting,
where applications are often multitenant, use shared
platform services, compete for resources and bandwidth,
communicate over the Internet, and run on commodity
hardware means there is an increased likelihood that both
transient and more permanent faults will arise. Detecting
failures, and recovering quickly and efficiently, is necessary
to maintain resiliency.
Security
Security provides confidentiality, integrity, and availability
assurances against malicious attacks on information
systems (and safety assurances for attacks on operational
technology systems). Losing these assurances can
negatively impact your business operations and revenue,
as well as your organization’s reputation in the
marketplace. Maintaining security requires following well-
established practices (security hygiene) and being vigilant
to detect and rapidly remediate vulnerabilities and active
attacks.

Catalog of patterns
PATTERN | SUMMARY | CATEGORY

Ambassador | Create helper services that send network requests on behalf of a consumer service or application. | Design and Implementation, Management and Monitoring
Anti-Corruption Layer | Implement a façade or adapter layer between a modern application and a legacy system. | Design and Implementation, Management and Monitoring
Asynchronous Request-Reply | Decouple backend processing from a frontend host, where backend processing needs to be asynchronous, but the frontend still needs a clear response. | Messaging
Backends for Frontends | Create separate backend services to be consumed by specific frontend applications or interfaces. | Design and Implementation
Bulkhead | Isolate elements of an application into pools so that if one fails, the others will continue to function. | Resiliency
Cache-Aside | Load data on demand into a cache from a data store. | Data Management, Performance and Scalability
Choreography | Let each service decide when and how a business operation is processed, instead of depending on a central orchestrator. | Messaging, Performance and Scalability
Circuit Breaker | Handle faults that might take a variable amount of time to fix when connecting to a remote service or resource. | Resiliency
Claim Check | Split a large message into a claim check and a payload to avoid overwhelming a message bus. | Messaging
Compensating Transaction | Undo the work performed by a series of steps, which together define an eventually consistent operation. | Resiliency
Competing Consumers | Enable multiple concurrent consumers to process messages received on the same messaging channel. | Messaging
Compute Resource Consolidation | Consolidate multiple tasks or operations into a single computational unit. | Design and Implementation
CQRS | Segregate operations that read data from operations that update data by using separate interfaces. | Data Management, Design and Implementation, Performance and Scalability
Deployment Stamps | Deploy multiple independent copies of application components, including data stores. | Availability, Performance and Scalability
Event Sourcing | Use an append-only store to record the full series of events that describe actions taken on data in a domain. | Data Management, Performance and Scalability
External Configuration Store | Move configuration information out of the application deployment package to a centralized location. | Design and Implementation, Management and Monitoring
Federated Identity | Delegate authentication to an external identity provider. | Security
Gatekeeper | Protect applications and services by using a dedicated host instance that acts as a broker between clients and the application or service, validates and sanitizes requests, and passes requests and data between them. | Security
Gateway Aggregation | Use a gateway to aggregate multiple individual requests into a single request. | Design and Implementation, Management and Monitoring
Gateway Offloading | Offload shared or specialized service functionality to a gateway proxy. | Design and Implementation, Management and Monitoring
Gateway Routing | Route requests to multiple services using a single endpoint. | Design and Implementation, Management and Monitoring
Geodes | Deploy backend services into a set of geographical nodes, each of which can service any client request in any region. | Availability, Performance and Scalability
Health Endpoint Monitoring | Implement functional checks in an application that external tools can access through exposed endpoints at regular intervals. | Availability, Management and Monitoring, Resiliency
Index Table | Create indexes over the fields in data stores that are frequently referenced by queries. | Data Management, Performance and Scalability
Leader Election | Coordinate the actions performed by a collection of collaborating task instances in a distributed application by electing one instance as the leader that assumes responsibility for managing the other instances. | Design and Implementation, Resiliency
Materialized View | Generate prepopulated views over the data in one or more data stores when the data isn't ideally formatted for required query operations. | Data Management, Performance and Scalability
Pipes and Filters | Break down a task that performs complex processing into a series of separate elements that can be reused. | Design and Implementation, Messaging
Priority Queue | Prioritize requests sent to services so that requests with a higher priority are received and processed more quickly than those with a lower priority. | Messaging, Performance and Scalability
Publisher/Subscriber | Enable an application to announce events to multiple interested consumers asynchronously, without coupling the senders to the receivers. | Messaging
Queue-Based Load Leveling | Use a queue that acts as a buffer between a task and a service that it invokes in order to smooth intermittent heavy loads. | Availability, Messaging, Resiliency, Performance and Scalability
Retry | Enable an application to handle anticipated, temporary failures when it tries to connect to a service or network resource by transparently retrying an operation that's previously failed. | Resiliency
Scheduler Agent Supervisor | Coordinate a set of actions across a distributed set of services and other remote resources. | Messaging, Resiliency
Sequential Convoy | Process a set of related messages in a defined order, without blocking processing of other groups of messages. | Messaging
Sharding | Divide a data store into a set of horizontal partitions or shards. | Data Management, Performance and Scalability
Sidecar | Deploy components of an application into a separate process or container to provide isolation and encapsulation. | Design and Implementation, Management and Monitoring
Static Content Hosting | Deploy static content to a cloud-based storage service that can deliver them directly to the client. | Design and Implementation, Data Management, Performance and Scalability
Strangler | Incrementally migrate a legacy system by gradually replacing specific pieces of functionality with new applications and services. | Design and Implementation, Management and Monitoring
Throttling | Control the consumption of resources used by an instance of an application, an individual tenant, or an entire service. | Availability, Performance and Scalability
Valet Key | Use a token or key that provides clients with restricted direct access to a specific resource or service. | Data Management, Security
Availability patterns

Availability is measured as a percentage of uptime, and defines the proportion of time that a system is functional
and working. Availability is affected by system errors, infrastructure problems, malicious attacks, and system
load. Cloud applications typically provide users with a service level agreement (SLA), which means that
applications must be designed and implemented to maximize availability.

PATTERN | SUMMARY

Deployment Stamps | Deploy multiple independent copies of application components, including data stores.
Geodes | Deploy backend services into a set of geographical nodes, each of which can service any client request in any region.
Health Endpoint Monitoring | Implement functional checks in an application that external tools can access through exposed endpoints at regular intervals.
Queue-Based Load Leveling | Use a queue that acts as a buffer between a task and a service that it invokes, to smooth intermittent heavy loads.
Throttling | Control the consumption of resources by an instance of an application, an individual tenant, or an entire service.

To mitigate availability risks from malicious distributed denial of service (DDoS) attacks, implement the
native Azure DDoS Protection Standard service or a third-party capability.
Data Management patterns

Data management is the key element of cloud applications, and influences most of the quality attributes. Data is
typically hosted in different locations and across multiple servers for reasons such as performance, scalability or
availability, and this can present a range of challenges. For example, data consistency must be maintained, and
data will typically need to be synchronized across different locations.
Additionally, data should be protected at rest, in transit, and via authorized access mechanisms to maintain
security assurances of confidentiality, integrity, and availability. Refer to the Azure Security Benchmark data
protection controls for more information.

PATTERN | SUMMARY

Cache-Aside | Load data on demand into a cache from a data store.
CQRS | Segregate operations that read data from operations that update data by using separate interfaces.
Event Sourcing | Use an append-only store to record the full series of events that describe actions taken on data in a domain.
Index Table | Create indexes over the fields in data stores that are frequently referenced by queries.
Materialized View | Generate prepopulated views over the data in one or more data stores when the data isn't ideally formatted for required query operations.
Sharding | Divide a data store into a set of horizontal partitions or shards.
Static Content Hosting | Deploy static content to a cloud-based storage service that can deliver them directly to the client.
Valet Key | Use a token or key that provides clients with restricted direct access to a specific resource or service.
Design and Implementation patterns

Good design encompasses factors such as consistency and coherence in component design and deployment,
maintainability to simplify administration and development, and reusability to allow components and
subsystems to be used in other applications and in other scenarios. Decisions made during the design and
implementation phase have a huge impact on the quality and the total cost of ownership of cloud hosted
applications and services.

PATTERN | SUMMARY

Ambassador | Create helper services that send network requests on behalf of a consumer service or application.
Anti-Corruption Layer | Implement a façade or adapter layer between a modern application and a legacy system.
Backends for Frontends | Create separate backend services to be consumed by specific frontend applications or interfaces.
CQRS | Segregate operations that read data from operations that update data by using separate interfaces.
Compute Resource Consolidation | Consolidate multiple tasks or operations into a single computational unit.
External Configuration Store | Move configuration information out of the application deployment package to a centralized location.
Gateway Aggregation | Use a gateway to aggregate multiple individual requests into a single request.
Gateway Offloading | Offload shared or specialized service functionality to a gateway proxy.
Gateway Routing | Route requests to multiple services using a single endpoint.
Leader Election | Coordinate the actions performed by a collection of collaborating task instances in a distributed application by electing one instance as the leader that assumes responsibility for managing the other instances.
Pipes and Filters | Break down a task that performs complex processing into a series of separate elements that can be reused.
Sidecar | Deploy components of an application into a separate process or container to provide isolation and encapsulation.
Static Content Hosting | Deploy static content to a cloud-based storage service that can deliver them directly to the client.
Strangler | Incrementally migrate a legacy system by gradually replacing specific pieces of functionality with new applications and services.
Management and Monitoring patterns

Cloud applications run in a remote datacenter where you do not have full control of the infrastructure or, in
some cases, the operating system. This can make management and monitoring more difficult than an on-
premises deployment. Applications must expose runtime information that administrators and operators can use
to manage and monitor the system, as well as supporting changing business requirements and customization
without requiring the application to be stopped or redeployed.

PATTERN | SUMMARY

Ambassador | Create helper services that send network requests on behalf of a consumer service or application.
Anti-Corruption Layer | Implement a façade or adapter layer between a modern application and a legacy system.
External Configuration Store | Move configuration information out of the application deployment package to a centralized location.
Gateway Aggregation | Use a gateway to aggregate multiple individual requests into a single request.
Gateway Offloading | Offload shared or specialized service functionality to a gateway proxy.
Gateway Routing | Route requests to multiple services using a single endpoint.
Health Endpoint Monitoring | Implement functional checks in an application that external tools can access through exposed endpoints at regular intervals.
Sidecar | Deploy components of an application into a separate process or container to provide isolation and encapsulation.
Strangler | Incrementally migrate a legacy system by gradually replacing specific pieces of functionality with new applications and services.
Messaging patterns

The distributed nature of cloud applications requires a messaging infrastructure that connects the components
and services, ideally in a loosely coupled manner in order to maximize scalability. Asynchronous messaging is
widely used, and provides many benefits, but also brings challenges such as the ordering of messages, poison
message management, idempotency, and more.

PATTERN | SUMMARY

Asynchronous Request-Reply | Decouple backend processing from a frontend host, where backend processing needs to be asynchronous, but the frontend still needs a clear response.
Claim Check | Split a large message into a claim check and a payload to avoid overwhelming a message bus.
Choreography | Have each component of the system participate in the decision-making process about the workflow of a business transaction, instead of relying on a central point of control.
Competing Consumers | Enable multiple concurrent consumers to process messages received on the same messaging channel.
Pipes and Filters | Break down a task that performs complex processing into a series of separate elements that can be reused.
Priority Queue | Prioritize requests sent to services so that requests with a higher priority are received and processed more quickly than those with a lower priority.
Publisher-Subscriber | Enable an application to announce events to multiple interested consumers asynchronously, without coupling the senders to the receivers.
Queue-Based Load Leveling | Use a queue that acts as a buffer between a task and a service that it invokes in order to smooth intermittent heavy loads.
Scheduler Agent Supervisor | Coordinate a set of actions across a distributed set of services and other remote resources.
Sequential Convoy | Process a set of related messages in a defined order, without blocking processing of other groups of messages.
Performance and Scalability patterns

Performance is an indication of the responsiveness of a system to execute any action within a given time
interval, while scalability is the ability of a system either to handle increases in load without impact on
performance or for the available resources to be readily increased. Cloud applications typically encounter
variable workloads and peaks in activity. Predicting these, especially in a multi-tenant scenario, is almost
impossible. Instead, applications should be able to scale out within limits to meet peaks in demand, and scale
in when demand decreases. Scalability concerns not just compute instances, but other elements such as data
storage, messaging infrastructure, and more.

PATTERN | SUMMARY

Cache-Aside | Load data on demand into a cache from a data store.
Choreography | Have each component of the system participate in the decision-making process about the workflow of a business transaction, instead of relying on a central point of control.
CQRS | Segregate operations that read data from operations that update data by using separate interfaces.
Event Sourcing | Use an append-only store to record the full series of events that describe actions taken on data in a domain.
Deployment Stamps | Deploy multiple independent copies of application components, including data stores.
Geodes | Deploy backend services into a set of geographical nodes, each of which can service any client request in any region.
Index Table | Create indexes over the fields in data stores that are frequently referenced by queries.
Materialized View | Generate prepopulated views over the data in one or more data stores when the data isn't ideally formatted for required query operations.
Priority Queue | Prioritize requests sent to services so that requests with a higher priority are received and processed more quickly than those with a lower priority.
Queue-Based Load Leveling | Use a queue that acts as a buffer between a task and a service that it invokes in order to smooth intermittent heavy loads.
Sharding | Divide a data store into a set of horizontal partitions or shards.
Static Content Hosting | Deploy static content to a cloud-based storage service that can deliver them directly to the client.
Throttling | Control the consumption of resources used by an instance of an application, an individual tenant, or an entire service.
Resiliency patterns

Resiliency is the ability of a system to gracefully handle and recover from failures, both inadvertent and
malicious.
The nature of cloud hosting, where applications are often multi-tenant, use shared platform services, compete
for resources and bandwidth, communicate over the Internet, and run on commodity hardware means there is
an increased likelihood that both transient and more permanent faults will arise. The connected nature of the
internet and the rise in sophistication and volume of attacks increase the likelihood of a security disruption.
Detecting failures and recovering quickly and efficiently is necessary to maintain resiliency.

PATTERN | SUMMARY

Bulkhead | Isolate elements of an application into pools so that if one fails, the others will continue to function.
Circuit Breaker | Handle faults that might take a variable amount of time to fix when connecting to a remote service or resource.
Compensating Transaction | Undo the work performed by a series of steps, which together define an eventually consistent operation.
Health Endpoint Monitoring | Implement functional checks in an application that external tools can access through exposed endpoints at regular intervals.
Leader Election | Coordinate the actions performed by a collection of collaborating task instances in a distributed application by electing one instance as the leader that assumes responsibility for managing the other instances.
Queue-Based Load Leveling | Use a queue that acts as a buffer between a task and a service that it invokes in order to smooth intermittent heavy loads.
Retry | Enable an application to handle anticipated, temporary failures when it tries to connect to a service or network resource by transparently retrying an operation that's previously failed.
Scheduler Agent Supervisor | Coordinate a set of actions across a distributed set of services and other remote resources.

Security resiliency
Achieving security resilience requires a combination of preventive measures to block attacks, responsive
measures to detect and quickly remediate active attacks, and governance to ensure consistent application of best
practices.
Security strategy should include the lessons learned described in the security strategy guidance.
Azure security configurations should align to the best practices and controls in the Azure Security
Benchmark (ASB). Security configurations for Azure services should align to the security baselines for
Azure in the ASB.
Azure architectures should integrate native security capabilities to protect and monitor workloads,
including Azure Defender, Azure DDoS Protection, Azure Firewall, and Azure Web Application Firewall
(WAF).
For a more detailed discussion, see the Cybersecurity Resilience module in the CISO workshop.
Security patterns

Security provides confidentiality, integrity, and availability assurances against malicious attacks on information
systems (and safety assurances for attacks on operational technology systems). Losing these assurances can
negatively impact your business operations and revenue, as well as your organization’s reputation in the
marketplace. Maintaining security requires following well-established practices (security hygiene) and being
vigilant to detect and rapidly remediate vulnerabilities and active attacks.

Patterns
PATTERN | SUMMARY

Federated Identity | Delegate authentication to an external identity provider.
Gatekeeper | Protect applications and services by using a dedicated host instance that acts as a broker between clients and the application or service, validates and sanitizes requests, and passes requests and data between them.
Valet Key | Use a token or key that provides clients with restricted direct access to a specific resource or service.

Key Security Resources


RESOURCE | SUMMARY

Azure Security Benchmarks | Prescriptive best practices and recommendations to integrate into architectures for securing workloads, data, services, and enterprise environments on Azure.
Azure Defender | Native security controls to simplify integration of threat detection and monitoring in Azure architectures.
Security Strategy Guidance | Building and updating a security strategy for cloud adoption and the modern threat environment.
Security Roles and Responsibilities | Guidance on security roles and responsibilities, including definitions of mission/outcome for each organizational function and how each should evolve with the adoption of cloud.
Getting Started Guide for Security | Guidance for planning and implementing security throughout cloud adoption.
Ambassador pattern

Create helper services that send network requests on behalf of a consumer service or application. An ambassador
service can be thought of as an out-of-process proxy that is co-located with the client.
This pattern can be useful for offloading common client connectivity tasks such as monitoring, logging, routing,
security (such as TLS), and resiliency patterns in a language agnostic way. It is often used with legacy applications,
or other applications that are difficult to modify, in order to extend their networking capabilities. It can also enable
a specialized team to implement those features.

Context and problem


Resilient cloud-based applications require features such as circuit breaking, routing, metering and monitoring,
and the ability to make network-related configuration updates. It may be difficult or impossible to update legacy
applications or existing code libraries to add these features, because the code is no longer maintained or can't be
easily modified by the development team.
Network calls may also require substantial configuration for connection, authentication, and authorization. If these
calls are used across multiple applications, built using multiple languages and frameworks, the calls must be
configured for each of these instances. In addition, network and security functionality may need to be managed by
a central team within your organization. With a large code base, it can be risky for that team to update application
code they aren't familiar with.

Solution
Put client frameworks and libraries into an external process that acts as a proxy between your application and
external services. Deploy the proxy on the same host environment as your application to allow control over
routing, resiliency, security features, and to avoid any host-related access restrictions. You can also use the
ambassador pattern to standardize and extend instrumentation. The proxy can monitor performance metrics such
as latency or resource usage, and this monitoring happens in the same host environment as the application.

Features that are offloaded to the ambassador can be managed independently of the application. You can update
and modify the ambassador without disturbing the application's legacy functionality. It also allows for separate,
specialized teams to implement and maintain security, networking, or authentication features that have been
moved to the ambassador.
Ambassador services can be deployed as a sidecar to accompany the lifecycle of a consuming application or
service. Alternatively, if an ambassador is shared by multiple separate processes on a common host, it can be
deployed as a daemon or Windows service. If the consuming service is containerized, the ambassador should be
created as a separate container on the same host, with the appropriate links configured for communication.

Issues and considerations


The proxy adds some latency overhead. Consider whether a client library, invoked directly by the application, is
a better approach.
Consider the possible impact of including generalized features in the proxy. For example, the ambassador
could handle retries, but that might not be safe unless all operations are idempotent.
Consider a mechanism to allow the client to pass some context to the proxy, as well as back to the client. For
example, include HTTP request headers to opt out of retry or specify the maximum number of times to retry.
Consider how you will package and deploy the proxy.
Consider whether to use a single shared instance for all clients or an instance for each client.

When to use this pattern


Use this pattern when you:
Need to build a common set of client connectivity features for multiple languages or frameworks.
Need to offload cross-cutting client connectivity concerns to infrastructure developers or other more
specialized teams.
Need to support cloud or cluster connectivity requirements in a legacy application or an application that is
difficult to modify.
This pattern may not be suitable:
When network request latency is critical. A proxy will introduce some overhead, although minimal, and in
some cases this may affect the application.
When client connectivity features are consumed by a single language. In that case, a better option might be a
client library that is distributed to the development teams as a package.
When connectivity features cannot be generalized and require deeper integration with the client application.

Example
The following diagram shows an application making a request to a remote service via an ambassador proxy. The
ambassador provides routing, circuit breaking, and logging. It calls the remote service and then returns the
response to the client application:
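As a small code sketch of the same interaction (the local port, remote address, and endpoint below are illustrative assumptions, not from the documented example), the application only ever talks to the ambassador on a local port; the ambassador applies the connectivity features and forwards the call to the real remote service:

using System;
using System.Net.Http;
using System.Threading.Tasks;

public class RemoteServiceClient
{
    // The application talks only to the ambassador, which is co-located on the
    // same host (for example, as a sidecar container) and listens on a local port.
    private static readonly HttpClient http = new HttpClient
    {
        BaseAddress = new Uri("http://localhost:8080/")
    };

    // The ambassador handles routing, circuit breaking, TLS, and logging, then
    // forwards the request to the remote service on the application's behalf.
    public Task<string> GetReportAsync(string reportId) =>
        http.GetStringAsync($"reports/{reportId}");
}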
Related guidance
Sidecar pattern
Anti-Corruption Layer pattern

Implement a façade or adapter layer between different subsystems that don't share the same semantics. This layer
translates requests that one subsystem makes to the other subsystem. Use this pattern to ensure that an
application's design is not limited by dependencies on outside subsystems. This pattern was first described by Eric
Evans in Domain-Driven Design.

Context and problem


Most applications rely on other systems for some data or functionality. For example, when a legacy application is
migrated to a modern system, it may still need existing legacy resources. New features must be able to call the
legacy system. This is especially true of gradual migrations, where different features of a larger application are
moved to a modern system over time.
Often these legacy systems suffer from quality issues such as convoluted data schemas or obsolete APIs. The
features and technologies used in legacy systems can vary widely from more modern systems. To interoperate
with the legacy system, the new application may need to support outdated infrastructure, protocols, data models,
APIs, or other features that you wouldn't otherwise put into a modern application.
Maintaining access between new and legacy systems can force the new system to adhere to at least some of the
legacy system's APIs or other semantics. When these legacy features have quality issues, supporting them
"corrupts" what might otherwise be a cleanly designed modern application.
Similar issues can arise with any external system that your development team doesn't control, not just legacy
systems.

Solution
Isolate the different subsystems by placing an anti-corruption layer between them. This layer translates
communications between the two systems, allowing one system to remain unchanged while the other can avoid
compromising its design and technological approach.

The diagram above shows an application with two subsystems. Subsystem A calls to subsystem B through an anti-
corruption layer. Communication between subsystem A and the anti-corruption layer always uses the data model
and architecture of subsystem A. Calls from the anti-corruption layer to subsystem B conform to that subsystem's
data model or methods. The anti-corruption layer contains all of the logic necessary to translate between the two
systems. The layer can be implemented as a component within the application or as an independent service.
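As a minimal sketch of an in-process anti-corruption layer (the interfaces, field names, and legacy client here are invented for illustration), an adapter can expose the modern subsystem's model while translating calls into the legacy system's API and data shapes:

using System.Threading.Tasks;

// Model used by the modern subsystem (subsystem A).
public record Customer(string Id, string DisplayName, string Email);

// Port that subsystem A depends on; it knows nothing about the legacy system.
public interface ICustomerDirectory
{
    Task<Customer> GetCustomerAsync(string id);
}

// Hypothetical legacy client (subsystem B) with its own data shapes and conventions.
public interface ILegacyCrmClient
{
    Task<LegacyCustomerRecord> FetchRecordAsync(int legacyKey);
}

public class LegacyCustomerRecord
{
    public int CUST_KEY { get; set; }
    public string CUST_NM { get; set; }
    public string EMAIL_ADDR { get; set; }
}

// The anti-corruption layer: translates between the two models so that
// subsystem A's design isn't "corrupted" by legacy semantics.
public class CustomerDirectoryAdapter : ICustomerDirectory
{
    private readonly ILegacyCrmClient legacyClient;

    public CustomerDirectoryAdapter(ILegacyCrmClient legacyClient) => this.legacyClient = legacyClient;

    public async Task<Customer> GetCustomerAsync(string id)
    {
        // Translate the modern identifier into the legacy key format.
        var record = await legacyClient.FetchRecordAsync(int.Parse(id));

        // Map legacy fields onto the clean, modern model.
        return new Customer(record.CUST_KEY.ToString(), record.CUST_NM, record.EMAIL_ADDR);
    }
}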

Issues and considerations


The anti-corruption layer may add latency to calls made between the two systems.
The anti-corruption layer adds an additional service that must be managed and maintained.
Consider how your anti-corruption layer will scale.
Consider whether you need more than one anti-corruption layer. You may want to decompose functionality
into multiple services using different technologies or languages, or there may be other reasons to partition the
anti-corruption layer.
Consider how the anti-corruption layer will be managed in relation with your other applications or services.
How will it be integrated into your monitoring, release, and configuration processes?
Make sure transaction and data consistency are maintained and can be monitored.
Consider whether the anti-corruption layer needs to handle all communication between different subsystems,
or just a subset of features.
If the anti-corruption layer is part of an application migration strategy, consider whether it will be permanent,
or will be retired after all legacy functionality has been migrated.

When to use this pattern


Use this pattern when:
A migration is planned to happen over multiple stages, but integration between new and legacy systems needs
to be maintained.
Two or more subsystems have different semantics, but still need to communicate.
This pattern may not be suitable if there are no significant semantic differences between new and legacy systems.

Related guidance
Strangler pattern
Asynchronous Request-Reply pattern

Decouple backend processing from a frontend host, where backend processing needs to be asynchronous, but the
frontend still needs a clear response.

Context and problem


In modern application development, it's normal for client applications — often code running in a web-client
(browser) — to depend on remote APIs to provide business logic and compose functionality. These APIs may be
directly related to the application or may be shared services provided by a third party. Commonly these API calls
take place over the HTTP(S) protocol and follow REST semantics.
In most cases, APIs for a client application are designed to respond quickly, on the order of 100 ms or less. Many
factors can affect the response latency, including:
An application's hosting stack.
Security components.
The relative geographic location of the caller and the backend.
Network infrastructure.
Current load.
The size of the request payload.
Processing queue length.
The time for the backend to process the request.
Any of these factors can add latency to the response. Some can be mitigated by scaling out the backend. Others,
such as network infrastructure, are largely out of the control of the application developer. Most APIs can respond
quickly enough for responses to arrive back over the same connection. Application code can make a synchronous
API call in a non-blocking way, giving the appearance of asynchronous processing, which is recommended for I/O-
bound operations.
In some scenarios, however, the work done by the backend may be long-running, on the order of seconds, or might be
a background process that is executed in minutes or even hours. In that case, it isn't feasible to wait for the work to
complete before responding to the request. This situation is a potential problem for any synchronous request-
reply pattern.
Some architectures solve this problem by using a message broker to separate the request and response stages.
This separation is often achieved by use of the Queue-Based Load Leveling pattern. This separation can allow the
client process and the backend API to scale independently. But this separation also brings additional complexity
when the client requires success notification, as this step needs to become asynchronous.
Many of the same considerations discussed for client applications also apply for server-to-server REST API calls in
distributed systems — for example, in a microservices architecture.

Solution
One solution to this problem is to use HTTP polling. Polling is useful to client-side code, as it can be hard to
provide call-back endpoints or use long running connections. Even when callbacks are possible, the extra libraries
and services that are required can sometimes add too much extra complexity.
The client application makes a synchronous call to the API, triggering a long-running operation on the
backend.
The API responds synchronously as quickly as possible. It returns an HTTP 202 (Accepted) status code,
acknowledging that the request has been received for processing.

NOTE
The API should validate both the request and the action to be performed before starting the long running process.
If the request is invalid, reply immediately with an error code such as HTTP 400 (Bad Request).

The response holds a location reference pointing to an endpoint that the client can poll to check for the
result of the long running operation.
The API offloads processing to another component, such as a message queue.
While the work is still pending, the status endpoint returns HTTP 202. Once the work is complete, the status
endpoint can either return a resource that indicates completion, or redirect to another resource URL. For
example, if the asynchronous operation creates a new resource, the status endpoint would redirect to the
URL for that resource.
The following diagram shows a typical flow:

1. The client sends a request and receives an HTTP 202 (Accepted) response.
2. The client sends an HTTP GET request to the status endpoint. The work is still pending, so this call also returns
HTTP 202.
3. At some point, the work is complete and the status endpoint returns 302 (Found) redirecting to the resource.
4. The client fetches the resource at the specified URL.
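The client side of this flow can be implemented with plain HTTP calls. The following sketch is not part of the documented sample; it assumes the API behaves as described above (an HTTP 202 with a Location header, then 202 while pending and a redirect or success code when done) and shows one way a caller might poll, honoring Retry-After:

using System;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class LongRunningOperationClient
{
    private static readonly HttpClient http = new HttpClient();

    // Hypothetical helper: submits work, then polls the status endpoint until
    // the backend redirects to (or returns) the completed resource.
    public static async Task<string> SubmitAndWaitAsync(string apiUrl, string jsonPayload)
    {
        var submit = await http.PostAsync(
            apiUrl, new StringContent(jsonPayload, Encoding.UTF8, "application/json"));

        if (submit.StatusCode != HttpStatusCode.Accepted)
        {
            throw new InvalidOperationException($"Unexpected status: {submit.StatusCode}");
        }

        // HTTP 202 carries the status endpoint in the Location header.
        Uri statusUrl = submit.Headers.Location;

        while (true)
        {
            var status = await http.GetAsync(statusUrl);

            if (status.StatusCode == HttpStatusCode.Accepted)
            {
                // Work is still pending; wait for the interval suggested by Retry-After (default 5 seconds).
                TimeSpan delay = status.Headers.RetryAfter?.Delta ?? TimeSpan.FromSeconds(5);
                await Task.Delay(delay);
                continue;
            }

            // Redirects are followed automatically by HttpClient, so a completed
            // operation surfaces here as the final resource content.
            return await status.Content.ReadAsStringAsync();
        }
    }
}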

Issues and considerations


There are a number of possible ways to implement this pattern over HTTP and not all upstream services
have the same semantics. For example, most services won't return an HTTP 202 response back from a GET
method when a remote process hasn't finished. Following pure REST semantics, they should return HTTP
404 (Not Found). This response makes sense when you consider the result of the call isn't present yet.
An HTTP 202 response should indicate the location and frequency that the client should poll for the
response. It should have the following additional headers:
HEADER | DESCRIPTION | NOTES

Location | A URL the client should poll for a response status. | This URL could be a SAS token, with the Valet Key pattern being appropriate if this location needs access control. The valet key pattern is also valid when response polling needs offloading to another backend.
Retry-After | An estimate of when processing will complete. | This header is designed to prevent polling clients from overwhelming the back-end with retries.

You may need to use a processing proxy or facade to manipulate the response headers or payload
depending on the underlying services used.
If the status endpoint redirects on completion, either HTTP 302 or HTTP 303 are appropriate return codes,
depending on the exact semantics you support.
Upon successful processing, the resource specified by the Location header should return an appropriate
HTTP response code such as 200 (OK), 201 (Created), or 204 (No Content).
If an error occurs during processing, persist the error at the resource URL described in the Location header
and ideally return an appropriate response code to the client from that resource (4xx code).
Not all solutions will implement this pattern in the same way and some services will include additional or
alternate headers. For example, Azure Resource Manager uses a modified variant of this pattern. For more
information, see Azure Resource Manager Async Operations.
Legacy clients might not support this pattern. In that case, you might need to place a facade over the
asynchronous API to hide the asynchronous processing from the original client. For example, Azure Logic
Apps supports this pattern natively and can be used as an integration layer between an asynchronous API and a
client that makes synchronous calls. See Perform long-running tasks with the webhook action pattern.
In some scenarios, you might want to provide a way for clients to cancel a long-running request. In that
case, the backend service must support some form of cancellation instruction.

When to use this pattern


Use this pattern for:
Client-side code, such as browser applications, where it's difficult to provide call-back endpoints, or the use
of long-running connections adds too much additional complexity.
Service calls where only the HTTP protocol is available and the return service can't fire callbacks because of
firewall restrictions on the client-side.
Service calls that need to be integrated with legacy architectures that don't support modern callback
technologies such as WebSockets or webhooks.
This pattern might not be suitable when:
You can use a service built for asynchronous notifications instead, such as Azure Event Grid.
Responses must stream in real time to the client.
The client needs to collect many results, and received latency of those results is important. Consider a service
bus pattern instead.
You can use server-side persistent network connections such as WebSockets or SignalR. These services can be
used to notify the caller of the result.
The network design allows you to open up ports to receive asynchronous callbacks or webhooks.

Example
The following code shows excerpts from an application that uses Azure Functions to implement this pattern. There
are three functions in the solution:
The asynchronous API endpoint.
The status endpoint.
A backend function that takes queued work items and executes them.

This sample is available on GitHub.


AsyncProcessingWorkAcceptor function
The AsyncProcessingWorkAcceptor function implements an endpoint that accepts work from a client application
and puts it on a queue for processing.
The function generates a request ID and adds it as metadata to the queue message.
The HTTP response includes a location header pointing to a status endpoint. The request ID is part of the URL
path.
public static class AsyncProcessingWorkAcceptor
{
    [FunctionName("AsyncProcessingWorkAcceptor")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Anonymous, "post", Route = null)] CustomerPOCO customer,
        [ServiceBus("outqueue", Connection = "ServiceBusConnectionAppSetting")] IAsyncCollector<Message> OutMessage,
        ILogger log)
    {
        if (String.IsNullOrEmpty(customer.id) || String.IsNullOrEmpty(customer.customername))
        {
            return new BadRequestResult();
        }

        string reqid = Guid.NewGuid().ToString();

        string rqs = $"http://{Environment.GetEnvironmentVariable("WEBSITE_HOSTNAME")}/api/RequestStatus/{reqid}";

        var messagePayload = JsonConvert.SerializeObject(customer);

        Message m = new Message(Encoding.UTF8.GetBytes(messagePayload));
        m.UserProperties["RequestGUID"] = reqid;
        m.UserProperties["RequestSubmittedAt"] = DateTime.Now;
        m.UserProperties["RequestStatusURL"] = rqs;

        await OutMessage.AddAsync(m);

        return (ActionResult) new AcceptedResult(rqs, $"Request Accepted for Processing{Environment.NewLine}ProxyStatus: {rqs}");
    }
}

AsyncProcessingBackgroundWorker function
The AsyncProcessingBackgroundWorker function picks up the operation from the queue, does some work based on
the message payload, and writes the result to the SAS signature location.

public static class AsyncProcessingBackgroundWorker
{
    [FunctionName("AsyncProcessingBackgroundWorker")]
    public static async Task Run(
        [ServiceBusTrigger("outqueue", Connection = "ServiceBusConnectionAppSetting")] Message myQueueItem,
        [Blob("data", FileAccess.ReadWrite, Connection = "StorageConnectionAppSetting")] CloudBlobContainer inputBlob,
        ILogger log)
    {
        // Perform an actual action against the blob data source for the async readers to be able to check against.
        // This is where your actual service worker processing will be performed.

        var id = myQueueItem.UserProperties["RequestGUID"] as string;

        CloudBlockBlob cbb = inputBlob.GetBlockBlobReference($"{id}.blobdata");

        // Now write the results to blob storage.
        await cbb.UploadFromByteArrayAsync(myQueueItem.Body, 0, myQueueItem.Body.Length);
    }
}

AsyncOperationStatusChecker function
The AsyncOperationStatusChecker function implements the status endpoint. This function first checks whether the
request was completed.
If the request was completed, the function either returns a valet key to the response, or immediately redirects the call
to the valet key URL.
If the request is still pending, the function returns HTTP 202 (Accepted) with a self-referencing Location header,
and puts an ETA for a completed response in the Retry-After header.

public static class AsyncOperationStatusChecker
{
    [FunctionName("AsyncOperationStatusChecker")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "RequestStatus/{thisGUID}")] HttpRequest req,
        [Blob("data/{thisGuid}.blobdata", FileAccess.Read, Connection = "StorageConnectionAppSetting")] CloudBlockBlob inputBlob, string thisGUID,
        ILogger log)
    {
        OnCompleteEnum OnComplete = Enum.Parse<OnCompleteEnum>(req.Query["OnComplete"].FirstOrDefault() ?? "Redirect");
        OnPendingEnum OnPending = Enum.Parse<OnPendingEnum>(req.Query["OnPending"].FirstOrDefault() ?? "Accepted");

        log.LogInformation($"C# HTTP trigger function processed a request for status on {thisGUID} - OnComplete {OnComplete} - OnPending {OnPending}");

        // Check to see if the blob is present.
        if (await inputBlob.ExistsAsync())
        {
            // If it's present, depending on the value of the optional "OnComplete" parameter choose what to do.
            return await OnCompleted(OnComplete, inputBlob, thisGUID);
        }
        else
        {
            // If it's NOT present, check the optional "OnPending" parameter.
            string rqs = $"http://{Environment.GetEnvironmentVariable("WEBSITE_HOSTNAME")}/api/RequestStatus/{thisGUID}";

            switch (OnPending)
            {
                case OnPendingEnum.Accepted:
                {
                    // Return an HTTP 202 status code.
                    return (ActionResult)new AcceptedResult() { Location = rqs };
                }

                case OnPendingEnum.Synchronous:
                {
                    // Back off and retry. Time out if the backoff period hits one minute.
                    int backoff = 250;

                    while (!await inputBlob.ExistsAsync() && backoff < 64000)
                    {
                        log.LogInformation($"Synchronous mode {thisGUID}.blob - retrying in {backoff} ms");
                        backoff = backoff * 2;
                        await Task.Delay(backoff);
                    }

                    if (await inputBlob.ExistsAsync())
                    {
                        log.LogInformation($"Synchronous Redirect mode {thisGUID}.blob - completed after {backoff} ms");
                        return await OnCompleted(OnComplete, inputBlob, thisGUID);
                    }
                    else
                    {
                        log.LogInformation($"Synchronous mode {thisGUID}.blob - NOT FOUND after timeout {backoff} ms");
                        return (ActionResult)new NotFoundResult();
                    }
                }

                default:
                {
                    throw new InvalidOperationException($"Unexpected value: {OnPending}");
                }
            }
        }
    }

    private static async Task<IActionResult> OnCompleted(OnCompleteEnum OnComplete, CloudBlockBlob inputBlob, string thisGUID)
    {
        switch (OnComplete)
        {
            case OnCompleteEnum.Redirect:
            {
                // Redirect to the SAS URI to blob storage.
                return (ActionResult)new RedirectResult(inputBlob.GenerateSASURI());
            }

            case OnCompleteEnum.Stream:
            {
                // Download the file and return it directly to the caller.
                // For larger files, use a stream to minimize RAM usage.
                return (ActionResult)new OkObjectResult(await inputBlob.DownloadTextAsync());
            }

            default:
            {
                throw new InvalidOperationException($"Unexpected value: {OnComplete}");
            }
        }
    }
}

public enum OnCompleteEnum {

Redirect,
Stream
}

public enum OnPendingEnum {

Accepted,
Synchronous
}

Related guidance and next steps


The following information may be relevant when implementing this pattern:
Asynchronous operations in REST
Azure Logic Apps - Perform long-running tasks with the polling action pattern.
For general best practices when designing a web API, see Web API design.
Backends for Frontends pattern

Create separate backend services to be consumed by specific frontend applications or interfaces. This pattern is
useful when you want to avoid customizing a single backend for multiple interfaces. This pattern was first
described by Sam Newman.

Context and problem


An application may initially be targeted at a desktop web UI. Typically, a backend service is developed in parallel
that provides the features needed for that UI. As the application's user base grows, a mobile application is
developed that must interact with the same backend. The backend service becomes a general-purpose backend,
serving the requirements of both the desktop and mobile interfaces.
But the capabilities of a mobile device differ significantly from a desktop browser, in terms of screen size,
performance, and display limitations. As a result, the requirements for a mobile application backend differ from
the desktop web UI.
These differences result in competing requirements for the backend. The backend requires regular and significant
changes to serve both the desktop web UI and the mobile application. Often, separate interface teams work on
each frontend, causing the backend to become a bottleneck in the development process. Conflicting update
requirements, and the need to keep the service working for both frontends, can result in spending a lot of effort
on a single deployable resource.

As the development activity focuses on the backend service, a separate team may be created to manage and
maintain the backend. Ultimately, this results in a disconnect between the interface and backend development
teams, placing a burden on the backend team to balance the competing requirements of the different UI teams.
When one interface team requires changes to the backend, those changes must be validated with other interface
teams before they can be integrated into the backend.

Solution
Create one backend per user interface. Fine-tune the behavior and performance of each backend to best match the
needs of the frontend environment, without worrying about affecting other frontend experiences.

Because each backend is specific to one interface, it can be optimized for that interface. As a result, it will be
smaller, less complex, and likely faster than a generic backend that tries to satisfy the requirements for all
interfaces. Each interface team has autonomy to control their own backend and doesn't rely on a centralized
backend development team. This gives the interface team flexibility in language selection, release cadence,
prioritization of workload, and feature integration in their backend.
For more information, see Pattern: Backends For Frontends.
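As an illustrative sketch only (the endpoint and response shape are assumptions, using ASP.NET Core minimal APIs), a mobile backend-for-frontend might expose a response trimmed for small screens, while a separate desktop BFF returns a richer payload built from the same underlying domain services:

// Program.cs for a hypothetical mobile BFF (ASP.NET Core minimal API, .NET 6 or later).
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// The mobile BFF returns only the fields the mobile UI needs; a real implementation
// would call the shared catalog/domain services rather than return a literal object.
app.MapGet("/api/products/{id}", (string id) =>
{
    return Results.Ok(new
    {
        id,
        name = "Example product",
        thumbnailUrl = $"https://cdn.example.com/thumbs/{id}.jpg",
        price = 19.99m
    });
});

app.Run();

The desktop BFF would be a separate, independently deployable project owned by the desktop interface team, free to evolve its endpoints and payloads without coordinating with the mobile team.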

Issues and considerations


Consider how many backends to deploy.
If different interfaces (such as mobile clients) will make the same requests, consider whether it is necessary to
implement a backend for each interface, or if a single backend will suffice.
Code duplication across services is highly likely when implementing this pattern.
Frontend-focused backend services should only contain client-specific logic and behavior. General business
logic and other global features should be managed elsewhere in your application.
Think about how this pattern might be reflected in the responsibilities of a development team.
Consider how long it will take to implement this pattern. Will the effort of building the new backends incur
technical debt, while you continue to support the existing generic backend?

When to use this pattern


Use this pattern when:
A shared or general purpose backend service must be maintained with significant development overhead.
You want to optimize the backend for the requirements of specific client interfaces.
Customizations are made to a general-purpose backend to accommodate multiple interfaces.
An alternative language is better suited for the backend of a different user interface.
This pattern may not be suitable:
When interfaces make the same or similar requests to the backend.
When only one interface is used to interact with the backend.

Related guidance
Pattern: Backends For Frontends
Gateway Aggregation pattern
Gateway Offloading pattern
Gateway Routing pattern
Bulkhead pattern

The Bulkhead pattern is a type of application design that is tolerant of failure. In a bulkhead architecture, elements
of an application are isolated into pools so that if one fails, the others will continue to function. It's named after
the sectioned partitions (bulkheads) of a ship's hull. If the hull of a ship is compromised, only the damaged
section fills with water, which prevents the ship from sinking.

Context and problem


A cloud-based application may include multiple services, with each service having one or more consumers.
Excessive load or failure in a service will impact all consumers of the service.
Moreover, a consumer may send requests to multiple services simultaneously, using resources for each request.
When the consumer sends a request to a service that is misconfigured or not responding, the resources used by
the client's request may not be freed in a timely manner. As requests to the service continue, those resources may
be exhausted. For example, the client's connection pool may be exhausted. At that point, requests by the
consumer to other services are affected. Eventually the consumer can no longer send requests to other services,
not just the original unresponsive service.
The same issue of resource exhaustion affects services with multiple consumers. A large number of requests
originating from one client may exhaust available resources in the service. Other consumers are no longer able to
consume the service, causing a cascading failure effect.

Solution
Partition service instances into different groups, based on consumer load and availability requirements. This
design helps to isolate failures, and allows you to sustain service functionality for some consumers, even during a
failure.
A consumer can also partition resources, to ensure that resources used to call one service don't affect the
resources used to call another service. For example, a consumer that calls multiple services may be assigned a
connection pool for each service. If a service begins to fail, it only affects the connection pool assigned for that
service, allowing the consumer to continue using the other services.
The benefits of this pattern include:
Isolates consumers and services from cascading failures. An issue affecting a consumer or service can be
isolated within its own bulkhead, preventing the entire solution from failing.
Allows you to preserve some functionality in the event of a service failure. Other services and features of the
application will continue to work.
Allows you to deploy services that offer a different quality of service for consuming applications. A high-
priority consumer pool can be configured to use high-priority services.
The following diagram shows bulkheads structured around connection pools that call individual services. If
Service A fails or causes some other issue, the connection pool is isolated, so only workloads using the thread
pool assigned to Service A are affected. Workloads that use Service B and C are not affected and can continue
working without interruption.
The next diagram shows multiple clients calling a single service. Each client is assigned a separate service
instance. Client 1 has made too many requests and overwhelmed its instance. Because each service instance is
isolated from the others, the other clients can continue making calls.
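
The following sketch illustrates consumer-side partitioning in C#. It assumes the Polly resilience library (mentioned in the considerations below); the service URLs, limits, and client names are illustrative only. Each downstream service gets its own HttpClient (its own connection pool) and its own bulkhead policy, so a slow or failing Service A cannot exhaust the resources used to call Service B.

// A consumer-side sketch using the Polly library; service URLs and limits are illustrative.
using Polly;
using Polly.Bulkhead;

// Separate HttpClient instances give each downstream service its own connection pool.
var serviceAClient = new HttpClient(new SocketsHttpHandler { MaxConnectionsPerServer = 20 })
    { BaseAddress = new Uri("https://service-a.example.com") };
var serviceBClient = new HttpClient(new SocketsHttpHandler { MaxConnectionsPerServer = 20 })
    { BaseAddress = new Uri("https://service-b.example.com") };

// Separate bulkheads cap the concurrent (and queued) calls per service.
var bulkheadA = Policy.BulkheadAsync(maxParallelization: 10, maxQueuingActions: 20);
var bulkheadB = Policy.BulkheadAsync(maxParallelization: 10, maxQueuingActions: 20);

try
{
    // A slowdown in Service A can only saturate bulkheadA and serviceAClient;
    // calls to Service B keep their own bulkhead and connection pool.
    var response = await bulkheadA.ExecuteAsync(() => serviceAClient.GetAsync("/api/orders/42"));
}
catch (BulkheadRejectedException)
{
    // The Service A partition is full; degrade gracefully without affecting Service B calls.
}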

Issues and considerations


Define partitions around the business and technical requirements of the application.
When partitioning services or consumers into bulkheads, consider the level of isolation offered by the
technology as well as the overhead in terms of cost, performance and manageability.
Consider combining bulkheads with retry, circuit breaker, and throttling patterns to provide more
sophisticated fault handling.
When partitioning consumers into bulkheads, consider using processes, thread pools, and semaphores.
Projects like resilience4j and Polly offer a framework for creating consumer bulkheads.
When partitioning services into bulkheads, consider deploying them into separate virtual machines,
containers, or processes. Containers offer a good balance of resource isolation with fairly low overhead.
Services that communicate using asynchronous messages can be isolated through different sets of queues.
Each queue can have a dedicated set of instances processing messages on the queue, or a single group of
instances using an algorithm to dequeue and dispatch processing.
Determine the level of granularity for the bulkheads. For example, if you want to distribute tenants across
partitions, you could place each tenant into a separate partition, or put several tenants into one partition.
Monitor each partition’s performance and SLA.
When to use this pattern
Use this pattern to:
Isolate resources used to consume a set of backend services, especially if the application can provide some
level of functionality even when one of the services is not responding.
Isolate critical consumers from standard consumers.
Protect the application from cascading failures.
This pattern may not be suitable when:
Less efficient use of resources may not be acceptable in the project.
The added complexity is not necessary.

Example
The following Kubernetes configuration file creates an isolated container to run a single service, with its own CPU
and memory resources and limits.

apiVersion: v1
kind: Pod
metadata:
  name: drone-management
spec:
  containers:
  - name: drone-management-container
    image: drone-service
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "1"

Related guidance
Designing reliable Azure applications
Circuit Breaker pattern
Retry pattern
Throttling pattern
Cache-Aside pattern

Load data on demand into a cache from a data store. This can improve performance and also helps to maintain
consistency between data held in the cache and data in the underlying data store.

Context and problem


Applications use a cache to improve repeated access to information held in a data store. However, it's impractical
to expect that cached data will always be completely consistent with the data in the data store. Applications
should implement a strategy that helps to ensure that the data in the cache is as up-to-date as possible, but can
also detect and handle situations that arise when the data in the cache has become stale.

Solution
Many commercial caching systems provide read-through and write-through/write-behind operations. In these
systems, an application retrieves data by referencing the cache. If the data isn't in the cache, it's retrieved from
the data store and added to the cache. Any modifications to data held in the cache are automatically written back
to the data store as well.
For caches that don't provide this functionality, it's the responsibility of the applications that use the cache to
maintain the data.
An application can emulate the functionality of read-through caching by implementing the cache-aside strategy.
This strategy loads data into the cache on demand. The figure illustrates using the Cache-Aside pattern to store
data in the cache.

If an application updates information, it can follow the write-through strategy by making the modification to the
data store, and by invalidating the corresponding item in the cache.
When the item is next required, using the cache-aside strategy will cause the updated data to be retrieved from
the data store and added back into the cache.
Issues and considerations
Consider the following points when deciding how to implement this pattern:
Lifetime of cached data. Many caches implement an expiration policy that invalidates data and removes it
from the cache if it's not accessed for a specified period. For cache-aside to be effective, ensure that the
expiration policy matches the pattern of access for applications that use the data. Don't make the expiration
period too short because this can cause applications to continually retrieve data from the data store and add it to
the cache. Similarly, don't make the expiration period so long that the cached data is likely to become stale.
Remember that caching is most effective for relatively static data, or data that is read frequently.
Evicting data. Most caches have a limited size compared to the data store where the data originates, and they'll
evict data if necessary. Most caches adopt a least-recently-used policy for selecting items to evict, but this might
be customizable. Configure the global expiration property and other properties of the cache, and the expiration
property of each cached item, to ensure that the cache is cost effective. It isn't always appropriate to apply a
global eviction policy to every item in the cache. For example, if a cached item is very expensive to retrieve from
the data store, it can be beneficial to keep this item in the cache at the expense of more frequently accessed but
less costly items.
Priming the cache. Many solutions prepopulate the cache with the data that an application is likely to need as
part of the startup processing. The Cache-Aside pattern can still be useful if some of this data expires or is
evicted.
Consistency. Implementing the Cache-Aside pattern doesn't guarantee consistency between the data store and
the cache. An item in the data store can be changed at any time by an external process, and this change might
not be reflected in the cache until the next time the item is loaded. In a system that replicates data across data
stores, this problem can become serious if synchronization occurs frequently.
Local (in-memory) caching. A cache could be local to an application instance and stored in-memory. Cache-
aside can be useful in this environment if an application repeatedly accesses the same data. However, a local
cache is private and so different application instances could each have a copy of the same cached data. This data
could quickly become inconsistent between caches, so it might be necessary to expire data held in a private
cache and refresh it more frequently. In these scenarios, consider investigating the use of a shared or a
distributed caching mechanism.

When to use this pattern


Use this pattern when:
A cache doesn't provide native read-through and write-through operations.
Resource demand is unpredictable. This pattern enables applications to load data on demand. It makes no
assumptions about which data an application will require in advance.
This pattern might not be suitable:
When the cached data set is static. If the data will fit into the available cache space, prime the cache with the
data on startup and apply a policy that prevents the data from expiring.
For caching session state information in a web application hosted in a web farm. In this environment, you
should avoid introducing dependencies based on client-server affinity.

Example
In Microsoft Azure you can use Azure Cache for Redis to create a distributed cache that can be shared by multiple
instances of an application.
The following code examples use the StackExchange.Redis client, which is a Redis client library written for .NET.
To connect to an Azure Cache for Redis instance, call the static ConnectionMultiplexer.Connect method and pass
in the connection string. The method returns a ConnectionMultiplexer that represents the connection. One
approach to sharing a ConnectionMultiplexer instance in your application is to have a static property that returns
a connected instance, similar to the following example. This approach provides a thread-safe way to initialize
only a single connected instance.

// Redis connection string information
private static Lazy<ConnectionMultiplexer> lazyConnection = new Lazy<ConnectionMultiplexer>(() =>
{
    string cacheConnection = ConfigurationManager.AppSettings["CacheConnection"].ToString();
    return ConnectionMultiplexer.Connect(cacheConnection);
});

public static ConnectionMultiplexer Connection => lazyConnection.Value;

The GetMyEntityAsync method in the following code example shows an implementation of the Cache-Aside
pattern. This method retrieves an object from the cache using the read-through approach.
An object is identified by using an integer ID as the key. The GetMyEntityAsync method tries to retrieve an item
with this key from the cache. If a matching item is found, it's returned. If there's no match in the cache, the
GetMyEntityAsync method retrieves the object from a data store, adds it to the cache, and then returns it. The
code that actually reads the data from the data store is not shown here, because it depends on the data store.
Note that the cached item is configured to expire to prevent it from becoming stale if it's updated elsewhere.

// Set five minute expiration as a default
private const double DefaultExpirationTimeInMinutes = 5.0;

public async Task<MyEntity> GetMyEntityAsync(int id)
{
    // Define a unique key for this method and its parameters.
    var key = $"MyEntity:{id}";
    var cache = Connection.GetDatabase();

    // Try to get the entity from the cache.
    var json = await cache.StringGetAsync(key).ConfigureAwait(false);
    var value = string.IsNullOrWhiteSpace(json)
        ? default(MyEntity)
        : JsonConvert.DeserializeObject<MyEntity>(json);

    if (value == null) // Cache miss
    {
        // If there's a cache miss, get the entity from the original store and cache it.
        // Code has been omitted because it is data store dependent.
        value = ...;

        // Avoid caching a null value.
        if (value != null)
        {
            // Put the item in the cache with a custom expiration time that
            // depends on how critical it is to have stale data.
            await cache.StringSetAsync(key, JsonConvert.SerializeObject(value)).ConfigureAwait(false);
            await cache.KeyExpireAsync(key, TimeSpan.FromMinutes(DefaultExpirationTimeInMinutes)).ConfigureAwait(false);
        }
    }

    return value;
}
The examples use Azure Cache for Redis to access the store and retrieve information from the cache. For
more information, see Using Azure Cache for Redis and How to create a Web App with Azure Cache for
Redis.

The UpdateEntityAsync method shown below demonstrates how to invalidate an object in the cache when the
value is changed by the application. The code updates the original data store and then removes the cached item
from the cache.

public async Task UpdateEntityAsync(MyEntity entity)
{
    // Update the object in the original data store.
    await this.store.UpdateEntityAsync(entity).ConfigureAwait(false);

    // Invalidate the current cache object.
    var cache = Connection.GetDatabase();
    var id = entity.Id;
    var key = $"MyEntity:{id}"; // The key for the cached object.
    await cache.KeyDeleteAsync(key).ConfigureAwait(false); // Delete this key from the cache.
}

NOTE
The order of the steps is important. Update the data store before removing the item from the cache. If you remove the
cached item first, there is a small window of time when a client might fetch the item before the data store is updated. That
will result in a cache miss (because the item was removed from the cache), causing the earlier version of the item to be
fetched from the data store and added back into the cache. The result will be stale cache data.

Related guidance
The following information may be relevant when implementing this pattern:
Caching Guidance. Provides additional information on how you can cache data in a cloud solution, and the
issues that you should consider when you implement a cache.
Data Consistency Primer. Cloud applications typically use data that's spread across data stores. Managing
and maintaining data consistency in this environment is a critical aspect of the system, particularly the
concurrency and availability issues that can arise. This primer describes issues about consistency across
distributed data, and summarizes how an application can implement eventual consistency to maintain the
availability of data.
Choreography

Have each component of the system participate in the decision-making process about the workflow of a business
transaction, instead of relying on a central point of control.

Context and problem


In microservices architecture, it’s often the case that a cloud-based application is divided into several small
services that work together to process a business transaction end-to-end. To lower coupling between services,
each service is responsible for a single business operation. Some benefits include faster development, smaller
code base, and scalability. However, designing an efficient and scalable workflow is a challenge and often requires
complex interservice communication.
The services communicate with each other by using well-defined APIs. Even a single business operation can result
in multiple point-to-point calls among all services. A common pattern for communication is to use a centralized
service that acts as the orchestrator. It acknowledges all incoming requests and delegates operations to the
respective services. In doing so, it also manages the workflow of the entire business transaction. Each service just
completes an operation and is not aware of the overall workflow.
The orchestrator pattern reduces point-to-point communication between services but has some drawbacks
because of the tight coupling between the orchestrator and other services that participate in processing of the
business transaction. To execute tasks in a sequence, the orchestrator needs to have some domain knowledge
about the responsibilities of those services. If you want to add or remove services, existing logic will break, and
you'll need to rewire portions of the communication path. Although a well-designed orchestrator can make it easier to configure the workflow and to add or remove services, such an implementation is complex and hard to maintain.

Solution
Let each service decide when and how a business operation is processed, instead of depending on a central
orchestrator.
One way to implement choreography is to use the asynchronous messaging pattern to coordinate the business
operations.
A client request publishes messages to a message queue. As messages arrive, they are pushed to subscribers, or
services, interested in that message. Each subscribed service performs its operation as indicated by the message and responds to the message queue with the success or failure of the operation. In case of success, the service can
push a message back to the same queue or a different message queue so that another service can continue the
workflow if needed. If an operation fails, the message bus can retry that operation.
This way, the services choreograph the workflow among themselves without depending on an orchestrator or
having direct communication between them.
Because there isn't point-to-point communication, this pattern helps reduce coupling between services. Also, it
can remove the performance bottleneck caused by the orchestrator when it has to deal with all transactions.

When to use this pattern


Use the choreography pattern if you expect to update, remove, or add new services frequently. The entire app can be modified with less effort and minimal disruption to existing services.
Consider this pattern if you experience performance bottlenecks in the central orchestrator.
This pattern is a natural model for serverless architectures where all services can be short-lived or event-driven. Services can spin up because of an event, do their task, and are removed when the task is finished.

Issues and considerations


Decentralizing the orchestrator can cause issues while managing the workflow.
If a service fails to complete a business operation, it can be difficult to recover from that failure. One way is to
have the service indicate failure by firing an event. Another service subscribes to those failed events and takes necessary actions, such as applying compensating transactions to undo successful operations in a request. The failed service might also fail to fire an event for the failure. In that case, consider using a retry and/or timeout
mechanism to recognize that operation as a failure. For an example, see the Example section.
It's simple to implement a workflow when you want to process independent business operations in parallel. You
can use a single message bus. However, the workflow can become complicated when choreography needs to
occur in a sequence. For instance, Service C can start its operation only after Service A and Service B have
completed their operations with success. One approach is to have multiple message buses that get messages in
the required order. For more information, see the Example section.
The choreography pattern becomes a challenge if the number of services grows rapidly. Given the high number of
independent moving parts, the workflow between services tends to get complex. Also, distributed tracing
becomes difficult.
The orchestrator centrally manages the resiliency of the workflow and it can become a single point of failure. On
the other hand, for choreography, the role is distributed between all services and resiliency becomes less robust.
Each service is responsible not only for the resiliency of its operation but also for the workflow. This responsibility can
be burdensome for the service and hard to implement. Each service must retry transient, nontransient, and time-
out failures, so that the request terminates gracefully, if needed. Also, the service must be diligent about
communicating the success or failure of the operation so that other services can act accordingly.

Example
This example shows the choreography pattern with the Drone Delivery app. When a client requests a pickup, the
app assigns a drone and notifies the client.

An example of this pattern is available on GitHub.

A single client business transaction requires three distinct business operations: creating or updating a package,
assigning a drone to deliver the package, and checking the delivery status. Those operations are performed by
three microservices: Package, Drone Scheduler, and Delivery services. Instead of a central orchestrator, the
services use messaging to collaborate and coordinate the request among themselves.
Design
The business transaction is processed in a sequence through multiple hops. Each hop has a message bus and the
respective business service.
When a client sends a delivery request through an HTTP endpoint, the Ingestion service receives it, raises an
operation event, and sends it to a message bus. The bus invokes the subscribed business service and sends the
event in a POST request. On receiving the event, the business service can complete the operation with success,
failure, or the request can time out. If successful, the service responds to the bus with the Ok status code, raises a
new operation event, and sends it to the message bus of the next hop. In case of a failure or time-out, the service
reports failure by sending the BadRequest code to the message bus that sent the original POST request. The
message bus retries the operation based on a retry policy. After that period expires, the message bus flags the failed
operation and further processing of the entire request stops.
This workflow continues until the entire request has been processed.
The design uses multiple message buses to process the entire business transaction. Microsoft Azure Event Grid
provides the messaging service. The app is deployed in an Azure Kubernetes Service (AKS) cluster with two
containers in the same pod. One container runs the ambassador that interacts with Event Grid while the other
runs a business service. The approach with two containers in the same pod improves performance and scalability.
The ambassador and the business service share the same network allowing for low latency and high throughput.
To avoid cascading retry operations that might lead to duplicate processing, only Event Grid retries an operation, not the business service. It flags a failed request by sending a message to a dead-letter queue (DLQ).
The business services are idempotent to make sure retry operations don’t result in duplicate resources. For
example, the Package service uses upsert operations to add data to the data store.
The example implements a custom solution to correlate calls across all services and Event Grid hops.
Here’s a code example that shows the choreography pattern between all business services. It shows the workflow
of the Drone Delivery app transactions. Code for exception handling and logging has been removed for brevity.
[HttpPost]
[Route("/api/[controller]/operation")]
[ProducesResponseType(typeof(void), 200)]
[ProducesResponseType(typeof(void), 400)]
[ProducesResponseType(typeof(void), 500)]
public async Task<IActionResult> Post([FromBody] EventGridEvent[] events)
{
    if (events == null)
    {
        return BadRequest("No Event for Choreography");
    }

    foreach (var e in events)
    {
        List<EventGridEvent> listEvents = new List<EventGridEvent>();

        e.Topic = eventRepository.GetTopic();
        e.EventTime = DateTime.Now;
        switch (e.EventType)
        {
            case Operations.ChoreographyOperation.ScheduleDelivery:
            {
                var packageGen = await packageServiceCaller.UpsertPackageAsync(delivery.PackageInfo).ConfigureAwait(false);
                if (packageGen is null)
                {
                    //BadRequest allows the event to be reprocessed by Event Grid
                    return BadRequest("Package creation failed.");
                }

                //we set the event type to the next choreography step
                e.EventType = Operations.ChoreographyOperation.CreatePackage;
                listEvents.Add(e);
                await eventRepository.SendEventAsync(listEvents);
                return Ok("Created Package Completed");
            }
            case Operations.ChoreographyOperation.CreatePackage:
            {
                var droneId = await droneSchedulerServiceCaller.GetDroneIdAsync(delivery).ConfigureAwait(false);
                if (droneId is null)
                {
                    //BadRequest allows the event to be reprocessed by Event Grid
                    return BadRequest("could not get a drone id");
                }

                e.Subject = droneId;
                e.EventType = Operations.ChoreographyOperation.GetDrone;
                listEvents.Add(e);
                await eventRepository.SendEventAsync(listEvents);
                return Ok("Drone Completed");
            }
            case Operations.ChoreographyOperation.GetDrone:
            {
                var deliverySchedule = await deliveryServiceCaller.ScheduleDeliveryAsync(delivery, e.Subject);
                return Ok("Delivery Completed");
            }
            default:
                return BadRequest();
        }
    }

    // Reached only if the events array is empty.
    return BadRequest();
}

Related patterns and guidance


Consider these patterns in your design for choreography.
Modularize the business service by using the ambassador design pattern.
Implement queue-based load leveling pattern to handle spikes of the workload.
Use asynchronous distributed messaging through the publisher-subscriber pattern.
Use compensating transactions to undo a series of successful operations in case one or more related
operations fail.
For information about using a message broker in a messaging infrastructure, see Asynchronous
messaging options in Azure.
Circuit Breaker pattern

Handle faults that might take a variable amount of time to recover from, when connecting to a remote service
or resource. This can improve the stability and resiliency of an application.

Context and problem


In a distributed environment, calls to remote resources and services can fail due to transient faults, such as
slow network connections, timeouts, or the resources being overcommitted or temporarily unavailable. These
faults typically correct themselves after a short period of time, and a robust cloud application should be
prepared to handle them by using a strategy such as the Retry pattern.
However, there can also be situations where faults are due to unanticipated events, and that might take much
longer to fix. These faults can range in severity from a partial loss of connectivity to the complete failure of a
service. In these situations it might be pointless for an application to continually retry an operation that is
unlikely to succeed, and instead the application should quickly accept that the operation has failed and handle
this failure accordingly.
Additionally, if a service is very busy, failure in one part of the system might lead to cascading failures. For
example, an operation that invokes a service could be configured to implement a timeout, and reply with a
failure message if the service fails to respond within this period. However, this strategy could cause many
concurrent requests to the same operation to be blocked until the timeout period expires. These blocked
requests might hold critical system resources such as memory, threads, database connections, and so on.
Consequently, these resources could become exhausted, causing failure of other possibly unrelated parts of
the system that need to use the same resources. In these situations, it would be preferable for the operation to
fail immediately, and only attempt to invoke the service if it's likely to succeed. Note that setting a shorter
timeout might help to resolve this problem, but the timeout shouldn't be so short that the operation fails
most of the time, even if the request to the service would eventually succeed.

Solution
The Circuit Breaker pattern, popularized by Michael Nygard in his book, Release It!, can prevent an application
from repeatedly trying to execute an operation that's likely to fail, allowing it to continue without waiting for
the fault to be fixed or wasting CPU cycles while it determines that the fault is long lasting. The Circuit Breaker
pattern also enables an application to detect whether the fault has been resolved. If the problem appears to
have been fixed, the application can try to invoke the operation.

The purpose of the Circuit Breaker pattern is different from that of the Retry pattern. The Retry pattern enables an
application to retry an operation in the expectation that it'll succeed. The Circuit Breaker pattern prevents
an application from performing an operation that is likely to fail. An application can combine these two
patterns by using the Retry pattern to invoke an operation through a circuit breaker. However, the retry
logic should be sensitive to any exceptions returned by the circuit breaker and abandon retry attempts if
the circuit breaker indicates that a fault is not transient.

A circuit breaker acts as a proxy for operations that might fail. The proxy should monitor the number of recent
failures that have occurred, and use this information to decide whether to allow the operation to proceed, or
simply return an exception immediately.
The proxy can be implemented as a state machine with the following states that mimic the functionality of an
electrical circuit breaker:
Closed : The request from the application is routed to the operation. The proxy maintains a count of the
number of recent failures, and if the call to the operation is unsuccessful the proxy increments this
count. If the number of recent failures exceeds a specified threshold within a given time period, the
proxy is placed into the Open state. At this point the proxy starts a timeout timer, and when this timer
expires the proxy is placed into the Half-Open state.

The purpose of the timeout timer is to give the system time to fix the problem that caused the
failure before allowing the application to try to perform the operation again.

Open : The request from the application fails immediately and an exception is returned to the
application.
Half-Open : A limited number of requests from the application are allowed to pass through and invoke
the operation. If these requests are successful, it's assumed that the fault that was previously causing
the failure has been fixed and the circuit breaker switches to the Closed state (the failure counter is
reset). If any request fails, the circuit breaker assumes that the fault is still present so it reverts back to
the Open state and restarts the timeout timer to give the system a further period of time to recover
from the failure.

The Half-Open state is useful to prevent a recovering service from suddenly being flooded with
requests. As a service recovers, it might be able to support a limited volume of requests until the
recovery is complete, but while recovery is in progress a flood of work can cause the service to
time out or fail again.

In the figure, the failure counter used by the Closed state is time based. It's automatically reset at periodic
intervals. This helps to prevent the circuit breaker from entering the Open state if it experiences occasional
failures. The failure threshold that trips the circuit breaker into the Open state is only reached when a
specified number of failures have occurred during a specified interval. The counter used by the Half-Open
state records the number of successful attempts to invoke the operation. The circuit breaker reverts to the
Closed state after a specified number of consecutive operation invocations have been successful. If any
invocation fails, the circuit breaker enters the Open state immediately and the success counter will be reset
the next time it enters the Half-Open state.

How the system recovers is handled externally, possibly by restoring or restarting a failed component or
repairing a network connection.

The Circuit Breaker pattern provides stability while the system recovers from a failure and minimizes the
impact on performance. It can help to maintain the response time of the system by quickly rejecting a request
for an operation that's likely to fail, rather than waiting for the operation to time out, or never return. If the
circuit breaker raises an event each time it changes state, this information can be used to monitor the health
of the part of the system protected by the circuit breaker, or to alert an administrator when a circuit breaker
trips to the Open state.
The pattern is customizable and can be adapted according to the type of the possible failure. For example, you
can apply an increasing timeout timer to a circuit breaker. You could place the circuit breaker in the Open
state for a few seconds initially, and then if the failure hasn't been resolved increase the timeout to a few
minutes, and so on. In some cases, rather than the Open state returning failure and raising an exception, it
could be useful to return a default value that is meaningful to the application.

Issues and considerations


You should consider the following points when deciding how to implement this pattern:
Exception Handling. An application invoking an operation through a circuit breaker must be prepared to
handle the exceptions raised if the operation is unavailable. The way exceptions are handled will be
application specific. For example, an application could temporarily degrade its functionality, invoke an
alternative operation to try to perform the same task or obtain the same data, or report the exception to the
user and ask them to try again later.
Types of Exceptions. A request might fail for many reasons, some of which might indicate a more severe
type of failure than others. For example, a request might fail because a remote service has crashed and will
take several minutes to recover, or because of a timeout due to the service being temporarily overloaded. A
circuit breaker might be able to examine the types of exceptions that occur and adjust its strategy depending
on the nature of these exceptions. For example, it might require a larger number of timeout exceptions to trip
the circuit breaker to the Open state compared to the number of failures due to the service being completely
unavailable.
Logging. A circuit breaker should log all failed requests (and possibly successful requests) to enable an
administrator to monitor the health of the operation.
Recoverability. You should configure the circuit breaker to match the likely recovery pattern of the operation
it's protecting. For example, if the circuit breaker remains in the Open state for a long period, it could raise
exceptions even if the reason for the failure has been resolved. Similarly, a circuit breaker could fluctuate and
reduce the response times of applications if it switches from the Open state to the Half-Open state too
quickly.
Testing Failed Operations. In the Open state, rather than using a timer to determine when to switch to the
Half-Open state, a circuit breaker can instead periodically ping the remote service or resource to determine
whether it's become available again. This ping could take the form of an attempt to invoke an operation that
had previously failed, or it could use a special operation provided by the remote service specifically for testing
the health of the service, as described by the Health Endpoint Monitoring pattern.
Manual Override. In a system where the recovery time for a failing operation is extremely variable, it's
beneficial to provide a manual reset option that enables an administrator to close a circuit breaker (and reset
the failure counter). Similarly, an administrator could force a circuit breaker into the Open state (and restart
the timeout timer) if the operation protected by the circuit breaker is temporarily unavailable.
Concurrency. The same circuit breaker could be accessed by a large number of concurrent instances of an
application. The implementation shouldn't block concurrent requests or add excessive overhead to each call to
an operation.
Resource Differentiation. Be careful when using a single circuit breaker for one type of resource if there
might be multiple underlying independent providers. For example, in a data store that contains multiple
shards, one shard might be fully accessible while another is experiencing a temporary issue. If the error
responses in these scenarios are merged, an application might try to access some shards even when failure is
highly likely, while access to other shards might be blocked even though it's likely to succeed.
Accelerated Circuit Breaking. Sometimes a failure response can contain enough information for the circuit
breaker to trip immediately and stay tripped for a minimum amount of time. For example, the error response
from a shared resource that's overloaded could indicate that an immediate retry isn't recommended and that
the application should instead try again in a few minutes.

NOTE
A service can return HTTP 429 (Too Many Requests) if it is throttling the client, or HTTP 503 (Service Unavailable) if the
service is not currently available. The response can include additional information, such as the anticipated duration of
the delay.

Replaying Failed Requests. In the Open state, rather than simply failing quickly, a circuit breaker could also
record the details of each request to a journal and arrange for these requests to be replayed when the remote
resource or service becomes available.
Inappropriate Timeouts on External Services. A circuit breaker might not be able to fully protect
applications from operations that fail in external services that are configured with a lengthy timeout period. If
the timeout is too long, a thread running a circuit breaker might be blocked for an extended period before the
circuit breaker indicates that the operation has failed. In this time, many other application instances might also
try to invoke the service through the circuit breaker and tie up a significant number of threads before they all
fail.

When to use this pattern


Use this pattern:
To prevent an application from trying to invoke a remote service or access a shared resource if this
operation is highly likely to fail.
This pattern isn't recommended:
For handling access to local private resources in an application, such as an in-memory data structure. In this
environment, using a circuit breaker would add overhead to your system.
As a substitute for handling exceptions in the business logic of your applications.

Example
In a web application, several of the pages are populated with data retrieved from an external service. If the
system implements minimal caching, most hits to these pages will cause a round trip to the service.
Connections from the web application to the service could be configured with a timeout period (typically 60
seconds), and if the service doesn't respond in this time the logic in each web page will assume that the
service is unavailable and throw an exception.
However, if the service fails and the system is very busy, users could be forced to wait for up to 60 seconds
before an exception occurs. Eventually resources such as memory, connections, and threads could be
exhausted, preventing other users from connecting to the system, even if they aren't accessing pages that
retrieve data from the service.
Scaling the system by adding further web servers and implementing load balancing might delay when
resources become exhausted, but it won't resolve the issue because user requests will still be unresponsive
and all web servers could still eventually run out of resources.
Wrapping the logic that connects to the service and retrieves the data in a circuit breaker could help to solve
this problem and handle the service failure more elegantly. User requests will still fail, but they'll fail more
quickly and the resources won't be blocked.
The CircuitBreaker class maintains state information about a circuit breaker in an object that implements the ICircuitBreakerStateStore interface shown in the following code.

interface ICircuitBreakerStateStore
{
    CircuitBreakerStateEnum State { get; }

    Exception LastException { get; }

    DateTime LastStateChangedDateUtc { get; }

    void Trip(Exception ex);

    void Reset();

    void HalfOpen();

    bool IsClosed { get; }
}

The State property indicates the current state of the circuit breaker, and will be either Open , HalfOpen , or
Closed as defined by the CircuitBreakerStateEnum enumeration. The IsClosed property should be true if the
circuit breaker is closed, but false if it's open or half open. The Trip method switches the state of the circuit
breaker to the open state and records the exception that caused the change in state, together with the date
and time that the exception occurred. The LastException and the LastStateChangedDateUtc properties return
this information. The Reset method closes the circuit breaker, and the HalfOpen method sets the circuit
breaker to half open.
The InMemoryCircuitBreakerStateStore class in the example contains an implementation of the
ICircuitBreakerStateStore interface. The CircuitBreaker class creates an instance of this class to hold the
state of the circuit breaker.
The ExecuteAction method in the CircuitBreaker class wraps an operation, specified as an Action delegate.
If the circuit breaker is closed, ExecuteAction invokes the Action delegate. If the operation fails, an exception
handler calls TrackException , which sets the circuit breaker state to open. The following code example
highlights this flow.
public class CircuitBreaker
{
    private readonly ICircuitBreakerStateStore stateStore =
        CircuitBreakerStateStoreFactory.GetCircuitBreakerStateStore();

    private readonly object halfOpenSyncObject = new object();
    ...

    public bool IsClosed { get { return stateStore.IsClosed; } }

    public bool IsOpen { get { return !IsClosed; } }

    public void ExecuteAction(Action action)
    {
        ...
        if (IsOpen)
        {
            // The circuit breaker is Open.
            ... (see code sample below for details)
        }

        // The circuit breaker is Closed, execute the action.
        try
        {
            action();
        }
        catch (Exception ex)
        {
            // If an exception still occurs here, simply
            // retrip the breaker immediately.
            this.TrackException(ex);

            // Throw the exception so that the caller can tell
            // the type of exception that was thrown.
            throw;
        }
    }

    private void TrackException(Exception ex)
    {
        // For simplicity in this example, open the circuit breaker on the first exception.
        // In reality this would be more complex. A certain type of exception, such as one
        // that indicates a service is offline, might trip the circuit breaker immediately.
        // Alternatively it might count exceptions locally or across multiple instances and
        // use this value over time, or the exception/success ratio based on the exception
        // types, to open the circuit breaker.
        this.stateStore.Trip(ex);
    }
}

The following example shows the code (omitted from the previous example) that is executed if the circuit
breaker isn't closed. It first checks if the circuit breaker has been open for a period longer than the time
specified by the local OpenToHalfOpenWaitTime field in the CircuitBreaker class. If this is the case, the
ExecuteAction method sets the circuit breaker to half open, then tries to perform the operation specified by
the Action delegate.
If the operation is successful, the circuit breaker is reset to the closed state. If the operation fails, it is tripped
back to the open state and the time the exception occurred is updated so that the circuit breaker will wait for a
further period before trying to perform the operation again.
If the circuit breaker has only been open for a short time, less than the OpenToHalfOpenWaitTime value, the
ExecuteAction method simply throws a CircuitBreakerOpenException exception and returns the error that
caused the circuit breaker to transition to the open state.
Additionally, it uses a lock to prevent the circuit breaker from trying to perform concurrent calls to the
operation while it's half open. A concurrent attempt to invoke the operation will be handled as if the circuit
breaker was open, and it'll fail with an exception as described later.

...
if (IsOpen)
{
    // The circuit breaker is Open. Check if the Open timeout has expired.
    // If it has, set the state to HalfOpen. Another approach might be to
    // check for the HalfOpen state that had been set by some other operation.
    if (stateStore.LastStateChangedDateUtc + OpenToHalfOpenWaitTime < DateTime.UtcNow)
    {
        // The Open timeout has expired. Allow one operation to execute. Note that, in
        // this example, the circuit breaker is set to HalfOpen after being
        // in the Open state for some period of time. An alternative would be to set
        // this using some other approach such as a timer, test method, manually, and
        // so on, and check the state here to determine how to handle execution
        // of the action.
        // Limit the number of threads to be executed when the breaker is HalfOpen.
        // An alternative would be to use a more complex approach to determine which
        // threads or how many are allowed to execute, or to execute a simple test
        // method instead.
        bool lockTaken = false;
        try
        {
            Monitor.TryEnter(halfOpenSyncObject, ref lockTaken);
            if (lockTaken)
            {
                // Set the circuit breaker state to HalfOpen.
                stateStore.HalfOpen();

                // Attempt the operation.
                action();

                // If this action succeeds, reset the state and allow other operations.
                // In reality, instead of immediately returning to the Closed state, a counter
                // here would record the number of successful operations and return the
                // circuit breaker to the Closed state only after a specified number succeed.
                this.stateStore.Reset();
                return;
            }
        }
        catch (Exception ex)
        {
            // If there's still an exception, trip the breaker again immediately.
            this.stateStore.Trip(ex);

            // Throw the exception so that the caller knows which exception occurred.
            throw;
        }
        finally
        {
            if (lockTaken)
            {
                Monitor.Exit(halfOpenSyncObject);
            }
        }
    }
    // The Open timeout hasn't yet expired. Throw a CircuitBreakerOpen exception to
    // inform the caller that the call was not actually attempted,
    // and return the most recent exception received.
    throw new CircuitBreakerOpenException(stateStore.LastException);
}
...

To use a CircuitBreaker object to protect an operation, an application creates an instance of the CircuitBreaker class and invokes the ExecuteAction method, specifying the operation to be performed as
the parameter. The application should be prepared to catch the CircuitBreakerOpenException exception if the
operation fails because the circuit breaker is open. The following code shows an example:

var breaker = new CircuitBreaker();

try
{
    breaker.ExecuteAction(() =>
    {
        // Operation protected by the circuit breaker.
        ...
    });
}
catch (CircuitBreakerOpenException ex)
{
    // Perform some different action when the breaker is open.
    // Last exception details are in the inner exception.
    ...
}
catch (Exception ex)
{
    ...
}
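
As noted earlier, the Retry and Circuit Breaker patterns are often combined, with the retry logic abandoning attempts while the circuit is open. The following sketch shows one way to express that combination using the Polly library instead of the hand-rolled class above. The thresholds, timings, and endpoint are illustrative assumptions, not recommendations.

// A sketch of Retry wrapping a circuit breaker with the Polly library; values are illustrative.
using Polly;
using Polly.CircuitBreaker;

// Break the circuit after 5 consecutive failures and keep it open for 30 seconds.
var circuitBreaker = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(exceptionsAllowedBeforeBreaking: 5, durationOfBreak: TimeSpan.FromSeconds(30));

// Retry transient failures with exponential backoff. BrokenCircuitException is not
// handled here, so retries stop as soon as the circuit breaker reports an open circuit.
var retry = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

// Retry wraps the circuit breaker, so each attempt passes through the breaker.
var policy = retry.WrapAsync(circuitBreaker);

using var httpClient = new HttpClient();
try
{
    var response = await policy.ExecuteAsync(() => httpClient.GetAsync("https://example.com/api/orders"));
}
catch (BrokenCircuitException)
{
    // The circuit is open; fail fast, fall back to cached data, or surface a friendly error.
}
catch (HttpRequestException)
{
    // Retries were exhausted while the circuit was still closed.
}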

Related patterns and guidance


The following patterns might also be useful when implementing this pattern:
Retry pattern. Describes how an application can handle anticipated temporary failures when it tries to
connect to a service or network resource by transparently retrying an operation that has previously
failed.
Health Endpoint Monitoring pattern. A circuit breaker might be able to test the health of a service by
sending a request to an endpoint exposed by the service. The service should return information
indicating its status.
Claim-Check Pattern

Split a large message into a claim check and a payload. Send the claim check to the messaging platform and store
the payload to an external service. This pattern allows large messages to be processed, while protecting the
message bus and the client from being overwhelmed or slowed down. This pattern also helps to reduce costs, as
storage is usually cheaper than resource units used by the messaging platform.
This pattern is also known as Reference-Based Messaging, and was originally described in the book Enterprise
Integration Patterns, by Gregor Hohpe and Bobby Woolf.

Context and problem


A messaging-based architecture at some point must be able to send, receive, and manipulate large messages.
Such messages may contain anything, including images (for example, MRI scans), sound files (for example, call-
center calls), text documents, or any kind of binary data of arbitrary size.
Sending such large messages to the message bus directly is not recommended, because they consume more resources and bandwidth. Large messages can also slow down the entire solution, because
messaging platforms are usually fine-tuned to handle huge quantities of small messages. Also, most messaging
platforms have limits on message size, so you may need to work around these limits for large messages.

Solution
Store the entire message payload into an external service, such as a database. Get the reference to the stored
payload, and send just that reference to the message bus. The reference acts like a claim check used to retrieve a
piece of luggage, hence the name of the pattern. Clients interested in processing that specific message can use the
obtained reference to retrieve the payload, if needed.
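
The following sender-side sketch illustrates the idea using Azure Blob Storage for the payload and an Azure Storage queue as the message bus, via the Azure.Storage.Blobs and Azure.Storage.Queues client libraries. The container, queue, and file names are assumptions for this example; the Examples section later in this article describes other implementations, including plugin-based and Event Grid-based approaches.

// A sender-side sketch using the Azure.Storage.Blobs and Azure.Storage.Queues libraries.
// Container, queue, and file names are illustrative assumptions.
using System.Text;
using Azure.Storage.Blobs;
using Azure.Storage.Queues;

string storageConnectionString = Environment.GetEnvironmentVariable("STORAGE_CONNECTION_STRING");
string largePayload = File.ReadAllText("large-order.json"); // any large message body

// 1. Store the payload in an external service (here, Blob Storage).
var container = new BlobContainerClient(storageConnectionString, "payloads");
await container.CreateIfNotExistsAsync();
string blobName = Guid.NewGuid().ToString();
using (var payloadStream = new MemoryStream(Encoding.UTF8.GetBytes(largePayload)))
{
    await container.GetBlobClient(blobName).UploadAsync(payloadStream);
}

// 2. Send only the claim check (the blob name) to the message bus.
var queue = new QueueClient(storageConnectionString, "claim-checks");
await queue.CreateIfNotExistsAsync();
await queue.SendMessageAsync(blobName);

// A receiver later reads the claim check from the queue, downloads the payload from
// the same container, and can delete the blob once the message has been processed.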

Issues and considerations


Consider the following points when deciding how to implement this pattern:
Consider deleting the message data after consuming it, if you don't need to archive the messages. Although
blob storage is relatively cheap, it costs some money in the long run, especially if there is a lot of data.
Deleting the message can be done synchronously by the application that receives and processes the
message, or asynchronously by a separate dedicated process. The asynchronous approach removes old
data with no impact on the throughput and message processing performance of the receiving application.
Storing and retrieving the message causes some additional overhead and latency. You may want to
implement logic in the sending application to use this pattern only when the message size exceeds the data
limit of the message bus. The pattern would be skipped for smaller messages. This approach would result in
a conditional claim-check pattern.

When to use this pattern


This pattern should be used whenever a message cannot fit the supported message limit of the chosen message
bus technology. For example, Event Hubs currently has a limit of 256 KB (Basic Tier), while Event Grid supports
only 64-KB messages.
The pattern can also be used if the payload should be accessed only by services that are authorized to see it. By
offloading the payload to an external resource, stricter authentication and authorization rules can be put in place,
to ensure that security is enforced when sensitive data is stored in the payload.

Examples
On Azure, this pattern can be implemented in several ways and with different technologies, but there are two main
categories. In both cases, the receiver has the responsibility to read the claim check and use it to retrieve the
payload.
Automatic claim-check generation. This approach uses Azure Event Grid to automatically generate the
claim check and push it into the message bus.
Manual claim-check generation. In this approach, the sender is responsible for managing the payload.
The sender stores the payload using the appropriate service, gets or generates the claim check, and sends
the claim check to the message bus.
Event Grid is an event routing service and tries to deliver events within a configurable amount of time up to 24
hours. After that, events are either discarded or dead lettered. If you need to archive the event payloads or replay
the event stream, you can add an Event Grid subscription to Event Hubs or Queue Storage, where messages can be
retained for longer periods and archiving messages is supported. For information about fine tuning Event Grid
message delivery and retry, and dead letter configuration, see Dead letter and retry policies.
Automatic claim-check generation with Blob Storage and Event Grid
In this approach, the sender drops the message payload into a designated Azure Blob Storage container. Event Grid
automatically generates a tag/reference and sends it to a supported message bus, such as Azure Storage Queues.
The receiver can poll the queue, get the message, and then use the stored reference data to download the payload
directly from Blob Storage.
The same Event Grid message can be directly consumed by Azure Functions, without needing to go through a
message bus. This approach takes full advantage of the serverless nature of both Event Grid and Functions.
You can find example code for this approach here.
Event Grid with Event Hubs
Similar to the previous example, Event Grid automatically generates a message when a payload is written to an
Azure Blob container. But in this example, the message bus is implemented using Event Hubs. A client can register
itself to receive the stream of messages as they are written to the event hub. The event hub can also be configured
to archive messages, making them available as an Avro file that can be queried using tools like Apache Spark,
Apache Drill, or any of the available Avro libraries.
You can find example code for this approach here.
Claim check generation with Service Bus
This solution takes advantage of a specific Service Bus plugin, ServiceBus.AttachmentPlugin, which makes the
claim-check workflow easy to implement. The plugin converts any message body into an attachment that gets
stored in Azure Blob Storage when the message is sent.

using ServiceBus.AttachmentPlugin;
...

// Getting connection information
var serviceBusConnectionString = Environment.GetEnvironmentVariable("SERVICE_BUS_CONNECTION_STRING");
var queueName = Environment.GetEnvironmentVariable("QUEUE_NAME");
var storageConnectionString = Environment.GetEnvironmentVariable("STORAGE_CONNECTION_STRING");

// Creating config for sending message
var config = new AzureStorageAttachmentConfiguration(storageConnectionString);

// Creating and registering the sender using Service Bus Connection String and Queue Name
var sender = new MessageSender(serviceBusConnectionString, queueName);
sender.RegisterAzureStorageAttachmentPlugin(config);

// Create payload
var payload = new { data = "random data string for testing" };
var serialized = JsonConvert.SerializeObject(payload);
var payloadAsBytes = Encoding.UTF8.GetBytes(serialized);
var message = new Message(payloadAsBytes);

// Send the message
await sender.SendAsync(message);

The Service Bus message acts as a notification queue, which a client can subscribe to. When the consumer receives
the message, the plugin makes it possible to directly read the message data from Blob Storage. You can then
choose how to process the message further. An advantage of this approach is that it abstracts the claim-check
workflow from the sender and receiver.
You can find example code for this approach here.
Manual claim-check generation with Kafka
In this example, a Kafka client writes the payload to Azure Blob Storage. Then it sends a notification message using
Kafka-enabled Event Hubs. The consumer receives the message and can access the payload from Blob Storage.
This example shows how a different messaging protocol can be used to implement the claim-check pattern in
Azure. For example, you might need to support existing Kafka clients.
You can find example code for this approach here.
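As a rough sketch of the producer side, assuming the Confluent.Kafka client and the Azure.Storage.Blobs SDK, with hypothetical names for the container, topic, and connection variables (the linked sample may differ):

// Hypothetical sketch: store the payload in Blob Storage, then publish only a
// reference to it through a Kafka-enabled Event Hubs namespace.
var container = new BlobContainerClient(storageConnectionString, "payloads");
var blob = container.GetBlobClient($"{Guid.NewGuid()}.json");
await blob.UploadAsync(BinaryData.FromString(largePayloadJson));

var config = new ProducerConfig
{
    BootstrapServers = $"{eventHubsNamespace}.servicebus.windows.net:9093",
    SecurityProtocol = SecurityProtocol.SaslSsl,
    SaslMechanism = SaslMechanism.Plain,
    SaslUsername = "$ConnectionString",
    SaslPassword = eventHubsConnectionString
};

using var producer = new ProducerBuilder<Null, string>(config).Build();

// The Kafka message carries only the claim check (the blob URI), not the payload itself.
await producer.ProduceAsync("claim-check-topic",
    new Message<Null, string> { Value = blob.Uri.ToString() });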

Related patterns and guidance


The examples described above are available on GitHub.
The Enterprise Integration Patterns site has a description of this pattern.
For another example, see Dealing with large Service Bus messages using claim check pattern (blog post).
An alternative pattern for handling large messages is Split and Aggregate.
Command and Query Responsibility Segregation
(CQRS) pattern

The Command and Query Responsibility Segregation (CQRS) pattern separates read and update operations for a
data store. Implementing CQRS in your application can maximize its performance, scalability, and security. The
flexibility created by migrating to CQRS allows a system to better evolve over time and prevents update
commands from causing merge conflicts at the domain level.

The problem
In traditional architectures, the same data model is used to query and update a database. That's simple and works
well for basic CRUD operations. In more complex applications, however, this approach can become unwieldy. For
example, on the read side, the application may perform many different queries, returning data transfer objects
(DTOs) with different shapes. Object mapping can become complicated. On the write side, the model may
implement complex validation and business logic. As a result, you can end up with an overly complex model that
does too much.
Read and write workloads are often asymmetrical, with very different performance and scale requirements.

There is often a mismatch between the read and write representations of the data, such as additional
columns or properties that must be updated correctly even though they aren't required as part of an
operation.
Data contention can occur when operations are performed in parallel on the same set of data.
The traditional approach can have a negative effect on performance due to load on the data store and data
access layer, and the complexity of queries required to retrieve information.
Managing security and permissions can become complex, because each entity is subject to both read and
write operations, which might expose data in the wrong context.

Solution
CQRS separates reads and writes into different models, using commands to update data, and queries to read
data.
Commands should be task based, rather than data centric. ("Book hotel room", not "set ReservationStatus to
Reserved").
Commands may be placed on a queue for asynchronous processing, rather than being processed
synchronously.
Queries never modify the database. A query returns a DTO that does not encapsulate any domain knowledge.
The models can then be isolated, as shown in the following diagram, although that's not an absolute requirement.

Having separate query and update models simplifies the design and implementation. However, one disadvantage
is that CQRS code can't automatically be generated from a database schema using scaffolding mechanisms such
as O/RM tools.
For greater isolation, you can physically separate the read data from the write data. In that case, the read database
can use its own data schema that is optimized for queries. For example, it can store a materialized view of the
data, in order to avoid complex joins or complex O/RM mappings. It might even use a different type of data store.
For example, the write database might be relational, while the read database is a document database.
If separate read and write databases are used, they must be kept in sync. Typically this is accomplished by having
the write model publish an event whenever it updates the database. Updating the database and publishing the
event must occur in a single transaction.
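As a rough illustration of that approach (the types shown here are hypothetical and are not part of the example later in this article), the write model might publish an event after saving an aggregate, and a subscriber might apply the change to the read store:

// Hypothetical sketch: the write model publishes an event whenever it updates the database,
// and a subscriber applies the change to the read store. All types here are illustrative.
public record ProductRated(int ProductId, int UserId, int Rating);

public interface IEventPublisher
{
    Task PublishAsync(object @event);
}

public interface IProductViewStore
{
    Task UpdateUserRatingAsync(int productId, int rating);
}

// Write side: after the domain change is persisted, publish the event.
// Persisting and publishing must happen atomically (for example, via a transactional outbox).
public class ProductWriteModel
{
    private readonly IEventPublisher publisher;
    public ProductWriteModel(IEventPublisher publisher) => this.publisher = publisher;

    public async Task RateProductAsync(int productId, int userId, int rating)
    {
        // ... update and save the Product aggregate in the write database ...
        await publisher.PublishAsync(new ProductRated(productId, userId, rating));
    }
}

// Read side: a subscriber keeps the denormalized (materialized) view up to date.
public class ProductViewUpdater
{
    private readonly IProductViewStore viewStore;
    public ProductViewUpdater(IProductViewStore viewStore) => this.viewStore = viewStore;

    public Task HandleAsync(ProductRated e) => viewStore.UpdateUserRatingAsync(e.ProductId, e.Rating);
}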

The read store can be a read-only replica of the write store, or the read and write stores can have a different
structure altogether. Using multiple read-only replicas can increase query performance, especially in distributed
scenarios where read-only replicas are located close to the application instances.
Separation of the read and write stores also allows each to be scaled appropriately to match the load. For
example, read stores typically encounter a much higher load than write stores.
Some implementations of CQRS use the Event Sourcing pattern. With this pattern, application state is stored as a
sequence of events. Each event represents a set of changes to the data. The current state is constructed by
replaying the events. In a CQRS context, one benefit of Event Sourcing is that the same events can be used to
notify other components — in particular, to notify the read model. The read model uses the events to create a
snapshot of the current state, which is more efficient for queries. However, Event Sourcing adds complexity to the
design.
Benefits of CQRS include:
Independent scaling. CQRS allows the read and write workloads to scale independently, and may result in
fewer lock contentions.
Optimized data schemas. The read side can use a schema that is optimized for queries, while the write side
uses a schema that is optimized for updates.
Security. It's easier to ensure that only the right domain entities are performing writes on the data.
Separation of concerns. Segregating the read and write sides can result in models that are more
maintainable and flexible. Most of the complex business logic goes into the write model. The read model can
be relatively simple.
Simpler queries. By storing a materialized view in the read database, the application can avoid complex joins
when querying.

Implementation issues and considerations


Some challenges of implementing this pattern include:
Complexity. The basic idea of CQRS is simple. But it can lead to a more complex application design,
especially if it includes the Event Sourcing pattern.
Messaging. Although CQRS does not require messaging, it's common to use messaging to process
commands and publish update events. In that case, the application must handle message failures or
duplicate messages.
Eventual consistency. If you separate the read and write databases, the read data may be stale. The read
model store must be updated to reflect changes to the write model store, and it can be difficult to detect
when a user has issued a request based on stale read data.

When to use this pattern


Consider CQRS for the following scenarios:
Collaborative domains where many users access the same data in parallel. CQRS allows you to define
commands with enough granularity to minimize merge conflicts at the domain level, and conflicts that do
arise can be merged by the command.
Task-based user interfaces where users are guided through a complex process as a series of steps or with
complex domain models. The write model has a full command-processing stack with business logic, input
validation, and business validation. The write model may treat a set of associated objects as a single unit
for data changes (an aggregate, in DDD terminology) and ensure that these objects are always in a
consistent state. The read model has no business logic or validation stack, and just returns a DTO for use in
a view model. The read model is eventually consistent with the write model.
Scenarios where performance of data reads must be fine-tuned separately from performance of data
writes, especially when the number of reads is much greater than the number of writes. In this scenario,
you can scale out the read model, but run the write model on just a few instances. A small number of write
model instances also helps to minimize the occurrence of merge conflicts.
Scenarios where one team of developers can focus on the complex domain model that is part of the write
model, and another team can focus on the read model and the user interfaces.
Scenarios where the system is expected to evolve over time and might contain multiple versions of the
model, or where business rules change regularly.
Integration with other systems, especially in combination with event sourcing, where the temporal failure
of one subsystem shouldn't affect the availability of the others.
This pattern isn't recommended when:
The domain or the business rules are simple.
A simple CRUD-style user interface and data access operations are sufficient.
Consider applying CQRS to limited sections of your system where it will be most valuable.

Event Sourcing and CQRS


The CQRS pattern is often used along with the Event Sourcing pattern. CQRS-based systems use separate read
and write data models, each tailored to relevant tasks and often located in physically separate stores. When used
with the Event Sourcing pattern, the store of events is the write model, and is the official source of information.
The read model of a CQRS-based system provides materialized views of the data, typically as highly
denormalized views. These views are tailored to the interfaces and display requirements of the application, which
helps to maximize both display and query performance.
Using the stream of events as the write store, rather than the actual data at a point in time, avoids update conflicts
on a single aggregate and maximizes performance and scalability. The events can be used to asynchronously
generate materialized views of the data that are used to populate the read store.
Because the event store is the official source of information, it is possible to delete the materialized views and
replay all past events to create a new representation of the current state when the system evolves, or when the
read model must change. The materialized views are in effect a durable read-only cache of the data.
When using CQRS combined with the Event Sourcing pattern, consider the following:
As with any system where the write and read stores are separate, systems based on this pattern are only
eventually consistent. There will be some delay between the event being generated and the data store
being updated.
The pattern adds complexity because code must be created to initiate and handle events, and assemble or
update the appropriate views or objects required by queries or a read model. The complexity of the CQRS
pattern when used with the Event Sourcing pattern can make a successful implementation more difficult,
and requires a different approach to designing systems. However, event sourcing can make it easier to
model the domain, and makes it easier to rebuild views or create new ones because the intent of the
changes in the data is preserved.
Generating materialized views for use in the read model or projections of the data by replaying and
handling the events for specific entities or collections of entities can require significant processing time
and resource usage. This is especially true if it requires summation or analysis of values over long periods,
because all the associated events might need to be examined. Resolve this by implementing snapshots of
the data at scheduled intervals, such as a total count of the number of times a specific action has occurred,
or the current state of an entity, as illustrated in the sketch below.
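A minimal sketch of that idea, using hypothetical types for a product-rating projection that is rebuilt from a snapshot plus the events recorded after it:

// Hypothetical event and projection types; not part of an Azure SDK.
public record ProductRated(int ProductId, int UserId, int Rating);

public class ProductRatingProjection
{
    public int ProductId { get; set; }
    public int RatingCount { get; set; }
    public double AverageRating { get; set; }

    // Apply a single event to the projection.
    public void Apply(ProductRated e)
    {
        AverageRating = ((AverageRating * RatingCount) + e.Rating) / (RatingCount + 1);
        RatingCount++;
    }
}

public static class ProjectionRebuilder
{
    // Rebuild the read-model view from the latest snapshot plus the events recorded
    // after it, instead of replaying the entire event stream from the beginning.
    public static ProductRatingProjection Rehydrate(
        ProductRatingProjection snapshot,
        IEnumerable<ProductRated> eventsSinceSnapshot)
    {
        foreach (var e in eventsSinceSnapshot)
        {
            snapshot.Apply(e);
        }
        return snapshot;
    }
}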

Example
The following code shows some extracts from an example of a CQRS implementation that uses different
definitions for the read and the write models. The model interfaces don't dictate any features of the underlying
data stores, and they can evolve and be fine-tuned independently because these interfaces are separated.
The following code shows the read model definition.
// Query interface
namespace ReadModel
{
    public interface ProductsDao
    {
        ProductDisplay FindById(int productId);
        ICollection<ProductDisplay> FindByName(string name);
        ICollection<ProductInventory> FindOutOfStockProducts();
        ICollection<ProductDisplay> FindRelatedProducts(int productId);
    }

    public class ProductDisplay
    {
        public int Id { get; set; }
        public string Name { get; set; }
        public string Description { get; set; }
        public decimal UnitPrice { get; set; }
        public bool IsOutOfStock { get; set; }
        public double UserRating { get; set; }
    }

    public class ProductInventory
    {
        public int Id { get; set; }
        public string Name { get; set; }
        public int CurrentStock { get; set; }
    }
}

The system allows users to rate products. The application code does this using the RateProduct command shown
in the following code.

public interface ICommand
{
    Guid Id { get; }
}

public class RateProduct : ICommand
{
    public RateProduct()
    {
        this.Id = Guid.NewGuid();
    }
    public Guid Id { get; set; }
    public int ProductId { get; set; }
    public int Rating { get; set; }
    public int UserId { get; set; }
}

The system uses the ProductsCommandHandler class to handle commands sent by the application. Clients typically
send commands to the domain through a messaging system such as a queue. The command handler accepts
these commands and invokes methods of the domain interface. The granularity of each command is designed to
reduce the chance of conflicting requests. The following code shows an outline of the ProductsCommandHandler
class.
public class ProductsCommandHandler :
    ICommandHandler<AddNewProduct>,
    ICommandHandler<RateProduct>,
    ICommandHandler<AddToInventory>,
    ICommandHandler<ConfirmItemShipped>,
    ICommandHandler<UpdateStockFromInventoryRecount>
{
    private readonly IRepository<Product> repository;

    public ProductsCommandHandler(IRepository<Product> repository)
    {
        this.repository = repository;
    }

    void Handle(AddNewProduct command)
    {
        ...
    }

    void Handle(RateProduct command)
    {
        var product = repository.Find(command.ProductId);
        if (product != null)
        {
            product.RateProduct(command.UserId, command.Rating);
            repository.Save(product);
        }
    }

    void Handle(AddToInventory command)
    {
        ...
    }

    void Handle(ConfirmItemShipped command)
    {
        ...
    }

    void Handle(UpdateStockFromInventoryRecount command)
    {
        ...
    }
}
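The ICommandHandler<T> interface itself isn't shown in the extract. A minimal definition consistent with the handler above might be:

// Hypothetical sketch; the actual sample may define this differently.
public interface ICommandHandler<in TCommand> where TCommand : ICommand
{
    void Handle(TCommand command);
}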

Related patterns and guidance


The following patterns and guidance are useful when implementing this pattern:
Data Consistency Primer. Explains the issues that are typically encountered due to eventual consistency
between the read and write data stores when using the CQRS pattern, and how these issues can be
resolved.
Data Partitioning Guidance. Describes best practices for dividing data into partitions that can be managed
and accessed separately to improve scalability, reduce contention, and optimize performance.
Event Sourcing pattern. Describes in more detail how Event Sourcing can be used with the CQRS pattern to
simplify tasks in complex domains while improving performance, scalability, and responsiveness. It also
explains how to provide consistency for transactional data while maintaining full audit trails and history that
can enable compensating actions.
Materialized View pattern. The read model of a CQRS implementation can contain materialized views of
the write model data, or the read model can be used to generate materialized views.
The patterns & practices guide CQRS Journey. In particular, Introducing the Command Query
Responsibility Segregation pattern explores the pattern and when it's useful, and Epilogue: Lessons
Learned helps you understand some of the issues that come up when using this pattern.
The post CQRS by Martin Fowler, which explains the basics of the pattern and links to other useful
resources.
Compensating Transaction pattern

Undo the work performed by a series of steps, which together define an eventually consistent operation, if one
or more of the steps fail. Operations that follow the eventual consistency model are commonly found in cloud-
hosted applications that implement complex business processes and workflows.

Context and problem


Applications running in the cloud frequently modify data. This data might be spread across various data sources
held in different geographic locations. To avoid contention and improve performance in a distributed
environment, an application shouldn't try to provide strong transactional consistency. Rather, the application
should implement eventual consistency. In this model, a typical business operation consists of a series of
separate steps. While these steps are being performed, the overall view of the system state might be
inconsistent, but when the operation has completed and all of the steps have been executed the system should
become consistent again.

The Data Consistency Primer provides information about why distributed transactions don't scale well, and
the principles of the eventual consistency model.

A challenge in the eventual consistency model is how to handle a step that has failed. In this case it might be
necessary to undo all of the work completed by the previous steps in the operation. However, the data can't
simply be rolled back because other concurrent instances of the application might have changed it. Even in
cases where the data hasn't been changed by a concurrent instance, undoing a step might not simply be a
matter of restoring the original state. It might be necessary to apply various business-specific rules (see the
travel website described in the Example section).
If an operation that implements eventual consistency spans several heterogeneous data stores, undoing the
steps in the operation will require visiting each data store in turn. The work performed in every data store must
be undone reliably to prevent the system from remaining inconsistent.
Not all data affected by an operation that implements eventual consistency might be held in a database. In a
service oriented architecture (SOA) environment an operation could invoke an action in a service, and cause a
change in the state held by that service. To undo the operation, this state change must also be undone. This can
involve invoking the service again and performing another action that reverses the effects of the first.

Solution
The solution is to implement a compensating transaction. The steps in a compensating transaction must undo
the effects of the steps in the original operation. A compensating transaction might not be able to simply
replace the current state with the state the system was in at the start of the operation because this approach
could overwrite changes made by other concurrent instances of an application. Instead, it must be an intelligent
process that takes into account any work done by concurrent instances. This process will usually be application
specific, driven by the nature of the work performed by the original operation.
A common approach is to use a workflow to implement an eventually consistent operation that requires
compensation. As the original operation proceeds, the system records information about each step and how the
work performed by that step can be undone. If the operation fails at any point, the workflow rewinds back
through the steps it's completed and performs the work that reverses each step. Note that a compensating
transaction might not have to undo the work in the exact reverse order of the original operation, and it might be
possible to perform some of the undo steps in parallel.

This approach is similar to the Sagas strategy discussed in Clemens Vasters’ blog.

A compensating transaction is also an eventually consistent operation and it could also fail. The system should
be able to resume the compensating transaction at the point of failure and continue. It might be necessary to
repeat a step that's failed, so the steps in a compensating transaction should be defined as idempotent
commands. For more information, see Idempotency Patterns on Jonathan Oliver’s blog.
In some cases it might not be possible to recover from a step that has failed except through manual
intervention. In these situations the system should raise an alert and provide as much information as possible
about the reason for the failure.
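To make the workflow idea above concrete, here's a minimal sketch (all type names are hypothetical); each step records an idempotent undo action, and the completed steps are rewound if a later step fails:

// Hypothetical sketch of a compensating workflow; not a production saga implementation.
public record WorkStep(
    string Name,
    Func<Task> DoAsync,            // forward action
    Func<Task> CompensateAsync);   // idempotent undo action

public static class CompensatingWorkflow
{
    public static async Task RunAsync(IEnumerable<WorkStep> steps)
    {
        var completed = new Stack<WorkStep>();
        try
        {
            foreach (var step in steps)
            {
                await step.DoAsync();
                completed.Push(step);   // record how to undo this step
            }
        }
        catch
        {
            // Rewind through the completed steps. The order (and any parallelism)
            // is application specific; reverse order is used here for simplicity.
            while (completed.Count > 0)
            {
                await completed.Pop().CompensateAsync();
            }
            throw;
        }
    }
}

In the travel itinerary example later in this article, the undo action for booking a seat on flight F1 would apply the airline's cancellation rules rather than simply deleting the booking.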

Issues and considerations


Consider the following points when deciding how to implement this pattern:
It might not be easy to determine when a step in an operation that implements eventual consistency has failed.
A step might not fail immediately, but instead could block. It might be necessary to implement some form of
time-out mechanism.
Compensation logic isn't easily generalized. A compensating transaction is application specific. It relies on the
application having sufficient information to be able to undo the effects of each step in a failed operation.
You should define the steps in a compensating transaction as idempotent commands. This enables the steps to
be repeated if the compensating transaction itself fails.
The infrastructure that handles the steps in the original operation, and the compensating transaction, must be
resilient. It must not lose the information required to compensate for a failing step, and it must be able to
reliably monitor the progress of the compensation logic.
A compensating transaction doesn't necessarily return the data in the system to the state it was in at the start of
the original operation. Instead, it compensates for the work performed by the steps that completed successfully
before the operation failed.
The order of the steps in the compensating transaction doesn't necessarily have to be the exact opposite of the
steps in the original operation. For example, one data store might be more sensitive to inconsistencies than
another, and so the steps in the compensating transaction that undo the changes to this store should occur first.
Placing a short-term timeout-based lock on each resource that's required to complete an operation, and
obtaining these resources in advance, can help increase the likelihood that the overall activity will succeed. The
work should be performed only after all the resources have been acquired. All actions must be finalized before
the locks expire.
Consider using retry logic that is more forgiving than usual to minimize failures that trigger a compensating
transaction. If a step in an operation that implements eventual consistency fails, try handling the failure as a
transient exception and repeat the step. Only stop the operation and initiate a compensating transaction if a
step fails repeatedly or irrecoverably.

Many of the challenges of implementing a compensating transaction are the same as those with
implementing eventual consistency. See the section Considerations for Implementing Eventual Consistency
in the Data Consistency Primer for more information.

When to use this pattern


Use this pattern only for operations that must be undone if they fail. If possible, design solutions to avoid the
complexity of requiring compensating transactions.

Example
A travel website lets customers book itineraries. A single itinerary might comprise a series of flights and hotels.
A customer traveling from Seattle to London and then on to Paris could perform the following steps when
creating an itinerary:
1. Book a seat on flight F1 from Seattle to London.
2. Book a seat on flight F2 from London to Paris.
3. Book a seat on flight F3 from Paris to Seattle.
4. Reserve a room at hotel H1 in London.
5. Reserve a room at hotel H2 in Paris.
These steps constitute an eventually consistent operation, although each step is a separate action. Therefore, as
well as performing these steps, the system must also record the counter operations necessary to undo each
step in case the customer decides to cancel the itinerary. The steps necessary to perform the counter operations
can then run as a compensating transaction.
Notice that the steps in the compensating transaction might not be the exact opposite of the original steps, and
the logic in each step in the compensating transaction must take into account any business-specific rules. For
example, unbooking a seat on a flight might not entitle the customer to a complete refund of any money paid.
The figure illustrates generating a compensating transaction to undo a long-running transaction to book a
travel itinerary.

NOTE
It might be possible for the steps in the compensating transaction to be performed in parallel, depending on how you've
designed the compensating logic for each step.

In many business solutions, failure of a single step doesn't always necessitate rolling the system back by using a
compensating transaction. For example, if—after having booked flights F1, F2, and F3 in the travel website
scenario—the customer is unable to reserve a room at hotel H1, it's preferable to offer the customer a room at
a different hotel in the same city rather than canceling the flights. The customer can still decide to cancel (in
which case the compensating transaction runs and undoes the bookings made on flights F1, F2, and F3), but
this decision should be made by the customer rather than by the system.

Related patterns and guidance


The following patterns and guidance might also be relevant when implementing this pattern:
Data Consistency Primer. The Compensating Transaction pattern is often used to undo operations that
implement the eventual consistency model. This primer provides information on the benefits and
tradeoffs of eventual consistency.
Scheduler-Agent-Supervisor pattern. Describes how to implement resilient systems that perform
business operations that use distributed services and resources. Sometimes, it might be necessary to
undo the work performed by an operation by using a compensating transaction.
Retry pattern. Compensating transactions can be expensive to perform, and it might be possible to
minimize their use by implementing an effective policy of retrying failing operations by following the
Retry pattern.
Competing Consumers pattern

Enable multiple concurrent consumers to process messages received on the same messaging channel. This
enables a system to process multiple messages concurrently to optimize throughput, to improve scalability and
availability, and to balance the workload.

Context and problem


An application running in the cloud is expected to handle a large number of requests. Rather than process each
request synchronously, a common technique is for the application to pass them through a messaging system to
another service (a consumer service) that handles them asynchronously. This strategy helps to ensure that the
business logic in the application isn't blocked while the requests are being processed.
The number of requests can vary significantly over time for many reasons. A sudden increase in user activity or
aggregated requests coming from multiple tenants can cause an unpredictable workload. At peak hours a
system might need to process many hundreds of requests per second, while at other times the number could be
very small. Additionally, the nature of the work performed to handle these requests might be highly variable.
Using a single instance of the consumer service can cause that instance to become flooded with requests, or the
messaging system might be overloaded by an influx of messages coming from the application. To handle this
fluctuating workload, the system can run multiple instances of the consumer service. However, these consumers
must be coordinated to ensure that each message is only delivered to a single consumer. The workload also
needs to be load balanced across consumers to prevent an instance from becoming a bottleneck.

Solution
Use a message queue to implement the communication channel between the application and the instances of
the consumer service. The application posts requests in the form of messages to the queue, and the consumer
service instances receive messages from the queue and process them. This approach enables the same pool of
consumer service instances to handle messages from any instance of the application. The figure illustrates using
a message queue to distribute work to instances of a service.

This solution has the following benefits:


It provides a load-leveled system that can handle wide variations in the volume of requests sent by
application instances. The queue acts as a buffer between the application instances and the consumer
service instances. This can help to minimize the impact on availability and responsiveness for both the
application and the service instances, as described by the Queue-based Load Leveling pattern. Handling a
message that requires some long-running processing doesn't prevent other messages from being
handled concurrently by other instances of the consumer service.
It improves reliability. If a producer communicates directly with a consumer instead of using this pattern,
but doesn't monitor the consumer, there's a high probability that messages could be lost or fail to be
processed if the consumer fails. In this pattern, messages aren't sent to a specific service instance. A failed
service instance won't block a producer, and messages can be processed by any working service instance.
It doesn't require complex coordination between the consumers, or between the producer and the
consumer instances. The message queue ensures that each message is delivered at least once.
It's scalable. The system can dynamically increase or decrease the number of instances of the consumer
service as the volume of messages fluctuates.
It can improve resiliency if the message queue provides transactional read operations. If a consumer
service instance reads and processes the message as part of a transactional operation, and the consumer
service instance fails, this pattern can ensure that the message will be returned to the queue to be picked
up and handled by another instance of the consumer service.

Issues and considerations


Consider the following points when deciding how to implement this pattern:
Message ordering. The order in which consumer service instances receive messages isn't guaranteed,
and doesn't necessarily reflect the order in which the messages were created. Design the system to
ensure that message processing is idempotent because this will help to eliminate any dependency on the
order in which messages are handled. For more information, see Idempotency Patterns on Jonathan
Oliver’s blog.

Microsoft Azure Service Bus Queues can implement guaranteed first-in-first-out ordering of
messages by using message sessions. For more information, see Messaging Patterns Using Sessions.

Designing services for resiliency. If the system is designed to detect and restart failed service
instances, it might be necessary to implement the processing performed by the service instances as
idempotent operations to minimize the effects of a single message being retrieved and processed more
than once.
Detecting poison messages. A malformed message, or a task that requires access to resources that
aren't available, can cause a service instance to fail. The system should prevent such messages being
returned to the queue, and instead capture and store the details of these messages elsewhere so that they
can be analyzed if necessary.
Handling results. The service instance handling a message is fully decoupled from the application logic
that generates the message, and they might not be able to communicate directly. If the service instance
generates results that must be passed back to the application logic, this information must be stored in a
location that's accessible to both. In order to prevent the application logic from retrieving incomplete data,
the system must indicate when processing is complete.

If you're using Azure, a worker process can pass results back to the application logic by using a
dedicated message reply queue. The application logic must be able to correlate these results with the
original message. This scenario is described in more detail in the Asynchronous Messaging Primer.

Scaling the messaging system. In a large-scale solution, a single message queue could be
overwhelmed by the number of messages and become a bottleneck in the system. In this situation,
consider partitioning the messaging system to send messages from specific producers to a particular
queue, or use load balancing to distribute messages across multiple message queues.
Ensuring reliability of the messaging system. A reliable messaging system is needed to guarantee
that after the application enqueues a message it won't be lost. This is essential for ensuring that all
messages are delivered at least once.

When to use this pattern


Use this pattern when:
The workload for an application is divided into tasks that can run asynchronously.
Tasks are independent and can run in parallel.
The volume of work is highly variable, requiring a scalable solution.
The solution must provide high availability, and must be resilient if the processing for a task fails.
This pattern might not be useful when:
It's not easy to separate the application workload into discrete tasks, or there's a high degree of dependence
between tasks.
Tasks must be performed synchronously, and the application logic must wait for a task to complete before
continuing.
Tasks must be performed in a specific sequence.

Some messaging systems support sessions that enable a producer to group messages together and ensure
that they're all handled by the same consumer. This mechanism can be used with prioritized messages (if
they are supported) to implement a form of message ordering that delivers messages in sequence from a
producer to a single consumer.
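For example, with Azure Service Bus a producer can set a session ID so that a session-enabled queue delivers all messages in that session to the same consumer. A minimal sketch (the queue and session names are hypothetical):

// Requires Microsoft.Azure.ServiceBus and a session-enabled queue; names are hypothetical.
var sender = new MessageSender(serviceBusConnectionString, "ordersqueue");

var message = new Message(Encoding.UTF8.GetBytes("{ \"orderId\": 42, \"step\": \"payment\" }"))
{
    // All messages that share this SessionId are handled by a single consumer, in order.
    SessionId = "order-42"
};

await sender.SendAsync(message);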

Example
Azure provides Service Bus Queues and Azure Function queue triggers that, when combined, are a direct
implementation of this cloud design pattern. Azure Functions integrate with Azure Service Bus via triggers and
bindings. Integrating with Service Bus allows you to build functions that consume queue messages sent by
publishers. The publishing application(s) will post messages to a queue, and consumers, implemented as Azure
Functions, can retrieve messages from this queue and handle them.
For resiliency, a Service Bus queue enables a consumer to use PeekLock mode when it retrieves a message from
the queue. This mode doesn't actually remove the message, but simply hides it from other consumers. The Azure
Functions runtime receives a message in PeekLock mode; if the function finishes successfully, it calls Complete
on the message, or it calls Abandon if the function fails, and the message becomes visible again, allowing
another consumer to retrieve it. If the function runs for longer than the PeekLock timeout, the lock is
automatically renewed as long as the function is running.
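Outside the Functions runtime, a consumer can work with PeekLock directly through a MessageReceiver. The following sketch (with a hypothetical ProcessMessage method) shows roughly what the runtime does on your behalf:

// Rough sketch of explicit PeekLock settlement with Microsoft.Azure.ServiceBus.Core.
var receiver = new MessageReceiver(serviceBusConnectionString, "myqueue", ReceiveMode.PeekLock);

var message = await receiver.ReceiveAsync();
try
{
    ProcessMessage(message);   // hypothetical processing method
    // Complete removes the message from the queue.
    await receiver.CompleteAsync(message.SystemProperties.LockToken);
}
catch
{
    // Abandon releases the lock so another competing consumer can retry the message.
    await receiver.AbandonAsync(message.SystemProperties.LockToken);
}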
Azure Functions can scale out/in based on the depth of the queue, all acting as competing consumers of the
queue. If multiple instances of the functions are created they all compete by independently pulling and
processing the messages.
For detailed information on using Azure Service Bus queues, see Service Bus queues, topics, and subscriptions.
For information on Queue triggered Azure Functions, see Azure Service Bus trigger for Azure Functions.
The following code shows how you can create a new message and send it to a Service Bus Queue by using a
QueueClient instance.
private string serviceBusConnectionString = ...;
// _logger is an ILogger instance that is initialized elsewhere in the class.
...

public async Task SendMessagesAsync(CancellationToken ct)
{
    try
    {
        var msgNumber = 0;

        var queueClient = new QueueClient(serviceBusConnectionString, "myqueue");

        while (!ct.IsCancellationRequested)
        {
            // Create a new message to send to the queue
            string messageBody = $"Message {msgNumber}";
            var message = new Message(Encoding.UTF8.GetBytes(messageBody));

            // Write the body of the message to the console
            this._logger.LogInformation($"Sending message: {messageBody}");

            // Send the message to the queue
            await queueClient.SendAsync(message);

            this._logger.LogInformation("Message successfully sent.");

            msgNumber++;
        }
    }
    catch (Exception exception)
    {
        this._logger.LogError(exception.Message);
    }
}

The following code example shows a consumer, written as a C# Azure Function, that reads message metadata
and logs a Service Bus Queue message. Note how the ServiceBusTrigger attribute is used to bind it to a Service
Bus Queue.

[FunctionName("ProcessQueueMessage")]
public static void Run(
[ServiceBusTrigger("myqueue", Connection = "ServiceBusConnectionString")]
string myQueueItem,
Int32 deliveryCount,
DateTime enqueuedTimeUtc,
string messageId,
ILogger log)
{
log.LogInformation($"C# ServiceBus queue trigger function consumed message: {myQueueItem}");
log.LogInformation($"EnqueuedTimeUtc={enqueuedTimeUtc}");
log.LogInformation($"DeliveryCount={deliveryCount}");
log.LogInformation($"MessageId={messageId}");
}

Related patterns and guidance


The following patterns and guidance might be relevant when implementing this pattern:
Asynchronous Messaging Primer. Message queues are an asynchronous communications mechanism. If
a consumer service needs to send a reply to an application, it might be necessary to implement some
form of response messaging. The Asynchronous Messaging Primer provides information on how to
implement request/reply messaging using message queues.
Autoscaling Guidance. Because the length of the queue that applications post messages to varies, it might
be possible to start and stop instances of a consumer service. Autoscaling can help to maintain throughput
during times of peak processing.
Compute Resource Consolidation pattern. It might be possible to consolidate multiple instances of a
consumer service into a single process to reduce costs and management overhead. The Compute
Resource Consolidation pattern describes the benefits and tradeoffs of following this approach.
Queue-based Load Leveling pattern. Introducing a message queue can add resiliency to the system,
enabling service instances to handle widely varying volumes of requests from application instances. The
message queue acts as a buffer, which levels the load. The Queue-based Load Leveling pattern describes
this scenario in more detail.
Compute Resource Consolidation pattern

Consolidate multiple tasks or operations into a single computational unit. This can increase compute resource
utilization, and reduce the costs and management overhead associated with performing compute processing in
cloud-hosted applications.

Context and problem


A cloud application often implements a variety of operations. In some solutions it makes sense to follow the
design principle of separation of concerns initially, and divide these operations into separate computational units
that are hosted and deployed individually (for example, as separate App Service web apps, separate Virtual
Machines, or separate Cloud Service roles). However, although this strategy can help simplify the logical design of
the solution, deploying a large number of computational units as part of the same application can increase
runtime hosting costs and make management of the system more complex.
As an example, the figure shows the simplified structure of a cloud-hosted solution that is implemented using
more than one computational unit. Each computational unit runs in its own virtual environment. Each function has
been implemented as a separate task (labeled Task A through Task E) running in its own computational unit.

Each computational unit consumes chargeable resources, even when it's idle or lightly used. Therefore, this isn't
always the most cost-effective solution.
In Azure, this concern applies to roles in a Cloud Service, App Services, and Virtual Machines. These items run in
their own virtual environment. Running a collection of separate roles, websites, or virtual machines that are
designed to perform a set of well-defined operations, but that need to communicate and cooperate as part of a
single solution, can be an inefficient use of resources.

Solution
To help reduce costs, increase utilization, improve communication speed, and reduce management it's possible to
consolidate multiple tasks or operations into a single computational unit.
Tasks can be grouped according to criteria based on the features provided by the environment and the costs
associated with these features. A common approach is to look for tasks that have a similar profile concerning their
scalability, lifetime, and processing requirements. Grouping these together allows them to scale as a unit. The
elasticity provided by many cloud environments enables additional instances of a computational unit to be started
and stopped according to the workload. For example, Azure provides autoscaling that you can apply to roles in a
Cloud Service, App Services, and Virtual Machines. For more information, see Autoscaling Guidance.
As a counter example to show how scalability can be used to determine which operations shouldn't be grouped
together, consider the following two tasks:
Task 1 polls for infrequent, time-insensitive messages sent to a queue.
Task 2 handles high-volume bursts of network traffic.
The second task requires elasticity that can involve starting and stopping a large number of instances of the
computational unit. Applying the same scaling to the first task would simply result in more tasks listening for
infrequent messages on the same queue, and is a waste of resources.
In many cloud environments it's possible to specify the resources available to a computational unit in terms of the
number of CPU cores, memory, disk space, and so on. Generally, the more resources specified, the greater the
cost. To save money, it's important to maximize the work an expensive computational unit performs, and not let it
become inactive for an extended period.
If there are tasks that require a great deal of CPU power in short bursts, consider consolidating these into a single
computational unit that provides the necessary power. However, it's important to balance this need to keep
expensive resources busy against the contention that could occur if they are over stressed. Long-running,
compute-intensive tasks shouldn't share the same computational unit, for example.

Issues and considerations


Consider the following points when implementing this pattern:
Scalability and elasticity. Many cloud solutions implement scalability and elasticity at the level of the
computational unit by starting and stopping instances of units. Avoid grouping tasks that have conflicting
scalability requirements in the same computational unit.
Lifetime. The cloud infrastructure periodically recycles the virtual environment that hosts a computational unit.
When there are many long-running tasks inside a computational unit, it might be necessary to configure the unit
to prevent it from being recycled until these tasks have finished. Alternatively, design the tasks by using a check-
pointing approach that enables them to stop cleanly, and continue at the point they were interrupted when the
computational unit is restarted.
Release cadence. If the implementation or configuration of a task changes frequently, it might be necessary to
stop the computational unit hosting the updated code, reconfigure and redeploy the unit, and then restart it. This
process will also require that all other tasks within the same computational unit are stopped, redeployed, and
restarted.
Security. Tasks in the same computational unit might share the same security context and be able to access the
same resources. There must be a high degree of trust between the tasks, and confidence that one task isn't going
to corrupt or adversely affect another. Additionally, increasing the number of tasks running in a computational unit
increases the attack surface of the unit. Each task is only as secure as the one with the most vulnerabilities.
Fault tolerance. If one task in a computational unit fails or behaves abnormally, it can affect the other tasks
running within the same unit. For example, if one task fails to start correctly it can cause the entire startup logic
for the computational unit to fail, and prevent other tasks in the same unit from running.
Contention. Avoid introducing contention between tasks that compete for resources in the same computational
unit. Ideally, tasks that share the same computational unit should exhibit different resource utilization
characteristics. For example, two compute-intensive tasks should probably not reside in the same computational
unit, and neither should two tasks that consume large amounts of memory. However, mixing a compute-intensive
task with a task that requires a large amount of memory is a workable combination.

NOTE
Consider consolidating compute resources only for a system that's been in production for a period of time so that operators
and developers can monitor the system and create a heat map that identifies how each task uses differing resources. This
map can be used to determine which tasks are good candidates for sharing compute resources.

Complexity. Combining multiple tasks into a single computational unit adds complexity to the code in the unit,
possibly making it more difficult to test, debug, and maintain.
Stable logical architecture. Design and implement the code in each task so that it shouldn't need to change,
even if the physical environment the task runs in does change.
Other strategies. Consolidating compute resources is only one way to help reduce costs associated with running
multiple tasks concurrently. It requires careful planning and monitoring to ensure that it remains an effective
approach. Other strategies might be more appropriate, depending on the nature of the work and where the users
of these tasks are located. For example, functional decomposition of the workload (as described by the
Compute Partitioning Guidance) might be a better option.

When to use this pattern


Use this pattern for tasks that are not cost effective if they run in their own computational units. If a task spends
much of its time idle, running this task in a dedicated unit can be expensive.
This pattern might not be suitable for tasks that perform critical fault-tolerant operations, or tasks that process
highly sensitive or private data and require their own security context. These tasks should run in their own isolated
environment, in a separate computational unit.

Example
When building a cloud service on Azure, it’s possible to consolidate the processing performed by multiple tasks
into a single role. Typically this is a worker role that performs background or asynchronous processing tasks.

In some cases it's possible to include background or asynchronous processing tasks in the web role. This
technique helps to reduce costs and simplify deployment, although it can impact the scalability and
responsiveness of the public-facing interface provided by the web role.

The role is responsible for starting and stopping the tasks. When the Azure fabric controller loads a role, it raises
the Start event for the role. You can override the OnStart method of the WebRole or WorkerRole class to handle
this event, perhaps to initialize the data and other resources the tasks in this method depend on.
When the OnStart method completes, the role can start responding to requests. You can find more information
and guidance about using the OnStart and Run methods in a role in the Application Startup Processes section in
the patterns & practices guide Moving Applications to the Cloud.

Keep the code in the OnStart method as concise as possible. Azure doesn't impose any limit on the time
taken for this method to complete, but the role won't be able to start responding to network requests sent to it
until this method completes.

When the OnStart method has finished, the role executes the Run method. At this point, the fabric controller can
start sending requests to the role.
Place the code that actually creates the tasks in the Run method. Note that the Run method defines the lifetime
of the role instance. When this method completes, the fabric controller will arrange for the role to be shut down.
When a role shuts down or is recycled, the fabric controller prevents any more incoming requests being received
from the load balancer and raises the Stop event. You can capture this event by overriding the OnStop method of
the role and perform any tidying up required before the role terminates.

Any actions performed in the OnStop method must be completed within five minutes (or 30 seconds if you
are using the Azure emulator on a local computer). Otherwise the Azure fabric controller assumes that the
role has stalled and will force it to stop.

The tasks are started by the Run method that waits for the tasks to complete. The tasks implement the business
logic of the cloud service, and can respond to messages posted to the role through the Azure load balancer. The
figure shows the lifecycle of tasks and resources in a role in an Azure cloud service.

The WorkerRole.cs file in the ComputeResourceConsolidation.Worker project shows an example of how you might
implement this pattern in an Azure cloud service.

The ComputeResourceConsolidation.Worker project is part of the ComputeResourceConsolidation solution


available for download from GitHub.

The MyWorkerTask1 and the MyWorkerTask2 methods illustrate how to perform different tasks within the same
worker role. The following code shows MyWorkerTask1. This is a simple task that sleeps for 30 seconds and then
outputs a trace message. It repeats this process until the task is canceled. The code in MyWorkerTask2 is similar.
// A sample worker role task.
private static async Task MyWorkerTask1(CancellationToken ct)
{
    // Fixed interval to wake up and check for work and/or do work.
    var interval = TimeSpan.FromSeconds(30);

    try
    {
        while (!ct.IsCancellationRequested)
        {
            // Wake up and do some background processing if not canceled.
            // TASK PROCESSING CODE HERE
            Trace.TraceInformation("Doing Worker Task 1 Work");

            // Go back to sleep for a period of time unless asked to cancel.
            // Task.Delay will throw an OperationCanceledException when canceled.
            await Task.Delay(interval, ct);
        }
    }
    catch (OperationCanceledException)
    {
        // Expect this exception to be thrown in normal circumstances or check
        // the cancellation token. If the role instances are shutting down, a
        // cancellation request will be signaled.
        Trace.TraceInformation("Stopping service, cancellation requested");

        // Rethrow the exception.
        throw;
    }
}

The sample code shows a common implementation of a background process. In a real world application you
can follow this same structure, except that you should place your own processing logic in the body of the loop
that waits for the cancellation request.

After the worker role has initialized the resources it uses, the Run method starts the two tasks concurrently, as
shown here.
/// <summary>
/// The cancellation token source used to cooperatively cancel running tasks.
/// </summary>
private readonly CancellationTokenSource cts = new CancellationTokenSource();

/// <summary>
/// List of running tasks on the role instance.
/// </summary>
private readonly List<Task> tasks = new List<Task>();

// RoleEntry Run() is called after OnStart().
// Returning from Run() will cause a role instance to recycle.
public override void Run()
{
    // Start worker tasks and add to the task list
    tasks.Add(MyWorkerTask1(cts.Token));
    tasks.Add(MyWorkerTask2(cts.Token));

    // Add any additional worker tasks defined elsewhere in the sample.
    foreach (var worker in this.workerTasks)
    {
        this.tasks.Add(worker);
    }

    Trace.TraceInformation("Worker host tasks started");

    // The assumption is that all tasks should remain running and not return,
    // similar to role entry Run() behavior.
    try
    {
        Task.WaitAll(tasks.ToArray());
    }
    catch (AggregateException ex)
    {
        Trace.TraceError(ex.Message);

        // If any of the inner exceptions in the aggregate exception
        // are not cancellation exceptions then re-throw the exception.
        ex.Handle(innerEx => (innerEx is OperationCanceledException));
    }

    // If there wasn't a cancellation request, stop all tasks and return from Run().
    // An alternative to canceling and returning when a task exits would be to
    // restart the task.
    if (!cts.IsCancellationRequested)
    {
        Trace.TraceInformation("Task returned without cancellation request");
        Stop(TimeSpan.FromMinutes(5));
    }
}
...

In this example, the Run method waits for tasks to be completed. If a task is canceled, the Run method assumes
that the role is being shut down and waits for the remaining tasks to be canceled before finishing (it waits for a
maximum of five minutes before terminating). If a task fails due to an expected exception, the Run method
cancels the task.

You could implement more comprehensive monitoring and exception handling strategies in the Run method
such as restarting tasks that have failed, or including code that enables the role to stop and start individual
tasks.

The Stop method shown in the following code is called when the fabric controller shuts down the role instance
(it's invoked from the OnStop method). The code stops each task gracefully by canceling it. If any task takes more
than five minutes to complete, the cancellation processing in the Stop method ceases waiting and the role is
terminated.

// Stop running tasks and wait for tasks to complete before returning
// unless the timeout expires.
private void Stop(TimeSpan timeout)
{
    Trace.TraceInformation("Stop called. Canceling tasks.");
    // Cancel running tasks.
    cts.Cancel();

    Trace.TraceInformation("Waiting for canceled tasks to finish and return");

    // Wait for all the tasks to complete before returning. Note that the
    // emulator currently allows 30 seconds and Azure allows five
    // minutes for processing to complete.
    try
    {
        Task.WaitAll(tasks.ToArray(), timeout);
    }
    catch (AggregateException ex)
    {
        Trace.TraceError(ex.Message);

        // If any of the inner exceptions in the aggregate exception
        // are not cancellation exceptions then rethrow the exception.
        ex.Handle(innerEx => (innerEx is OperationCanceledException));
    }
}

Related patterns and guidance


The following patterns and guidance might also be relevant when implementing this pattern:
Autoscaling Guidance. Autoscaling can be used to start and stop instances of service hosting computational
resources, depending on the anticipated demand for processing.
Compute Partitioning Guidance. Describes how to allocate the services and components in a cloud service
in a way that helps to minimize running costs while maintaining the scalability, performance, availability,
and security of the service.
This pattern includes a downloadable sample application.
Deployment stamps

The deployment stamp pattern involves deploying multiple independent copies of application components,
including data stores. Each individual copy is called a stamp, or sometimes a service unit or scale unit. This
approach can improve the scalability of your solution, allow you to deploy instances across multiple regions, and
separate your customer data.
When hosting an application in the cloud there are certain considerations to be made. One key thing to keep in
mind is the performance and reliability of your application. If you host a single instance of your solution, you
might be subject to the following limitations:
Scale limits. Deploying a single instance of your application may result in natural scaling limits. For example,
you may use services that have limits on the number of inbound connections, host names, TCP sockets, or
other resources.
Non-linear scaling or cost. Some of your solution's components may not scale linearly with the number of
requests or the amount of data. Instead, there can be a sudden decrease in performance or increase in cost
once a threshold has been met. For example, you may use a database and discover that the marginal cost of
adding more capacity (scaling up) becomes prohibitive, and that scaling out is a more cost-effective strategy.
Similarly, Azure Front Door has higher per-domain pricing when a high number of custom domains are
deployed, and it may be better to spread the custom domains across multiple Front Door instances.
Separation of customers. You may need to keep certain customers' data isolated from other customers'
data. Similarly, you may have some customers that require more system resources to service than others, and
consider grouping them on different sets of infrastructure.
Handling single- and multi-tenant instances. You may have some large customers who need their own
independent instances of your solution. You may also have a pool of smaller customers who can share a multi-
tenant deployment.
Complex deployment requirements. You may need to deploy updates to your service in a controlled
manner, and to deploy to different subsets of your customer base at different times.
Update frequency. You may have some customers who are tolerant of having frequent updates to your
system, while others may be risk-averse and want infrequent updates to the system that services their
requests. It may make sense to have these customers deployed to isolated environments.
Geographical or geopolitical restrictions. To architect for low latency, or to comply with data sovereignty
requirements, you may deploy some of your customers into specific regions.

Implementation
To avoid these issues, consider deploying copies of your solution's components multiple times. Stamps operate
independently of each other and can be deployed and updated independently. A single geographical region may
contain a single stamp, or may contain multiple stamps to allow for horizontal scale-out within the region. Stamps
contain a subset of your customers.
Deployment stamps can apply whether your solution uses infrastructure as a service (IaaS) or platform as a
service (PaaS) components, or a mixture of both. Typically IaaS workloads require more intervention to scale, so
the pattern may be useful for IaaS-heavy workloads to allow for scaling out.
Stamps can be used to implement deployment rings. If different customers want to receive service updates at
different frequencies, they can be grouped onto different stamps, and each stamp could have updates deployed at
different cadences.
Because stamps run independently from each other, data is implicitly sharded. Furthermore, a single stamp can
make use of further sharding to internally allow for scalability and elasticity within the stamp.
The deployment stamp pattern is used internally by many Azure services, including App Service, Azure Stack, and
Azure Storage.
Deployment stamps are related to, but distinct from, geodes. In a deployment stamp architecture, multiple
independent instances of your system are deployed and contain a subset of your customers and users. In geodes,
all instances can serve requests from any users, but this architecture is often more complex to design and build.
You may also consider mixing the two patterns within one solution; the traffic routing approach described below is
an example of such a hybrid scenario.
Deployment
Because of the complexity that is involved in deploying identical copies of the same components, good DevOps
practices are critical to ensure success when implementing this pattern. Consider describing your infrastructure as
code, such as using Azure Resource Manager templates and scripts. With this approach, you can ensure that the
deployment of each stamp is predictable. It also reduces the likelihood of human errors such as accidental
mismatches in configuration between stamps.
You can deploy updates automatically to all stamps in parallel, in which case you might consider technologies like
Resource Manager templates or Azure Deployment Manager to coordinate the deployment of your infrastructure
and applications. Alternatively, you may decide to gradually roll out updates to some stamps first, and then
progressively to others. Azure Deployment Manager can also manage this type of staged rollout, or you could
consider using a release management tool like Azure Pipelines to orchestrate deployments to each stamp.
Carefully consider the topology of the Azure subscriptions and resource groups for your deployments:
Typically a subscription will contain all resources for a single solution, so in general consider using a single
subscription for all stamps. However, some Azure services impose subscription-wide quotas, so if you are
using this pattern to allow for a high degree of scale-out, you may need to consider deploying stamps across
different subscriptions.
Resource groups are generally used to deploy components with the same lifecycle. If you plan to deploy
updates to all of your stamps at once, consider using a single resource group to contain all of the components
for all of your stamps, and use resource naming conventions and tags to identify the components that belong
to each stamp. Alternatively, if you plan to deploy updates to each stamp independently, consider deploying
each stamp into its own resource group.
Capacity planning
Use load and performance testing to determine the approximate load that a given stamp can accommodate. Load
metrics may be based on the number of customers/tenants that a single stamp can accommodate, or metrics
from the services that the components within the stamp emit. Ensure that you have sufficient instrumentation to
measure when a given stamp is approaching its capacity, and the ability to deploy new stamps quickly to respond
to demand.
Traffic routing
The Deployment Stamp pattern works well if each stamp is addressed independently. For example, if Contoso
deploys the same API application across multiple stamps, they might consider using DNS to route traffic to the
relevant stamp:
unit1.aus.myapi.contoso.com routes traffic to stamp unit1 within an Australian region.
unit2.aus.myapi.contoso.com routes traffic to stamp unit2 within an Australian region.
unit1.eu.myapi.contoso.com routes traffic to stamp unit1 within a European region.

Clients are then responsible for connecting to the correct stamp.
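As a minimal sketch, the client-side addressing for this convention might look like the following C# helper. The
unit and region values are illustrative assumptions; a real client would typically read them from configuration or
obtain them from a routing service.

using System;

public static class StampAddressing
{
    // Builds the stamp-specific base address. For example,
    // GetStampUri("unit1", "aus") returns https://unit1.aus.myapi.contoso.com/
    public static Uri GetStampUri(string unit, string regionCode)
    {
        return new Uri($"https://{unit}.{regionCode}.myapi.contoso.com/");
    }
}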


If a single ingress point for all traffic is required, a traffic routing service may be used to resolve the stamp for a
given request, customer, or tenant. The traffic routing service either directs the client to the relevant URL for the
stamp (for example, using an HTTP 302 response status code), or may act as a reverse proxy and forward the
traffic to the relevant stamp without the client being aware.
A centralized traffic routing service can be a complex component to design, especially when a solution runs across
multiple regions. Consider deploying the traffic routing service into multiple regions (potentially including every
region that stamps are deployed into), and then ensuring the data store (mapping tenants to stamps) is
synchronized. The traffic routing component may itself be an instance of the geode pattern.
For example, Azure API Management could be deployed to act in the traffic routing service role. It can determine
the appropriate stamp for a request by looking up data in a Cosmos DB collection storing the mapping between
tenants and stamps. API Management can then dynamically set the back-end URL to the relevant stamp's API
service.
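The following sketch outlines the kind of tenant-to-stamp lookup such a routing component might perform. It assumes
a hypothetical Cosmos DB container named tenantStampMap whose items map a tenant ID to the base URL of the stamp
that serves it; the class and property names are illustrative, not part of any Azure API.

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public class TenantStampMapping
{
    public string id { get; set; }        // The tenant ID, also used as the partition key in this simplified model.
    public string stampUrl { get; set; }  // Base URL of the stamp that serves this tenant.
}

public class StampResolver
{
    private readonly Container container;

    public StampResolver(CosmosClient client, string databaseId)
    {
        // The database and container names are assumptions for this example.
        this.container = client.GetContainer(databaseId, "tenantStampMap");
    }

    public async Task<Uri> ResolveStampAsync(string tenantId)
    {
        // Read the mapping item for the tenant. A routing service would then either
        // redirect the client to this URL or proxy the request to it.
        ItemResponse<TenantStampMapping> response =
            await this.container.ReadItemAsync<TenantStampMapping>(tenantId, new PartitionKey(tenantId));

        return new Uri(response.Resource.stampUrl);
    }
}
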
To enable geo-distribution of requests and geo-redundancy of the traffic routing service, API Management can be
deployed across multiple regions, or Azure Front Door can be used to direct traffic to the closest instance. Front
Door can be configured with a backend pool, enabling requests to be directed to the closest available API
Management instance. The global distribution features of Cosmos DB can be used to keep the mapping
information updated across each region.

If a traffic routing service is included in your solution, consider whether it acts as a gateway and could therefore
perform gateway offloading for services such as token validation, throttling, and authorization.

Issues and considerations


You should consider the following points when deciding how to implement this pattern:
Deployment process. When deploying multiple stamps, it is highly advisable to have automated and fully
repeatable deployment processes. Consider using Resource Manager templates or Terraform templates to
declaratively define the stamp.
Cross-stamp operations. When your solution is deployed independently across multiple stamps, questions
like "how many customers do we have across all of our stamps?" can become more complex to answer.
Queries may need to be executed against each stamp and the results aggregated. Alternatively, consider having
all of the stamps publish data into a centralized data warehouse for consolidated reporting.
Determining scale-out policies. Stamps have a finite capacity, which might be defined using a proxy metric
such as the number of tenants that can be deployed to the stamp. It is important to monitor the available
capacity and used capacity for each stamp, and to proactively deploy additional stamps to allow for new
tenants to be directed to them.
Cost. The Deployment Stamp pattern involves deploying multiple copies of your infrastructure components,
which will likely involve a substantial increase in the cost of operating your solution.
Moving between stamps. As each stamp is deployed and operated independently, moving tenants between
stamps can be difficult. Your application would need custom logic to transmit the information about a given
customer to a different stamp, and then to remove the tenant's information from the original stamp. This
process may require a backplane for communication between stamps, further increasing the complexity of the
overall solution.
Traffic routing. As described above, routing traffic to the correct stamp for a given request can require an
additional component to resolve tenants to stamps. This component, in turn, may need to be made highly
available.
Shared components. You may have some components that can be shared across stamps. For example, if you
have a shared single-page app for all tenants, consider deploying that into one region and using Azure CDN to
replicate it globally.

When to use deployment stamps


This pattern is useful when you have:
Natural limits on scalability. For example, if some components cannot or should not scale beyond a certain
number of customers or requests, consider scaling out using stamps.
A requirement to separate certain tenants from others. If you have customers that cannot be deployed into a
multi-tenant stamp with other customers due to security concerns, they can be deployed onto their own
isolated stamp.
A need to have some tenants on different versions of your solution at the same time.
Multi-region applications where each tenant's data and traffic should be directed to a specific region.
A desire to achieve resiliency during outages. As stamps are independent of one another, if an outage affects a
single stamp then the tenants deployed to other stamps should not be affected. This isolation helps to contain
the 'blast radius' of an incident or outage.
This pattern is not suitable for:
Simple solutions that do not need to scale to a high degree.
Systems that can be easily scaled out or up within a single instance, such as by increasing the size of the
application layer or by increasing the reserved capacity for databases and the storage tier.
Solutions in which data should be replicated across all deployed instances. Consider the geode pattern for this
scenario.
Solutions in which only some components need to be scaled, but not others. For example, consider whether
your solution could be scaled by sharding the data store rather than deploying a new copy of all of the solution
components.
Solutions comprised solely of static content, such as a front-end JavaScript application. Consider storing such
content in a storage account and using Azure CDN.

Supporting technologies
Infrastructure as code. For example, Resource Manager templates, Azure CLI, Terraform, PowerShell, Bash.
Deployment Manager, which can orchestrate deployments of a solution across multiple stamps.
Azure Front Door, which can route traffic to a specific stamp or to a traffic routing service.
Example
The following example deploys multiple stamps of a simple PaaS solution, with an app service and a SQL
Database in each stamp. While stamps can be configured in any region that supports the services deployed in the
template, for illustration purposes this template deploys two stamps within the West US 2 region and a further
stamp in the West Europe region. Within a stamp, the app service receives its own public DNS hostname and it
can receive connections independently of all other stamps.

WARNING
The example below uses a SQL Server administrator account. It's generally not a good practice to use an administrative
account from your application. For a real application, consider using a managed identity to connect from your application to
a SQL database, or use a least-privilege account.

Click the link below to deploy the solution.

NOTE
There are alternative approaches to deploying stamps with a Resource Manager template, including using nested templates
or linked templates to decouple the definition of each stamp from the iteration required to deploy multiple copies.

Example traffic routing approach


The following example deploys an implementation of a traffic routing solution that could be used with a set of
deployment stamps for a hypothetical API application. To allow for geographical distribution of incoming requests,
Front Door is deployed alongside multiple instances of API Management on the consumption tier. Each API
Management instance reads the tenant ID from the request URL and then looks up the relevant stamp for the
request from a geo-distributed Cosmos DB data store. The request is then forwarded to the relevant back-end
stamp.
Click the link below to deploy the solution.

Next steps
Sharding can be used as another, simpler, approach to scale out your data tier. Stamps implicitly shard their
data, but sharding does not require a Deployment Stamp. For more information, see Sharding Pattern.
If a traffic routing service is deployed, the Gateway Routing and Gateway Offloading patterns can be used
together to make the best use of this component.
Event Sourcing pattern

Instead of storing just the current state of the data in a domain, use an append-only store to record the full series
of actions taken on that data. The store acts as the system of record and can be used to materialize the domain
objects. This can simplify tasks in complex domains, by avoiding the need to synchronize the data model and the
business domain, while improving performance, scalability, and responsiveness. It can also provide consistency
for transactional data, and maintain full audit trails and history that can enable compensating actions.

Context and problem


Most applications work with data, and the typical approach is for the application to maintain the current state of
the data by updating it as users work with it. For example, in the traditional create, read, update, and delete
(CRUD) model a typical data process is to read data from the store, make some modifications to it, and update
the current state of the data with the new values—often by using transactions that lock the data.
The CRUD approach has some limitations:
CRUD systems perform update operations directly against a data store, which can slow down performance
and responsiveness, and limit scalability, due to the processing overhead it requires.
In a collaborative domain with many concurrent users, data update conflicts are more likely because the
update operations take place on a single item of data.
Unless there's an additional auditing mechanism that records the details of each operation in a separate
log, history is lost.

For a deeper understanding of the limits of the CRUD approach, see CRUD, Only When You Can Afford It.

Solution
The Event Sourcing pattern defines an approach to handling operations on data that's driven by a sequence of
events, each of which is recorded in an append-only store. Application code sends a series of events that
imperatively describe each action that has occurred on the data to the event store, where they're persisted. Each
event represents a set of changes to the data (such as AddedItemToOrder ).
The events are persisted in an event store that acts as the system of record (the authoritative data source) about
the current state of the data. The event store typically publishes these events so that consumers can be notified
and can handle them if needed. Consumers could, for example, initiate tasks that apply the operations in the
events to other systems, or perform any other associated action that's required to complete the operation. Notice
that the application code that generates the events is decoupled from the systems that subscribe to the events.
Typical uses of the events published by the event store are to maintain materialized views of entities as actions in
the application change them, and for integration with external systems. For example, a system can maintain a
materialized view of all customer orders that's used to populate parts of the UI. As the application adds new
orders, adds or removes items on the order, and adds shipping information, the events that describe these
changes can be handled and used to update the materialized view.
In addition, at any point it's possible for applications to read the history of events, and use it to materialize the
current state of an entity by playing back and consuming all the events related to that entity. This can occur on
demand to materialize a domain object when handling a request, or through a scheduled task so that the state of
the entity can be stored as a materialized view to support the presentation layer.
The figure shows an overview of the pattern, including some of the options for using the event stream such as
creating a materialized view, integrating events with external applications and systems, and replaying events to
create projections of the current state of specific entities.
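As a minimal illustration of replaying events to materialize state, the following sketch uses an in-memory list of
events in place of a real event store. The event types and the order domain are hypothetical and are intended only
to show the mechanics of replay.

using System;
using System.Collections.Generic;

// Hypothetical event types. In a real system these would be persisted in the append-only event store.
public abstract class OrderEvent
{
    public Guid OrderId { get; set; }
    public DateTimeOffset OccurredAt { get; set; }
}

public class AddedItemToOrder : OrderEvent
{
    public string Sku { get; set; }
    public int Quantity { get; set; }
}

public class RemovedItemFromOrder : OrderEvent
{
    public string Sku { get; set; }
    public int Quantity { get; set; }
}

public static class OrderProjection
{
    // Replays the full event stream for one order to materialize its current contents.
    public static IDictionary<string, int> GetCurrentItems(IEnumerable<OrderEvent> stream)
    {
        var items = new Dictionary<string, int>();
        foreach (var e in stream)
        {
            if (e is AddedItemToOrder added)
            {
                items.TryGetValue(added.Sku, out var current);
                items[added.Sku] = current + added.Quantity;
            }
            else if (e is RemovedItemFromOrder removed)
            {
                items.TryGetValue(removed.Sku, out var current);
                items[removed.Sku] = current - removed.Quantity;
            }
        }
        return items;
    }
}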

The Event Sourcing pattern provides the following advantages:


Events are immutable and can be stored using an append-only operation. The user interface, workflow, or
process that initiated an event can continue, and tasks that handle the events can run in the background.
This, combined with the fact that there's no contention during the processing of transactions, can vastly
improve performance and scalability for applications, especially for the presentation level or user interface.
Events are simple objects that describe some action that occurred, together with any associated data
required to describe the action represented by the event. Events don't directly update a data store. They're
simply recorded for handling at the appropriate time. This can simplify implementation and management.
Events typically have meaning for a domain expert, whereas object-relational impedance mismatch can
make complex database tables hard to understand. Tables are artificial constructs that represent the
current state of the system, not the events that occurred.
Event sourcing can help prevent concurrent updates from causing conflicts because it avoids the
requirement to directly update objects in the data store. However, the domain model must still be designed
to protect itself from requests that might result in an inconsistent state.
The append-only storage of events provides an audit trail that can be used to monitor actions taken
against a data store, regenerate the current state as materialized views or projections by replaying the
events at any time, and assist in testing and debugging the system. In addition, the requirement to use
compensating events to cancel changes provides a history of changes that were reversed, which wouldn't
be the case if the model simply stored the current state. The list of events can also be used to analyze
application performance and detect user behavior trends, or to obtain other useful business information.
The event store raises events, and tasks perform operations in response to those events. This decoupling
of the tasks from the events provides flexibility and extensibility. Tasks know about the type of event and
the event data, but not about the operation that triggered the event. In addition, multiple tasks can handle
each event. This enables easy integration with other services and systems that only listen for new events
raised by the event store. However, the event sourcing events tend to be very low level, and it might be
necessary to generate specific integration events instead.

Event sourcing is commonly combined with the CQRS pattern by performing the data management tasks in
response to the events, and by materializing views from the stored events.

Issues and considerations


Consider the following points when deciding how to implement this pattern:
The system will only be eventually consistent when creating materialized views or generating projections of data
by replaying events. There's some delay between an application adding events to the event store as the result of
handling a request, the events being published, and consumers of the events handling them. During this period,
new events that describe further changes to entities might have arrived at the event store.

NOTE
See the Data Consistency Primer for information about eventual consistency.

The event store is the permanent source of information, and so the event data should never be updated. The only
way to update an entity to undo a change is to add a compensating event to the event store. If the format (rather
than the data) of the persisted events needs to change, perhaps during a migration, it can be difficult to combine
existing events in the store with the new version. It might be necessary to iterate through all the events making
changes so they're compliant with the new format, or add new events that use the new format. Consider using a
version stamp on each version of the event schema to maintain both the old and the new event formats.
Multi-threaded applications and multiple instances of applications might be storing events in the event store. The
consistency of events in the event store is vital, as is the order of events that affect a specific entity (the order that
changes occur to an entity affects its current state). Adding a timestamp to every event can help to avoid issues.
Another common practice is to annotate each event resulting from a request with an incremental identifier. If two
actions attempt to add events for the same entity at the same time, the event store can reject an event that
matches an existing entity identifier and event identifier.
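A minimal sketch of such an event envelope and conflict check appears below. The names are hypothetical; the key
point is that the combination of entity identifier and per-entity sequence number gives the store something it can
use to reject conflicting concurrent appends.

using System;
using System.Collections.Generic;

// Hypothetical envelope that wraps every event written to the store.
public class EventEnvelope
{
    public Guid EntityId { get; set; }        // The entity (stream) the event belongs to.
    public long SequenceNumber { get; set; }  // Incremented per entity; a duplicate indicates a conflict.
    public DateTimeOffset Timestamp { get; set; }
    public string EventType { get; set; }
    public string Payload { get; set; }       // Serialized event data.
}

public class InMemoryEventStore
{
    private readonly HashSet<(Guid, long)> index = new HashSet<(Guid, long)>();
    private readonly List<EventEnvelope> events = new List<EventEnvelope>();

    public void Append(EventEnvelope envelope)
    {
        // Reject an append whose (entity, sequence) pair already exists,
        // which indicates that another writer got there first.
        if (!this.index.Add((envelope.EntityId, envelope.SequenceNumber)))
        {
            throw new InvalidOperationException("Conflict: an event already exists for this entity and sequence number.");
        }

        this.events.Add(envelope);
    }
}
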
There's no standard approach, or existing mechanisms such as SQL queries, for reading the events to obtain
information. The only data that can be extracted is a stream of events using an event identifier as the criteria. The
event ID typically maps to individual entities. The current state of an entity can be determined only by replaying
all of the events that relate to it against the original state of that entity.
The length of each event stream affects managing and updating the system. If the streams are large, consider
creating snapshots at specific intervals such as a specified number of events. The current state of the entity can be
obtained from the snapshot and by replaying any events that occurred after that point in time. For more
information about creating snapshots of data, see Snapshot on Martin Fowler’s Enterprise Application
Architecture website and Primary-Subordinate Snapshot Replication.
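The following sketch shows the general shape of rebuilding state from a snapshot plus the events recorded after it.
The snapshot and event types are hypothetical stand-ins; a real implementation would load both from its event store.

using System.Collections.Generic;
using System.Linq;

// Hypothetical snapshot of an entity's state at a known event version.
public class AccountSnapshot
{
    public long Version { get; set; }
    public decimal Balance { get; set; }
}

public class AmountDeposited
{
    public long Version { get; set; }
    public decimal Amount { get; set; }
}

public static class AccountRehydration
{
    // Start from the snapshot and apply only the events recorded after it,
    // instead of replaying the entire stream from the beginning.
    public static decimal GetCurrentBalance(AccountSnapshot snapshot, IEnumerable<AmountDeposited> allEvents)
    {
        var balance = snapshot.Balance;
        foreach (var evt in allEvents.Where(e => e.Version > snapshot.Version))
        {
            balance += evt.Amount;
        }
        return balance;
    }
}
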
Even though event sourcing minimizes the chance of conflicting updates to the data, the application must still be
able to deal with inconsistencies that result from eventual consistency and the lack of transactions. For example,
an event that indicates a reduction in stock inventory might arrive in the data store while an order for that item is
being placed, resulting in a requirement to reconcile the two operations either by advising the customer or
creating a back order.
Event publication might be "at least once," and so consumers of the events must be idempotent. They must not
reapply the update described in an event if the event is handled more than once. For example, if multiple
instances of a consumer maintain an aggregate of an entity's property, such as the total number of orders placed,
only one must succeed in incrementing the aggregate when an order placed event occurs. While this isn't a key
characteristic of event sourcing, it's the usual implementation decision.
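A minimal sketch of an idempotent consumer appears below. It assumes each event carries a unique identifier and
that the consumer tracks the identifiers it has already processed; the in-memory set shown here stands in for
durable storage.

using System;
using System.Collections.Generic;

public class OrderPlacedEvent
{
    public Guid EventId { get; set; }
    public Guid OrderId { get; set; }
}

public class OrderCounterProjection
{
    // In production this bookkeeping would live in durable storage,
    // updated atomically with the projection itself.
    private readonly HashSet<Guid> processedEventIds = new HashSet<Guid>();

    public int TotalOrdersPlaced { get; private set; }

    public void Handle(OrderPlacedEvent e)
    {
        // Ignore redeliveries: the aggregate is incremented at most once per event.
        if (!this.processedEventIds.Add(e.EventId))
        {
            return;
        }

        this.TotalOrdersPlaced++;
    }
}
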
When to use this pattern
Use this pattern in the following scenarios:
When you want to capture intent, purpose, or reason in the data. For example, changes to a customer
entity can be captured as a series of specific event types such as Moved home, Closed account, or
Deceased.
When it's vital to minimize or completely avoid the occurrence of conflicting updates to data.
When you want to record events that occur, and be able to replay them to restore the state of a system, roll
back changes, or keep a history and audit log. For example, when a task involves multiple steps you might
need to execute actions to revert updates and then replay some steps to bring the data back into a
consistent state.
When using events is a natural feature of the operation of the application, and requires little additional
development or implementation effort.
When you need to decouple the process of inputting or updating data from the tasks required to apply
these actions. This might be to improve UI performance, or to distribute events to other listeners that take
action when the events occur. For example, integrating a payroll system with an expense submission
website so that events raised by the event store in response to data updates made in the website are
consumed by both the website and the payroll system.
When you want flexibility to be able to change the format of materialized models and entity data if
requirements change, or—when used in conjunction with CQRS—you need to adapt a read model or the
views that expose the data.
When used in conjunction with CQRS, and eventual consistency is acceptable while a read model is
updated, or the performance impact of rehydrating entities and data from an event stream is acceptable.
This pattern might not be useful in the following situations:
Small or simple domains, systems that have little or no business logic, or nondomain systems that
naturally work well with traditional CRUD data management mechanisms.
Systems where consistency and real-time updates to the views of the data are required.
Systems where audit trails, history, and capabilities to roll back and replay actions are not required.
Systems where there's only a very low occurrence of conflicting updates to the underlying data. For
example, systems that predominantly add data rather than updating it.

Example
A conference management system needs to track the number of completed bookings for a conference so that it
can check whether there are seats still available when a potential attendee tries to make a booking. The system
could store the total number of bookings for a conference in at least two ways:
The system could store the information about the total number of bookings as a separate entity in a
database that holds booking information. As bookings are made or canceled, the system could increment
or decrement this number as appropriate. This approach is simple in theory, but can cause scalability
issues if a large number of attendees are attempting to book seats during a short period of time. For
example, in the last day or so prior to the booking period closing.
The system could store information about bookings and cancellations as events held in an event store. It
could then calculate the number of seats available by replaying these events. This approach can be more
scalable due to the immutability of events. The system only needs to be able to read data from the event
store, or append data to the event store. Event information about bookings and cancellations is never
modified.
The following diagram illustrates how the seat reservation subsystem of the conference management system
might be implemented using event sourcing.

The sequence of actions for reserving two seats is as follows:


1. The user interface issues a command to reserve seats for two attendees. The command is handled by a
separate command handler, a piece of logic that is decoupled from the user interface and is responsible
for handling requests posted as commands.
2. An aggregate containing information about all reservations for the conference is constructed by querying
the events that describe bookings and cancellations. This aggregate is called SeatAvailability , and is
contained within a domain model that exposes methods for querying and modifying the data in the
aggregate.

Some optimizations to consider are using snapshots (so that you don’t need to query and replay the
full list of events to obtain the current state of the aggregate), and maintaining a cached copy of the
aggregate in memory.
3. The command handler invokes a method exposed by the domain model to make the reservations.
4. The SeatAvailability aggregate records an event containing the number of seats that were reserved. The
next time the aggregate applies events, all the reservations will be used to compute how many seats
remain.
5. The system appends the new event to the list of events in the event store.
If a user cancels a seat, the system follows a similar process except the command handler issues a command that
generates a seat cancellation event and appends it to the event store.
As well as providing more scope for scalability, using an event store also provides a complete history, or audit
trail, of the bookings and cancellations for a conference. The events in the event store are the accurate record.
There is no need to persist aggregates in any other way because the system can easily replay the events and
restore the state to any point in time.
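As a rough sketch of steps 2 through 4, the following code shows how a SeatAvailability-style aggregate might apply
reservation and cancellation events and then produce a new event for a reservation request. The event and class
shapes are simplified illustrations rather than the full sample.

using System;
using System.Collections.Generic;

public class SeatsReserved
{
    public int Seats { get; set; }
}

public class SeatsCancelled
{
    public int Seats { get; set; }
}

public class SeatAvailability
{
    private readonly int totalSeats;
    private int reservedSeats;

    public SeatAvailability(int totalSeats)
    {
        this.totalSeats = totalSeats;
    }

    // Replays the recorded events to bring the aggregate up to date.
    public void Apply(IEnumerable<object> events)
    {
        foreach (var e in events)
        {
            if (e is SeatsReserved reserved) this.reservedSeats += reserved.Seats;
            else if (e is SeatsCancelled cancelled) this.reservedSeats -= cancelled.Seats;
        }
    }

    // Returns a new event if the reservation can be satisfied; the caller appends it to the event store.
    public SeatsReserved MakeReservation(int seats)
    {
        if (this.reservedSeats + seats > this.totalSeats)
        {
            throw new InvalidOperationException("Not enough seats remain for this reservation.");
        }

        return new SeatsReserved { Seats = seats };
    }
}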

You can find more information about this example in Introducing Event Sourcing.

Related patterns and guidance


The following patterns and guidance might also be relevant when implementing this pattern:
Command and Query Responsibility Segregation (CQRS) pattern. The write store that provides the
permanent source of information for a CQRS implementation is often based on an implementation of the
Event Sourcing pattern. Describes how to segregate the operations that read data in an application from
the operations that update data by using separate interfaces.
Materialized View pattern. The data store used in a system based on event sourcing is typically not well
suited to efficient querying. Instead, a common approach is to generate prepopulated views of the data at
regular intervals, or when the data changes. Shows how this can be done.
Compensating Transaction pattern. The existing data in an event sourcing store is not updated, instead new
entries are added that transition the state of entities to the new values. To reverse a change, compensating
entries are used because it isn't possible to simply reverse the previous change. Describes how to undo the
work that was performed by a previous operation.
Data Consistency Primer. When using event sourcing with a separate read store or materialized views, the
read data won't be immediately consistent, instead it'll be only eventually consistent. Summarizes the
issues surrounding maintaining consistency over distributed data.
Data Partitioning Guidance. Data is often partitioned when using event sourcing to improve scalability,
reduce contention, and optimize performance. Describes how to divide data into discrete partitions, and
the issues that can arise.
External Configuration Store pattern

Move configuration information out of the application deployment package to a centralized location. This can
provide opportunities for easier management and control of configuration data, and for sharing configuration
data across applications and application instances.

Context and problem


The majority of application runtime environments include configuration information that's held in files deployed
with the application. In some cases, it's possible to edit these files to change the application behavior after it's been
deployed. However, changes to the configuration require the application be redeployed, often resulting in
unacceptable downtime and other administrative overhead.
Local configuration files also limit the configuration to a single application, but sometimes it would be useful to
share configuration settings across multiple applications. Examples include database connection strings, UI theme
information, or the URLs of queues and storage used by a related set of applications.
It's challenging to manage changes to local configurations across multiple running instances of the application,
especially in a cloud-hosted scenario. It can result in instances using different configuration settings while the
update is being deployed.
In addition, updates to applications and components might require changes to configuration schemas. Many
configuration systems don't support different versions of configuration information.

Solution
Store the configuration information in external storage, and provide an interface that can be used to quickly and
efficiently read and update configuration settings. The type of external store depends on the hosting and runtime
environment of the application. In a cloud-hosted scenario it's typically a cloud-based storage service, but could be
a hosted database or other system.
The backing store you choose for configuration information should have an interface that provides consistent and
easy-to-use access. It should expose the information in a correctly typed and structured format. The
implementation might also need to authorize users’ access in order to protect configuration data, and be flexible
enough to allow storage of multiple versions of the configuration (such as development, staging, or production,
including multiple release versions of each one).

Many built-in configuration systems read the data when the application starts up, and cache the data in
memory to provide fast access and minimize the impact on application performance. Depending on the type
of backing store used, and the latency of this store, it might be helpful to implement a caching mechanism
within the external configuration store. For more information, see the Caching Guidance. The figure illustrates
an overview of the External Configuration Store pattern with optional local cache.
Issues and considerations
Consider the following points when deciding how to implement this pattern:
Choose a backing store that offers acceptable performance, high availability, robustness, and can be backed up as
part of the application maintenance and administration process. In a cloud-hosted application, using a cloud
storage mechanism is usually a good choice to meet these requirements.
Design the schema of the backing store to allow flexibility in the types of information it can hold. Ensure that it
provides for all configuration requirements such as typed data, collections of settings, multiple versions of settings,
and any other features that the applications using it require. The schema should be easy to extend to support
additional settings as requirements change.
Consider the physical capabilities of the backing store, how it relates to the way configuration information is
stored, and the effects on performance. For example, storing an XML document containing configuration
information will require either the configuration interface or the application to parse the document in order to
read individual settings. It'll make updating a setting more complicated, though caching the settings can help to
offset slower read performance.
Consider how the configuration interface will permit control of the scope and inheritance of configuration settings.
For example, it might be a requirement to scope configuration settings at the organization, application, and the
machine level. It might need to support delegation of control over access to different scopes, and to prevent or
allow individual applications to override settings.
Ensure that the configuration interface can expose the configuration data in the required formats such as typed
values, collections, key/value pairs, or property bags.
Consider how the configuration store interface will behave when settings contain errors, or don't exist in the
backing store. It might be appropriate to return default settings and log errors. Also consider aspects such as the
case sensitivity of configuration setting keys or names, the storage and handling of binary data, and the ways that
null or empty values are handled.
Consider how to protect the configuration data to allow access to only the appropriate users and applications. This
is likely a feature of the configuration store interface, but it's also necessary to ensure that the data in the backing
store can't be accessed directly without the appropriate permission. Ensure strict separation between the
permissions required to read and to write configuration data. Also consider whether you need to encrypt some or
all of the configuration settings, and how this'll be implemented in the configuration store interface.
Centrally stored configurations, which change application behavior during runtime, are critically important and
should be deployed, updated, and managed using the same mechanisms as deploying application code. For
example, changes that can affect more than one application must be carried out using a full test and staged
deployment approach to ensure that the change is appropriate for all applications that use this configuration. If an
administrator edits a setting to update one application, it could adversely impact other applications that use the
same setting.
If an application caches configuration information, the application needs to be alerted if the configuration changes.
It might be possible to implement an expiration policy over cached configuration data so that this information is
automatically refreshed periodically and any changes picked up (and acted on).

When to use this pattern


This pattern is useful for:
Configuration settings that are shared between multiple applications and application instances, or where a
standard configuration must be enforced across multiple applications and application instances.
A standard configuration system that doesn't support all of the required configuration settings, such as
storing images or complex data types.
As a complementary store for some of the settings for applications, perhaps allowing applications to
override some or all of the centrally-stored settings.
As a way to simplify administration of multiple applications, and optionally for monitoring use of
configuration settings by logging some or all types of access to the configuration store.

Example
In a Microsoft Azure hosted application, a typical choice for storing configuration information externally is to use
Azure Storage. This is resilient, offers high performance, and is replicated three times with automatic failover to
offer high availability. Azure Table storage provides a key/value store with the ability to use a flexible schema for
the values. Azure Blob storage provides a hierarchical, container-based store that can hold any type of data in
individually named blobs.
The following example shows how a configuration store can be implemented over Blob storage to store and
expose configuration information. The BlobSettingsStore class abstracts Blob storage for holding configuration
information, and implements the ISettingsStore interface shown in the following code.

This code is provided in the ExternalConfigurationStore.Cloud project in the ExternalConfigurationStore


solution, available from GitHub.

public interface ISettingsStore
{
    Task<ETag> GetVersionAsync();

    Task<Dictionary<string, string>> FindAllAsync();
}

This interface defines methods for retrieving and updating configuration settings held in the configuration store,
and includes a version number that can be used to detect whether any configuration settings have been modified
recently. The BlobSettingsStore class uses the ETag property of the blob to implement versioning. The ETag
property is updated automatically each time the blob is written.

By design, this simple solution exposes all configuration settings as string values rather than typed values.

The ExternalConfigurationManager class provides a wrapper around a BlobSettingsStore object. An application


can use this class to store and retrieve configuration information. This class uses the Microsoft Reactive Extensions
library to expose any changes made to the configuration through an implementation of the IObservable interface.
If a setting is modified by calling the SetAppSetting method, the Changed event is raised and all subscribers to this
event will be notified.
Note that all settings are also cached in a Dictionary object inside the ExternalConfigurationManager class for fast
access. The GetAppSetting method used to retrieve a configuration setting reads the data from the cache. If the
setting isn't found in the cache, it's retrieved from the BlobSettingsStore object instead.
The GetSettings method invokes the CheckForConfigurationChangesAsync method to detect whether the configuration
information in blob storage has changed. It does this by examining the version number and comparing it with the
current version number held by the ExternalConfigurationManager object. If one or more changes have occurred,
the Changed event is raised and the configuration settings cached in the Dictionary object are refreshed. This is
an application of the Cache-Aside pattern.
The following code sample shows how the Changed event, the GetAppSetting method, and the
CheckForConfigurationChangesAsync method are implemented:

public class ExternalConfigurationManager : IDisposable


{
// An abstraction of the configuration store.
private readonly ISettingsStore settings;
private readonly ISubject<KeyValuePair<string, string>> changed;
...
private readonly ReaderWriterLockSlim settingsCacheLock = new ReaderWriterLockSlim();
private readonly SemaphoreSlim syncCacheSemaphore = new SemaphoreSlim(1);
...
private Dictionary<string, string> settingsCache;
private ETag currentVersion;
...

public ExternalConfigurationManager(ISettingsStore settings, TimeSpan interval, string environment)


{
this.settings = settings;
this.interval = interval;
this.CheckForConfigurationChangesAsync().Wait();
this.changed = new Subject<KeyValuePair<string, string>>();
this.Environment = environment;
}

...
public IObservable<KeyValuePair<string, string>> Changed => this.changed.AsObservable();
...

public string GetAppSetting(string key)


{
...
// Try to get the value from the settings cache.
// If there's a cache miss, get the setting from the settings store and refresh the settings cache.

string value;
try
{
this.settingsCacheLock.EnterReadLock();

this.settingsCache.TryGetValue(key, out value);


}
finally
{
this.settingsCacheLock.ExitReadLock();
}

return value;
}
...
private async Task CheckForConfigurationChangesAsync()
{
try
{
// It is assumed that updates are infrequent.
// To avoid race conditions in refreshing the cache, synchronize access to the in-memory cache.
await this.syncCacheSemaphore.WaitAsync();

var latestVersion = await this.settings.GetVersionAsync();

// If the versions are the same, nothing has changed in the configuration.
if (this.currentVersion == latestVersion) return;

// Get the latest settings from the settings store and publish changes.
var latestSettings = await this.settings.FindAllAsync();

// Refresh the settings cache.


try
{
this.settingsCacheLock.EnterWriteLock();

if (this.settingsCache != null)
{
//Notify settings changed
latestSettings.Except(this.settingsCache).ToList().ForEach(kv => this.changed.OnNext(kv));
}
this.settingsCache = latestSettings;
}
finally
{
this.settingsCacheLock.ExitWriteLock();
}

// Update the current version.


this.currentVersion = latestVersion;
}
catch (Exception ex)
{
this.changed.OnError(ex);
}
finally
{
this.syncCacheSemaphore.Release();
}
}
}

The ExternalConfigurationManager class also provides a property named Environment . This property supports
varying configurations for an application running in different environments, such as staging and production.

An ExternalConfigurationManager object can also query the BlobSettingsStore object periodically for any changes.
In the following code, the StartMonitor method calls CheckForConfigurationChangesAsync at an interval to detect any
changes and raise the Changed event, as described earlier.

public class ExternalConfigurationManager : IDisposable


{
...
private readonly ISubject<KeyValuePair<string, string>> changed;
private Dictionary<string, string> settingsCache;
private readonly CancellationTokenSource cts = new CancellationTokenSource();
private Task monitoringTask;
private readonly TimeSpan interval;

private readonly SemaphoreSlim timerSemaphore = new SemaphoreSlim(1);


...
public ExternalConfigurationManager(string environment) : this(new BlobSettingsStore(environment),
TimeSpan.FromSeconds(15), environment)
{
}

public ExternalConfigurationManager(ISettingsStore settings, TimeSpan interval, string environment)


{
this.settings = settings;
this.interval = interval;
this.CheckForConfigurationChangesAsync().Wait();
this.changed = new Subject<KeyValuePair<string, string>>();
this.Environment = environment;
}
...
/// <summary>
/// Check to see if the current instance is monitoring for changes
/// </summary>
public bool IsMonitoring => this.monitoringTask != null && !this.monitoringTask.IsCompleted;

/// <summary>
/// Start the background monitoring for configuration changes in the central store
/// </summary>
public void StartMonitor()
{
if (this.IsMonitoring)
return;

try
{
this.timerSemaphore.Wait();

// Check again to make sure we are not already running.


if (this.IsMonitoring)
return;

// Start running our task loop.


this.monitoringTask = ConfigChangeMonitor();
}
finally
{
this.timerSemaphore.Release();
}
}

/// <summary>
/// Loop that monitors for configuration changes
/// </summary>
/// <returns></returns>
public async Task ConfigChangeMonitor()
{
while (!cts.Token.IsCancellationRequested)
{
await this.CheckForConfigurationChangesAsync();
await Task.Delay(this.interval, cts.Token);
}
}

/// <summary>
/// Stop monitoring for configuration changes
/// </summary>
public void StopMonitor()
{
try
{
this.timerSemaphore.Wait();

// Signal the task to stop.


this.cts.Cancel();

// Wait for the loop to stop.


this.monitoringTask.Wait();
this.monitoringTask = null;
}
finally
{
this.timerSemaphore.Release();
}
}

public void Dispose()


{
this.cts.Cancel();
}
...
}

The ExternalConfigurationManager class is instantiated as a singleton instance by the ExternalConfiguration class


shown below.

public static class ExternalConfiguration


{
private static readonly Lazy<ExternalConfigurationManager> configuredInstance = new
Lazy<ExternalConfigurationManager>(
() =>
{
var environment = CloudConfigurationManager.GetSetting("environment");
return new ExternalConfigurationManager(environment);
});

public static ExternalConfigurationManager Instance => configuredInstance.Value;


}

The following code is taken from the WorkerRole class in the ExternalConfigurationStore.Cloud project. It shows
how the application uses the ExternalConfiguration class to read a setting.

public override void Run()


{
// Start monitoring configuration changes.
ExternalConfiguration.Instance.StartMonitor();

// Get a setting.
var setting = ExternalConfiguration.Instance.GetAppSetting("setting1");
Trace.TraceInformation("Worker Role: Get setting1, value: " + setting);

this.completeEvent.WaitOne();
}

The following code, also from the WorkerRole class, shows how the application subscribes to configuration events.

public override bool OnStart()


{
...
// Subscribe to the event.
ExternalConfiguration.Instance.Changed.Subscribe(
m => Trace.TraceInformation("Configuration has changed. Key:{0} Value:{1}",
m.Key, m.Value),
ex => Trace.TraceError("Error detected: " + ex.Message));
...
}
Related patterns and guidance
A sample that demonstrates this pattern is available on GitHub.
Federated Identity pattern

Delegate authentication to an external identity provider. This can simplify development, minimize the requirement
for user administration, and improve the user experience of the application.

Context and problem


Users typically need to work with multiple applications provided and hosted by different organizations they have a
business relationship with. These users might be required to use specific (and different) credentials for each one.
This can:
Cause a disjointed user experience. Users often forget sign-in credentials when they have many
different ones.
Expose security vulnerabilities. When a user leaves the company the account must immediately be
deprovisioned. It's easy to overlook this in large organizations.
Complicate user management. Administrators must manage credentials for all of the users, and perform
additional tasks such as providing password reminders.
Users typically prefer to use the same credentials for all these applications.

Solution
Implement an authentication mechanism that can use federated identity. Separate user authentication from the
application code, and delegate authentication to a trusted identity provider. This can simplify development and
allow users to authenticate using a wider range of identity providers (IdP) while minimizing the administrative
overhead. It also allows you to clearly decouple authentication from authorization.
The trusted identity providers include corporate directories, on-premises federation services, other security token
services (STS) provided by business partners, or social identity providers that can authenticate users who have, for
example, a Microsoft, Google, Yahoo!, or Facebook account.
The figure illustrates the Federated Identity pattern when a client application needs to access a service that requires
authentication. The authentication is performed by an IdP that works in concert with an STS. The IdP issues security
tokens that provide information about the authenticated user. This information, referred to as claims, includes the
user’s identity, and might also include other information such as role membership and more granular access
rights.
This model is often called claims-based access control. Applications and services authorize access to features and
functionality based on the claims contained in the token. The service that requires authentication must trust the IdP.
The client application contacts the IdP that performs the authentication. If the authentication is successful, the IdP
returns a token containing the claims that identify the user to the STS (note that the IdP and STS can be the same
service). The STS can transform and augment the claims in the token based on predefined rules, before returning it
to the client. The client application can then pass this token to the service as proof of its identity.

There might be additional security token services in the chain of trust. For example, in the scenario described
later, an on-premises STS trusts another STS that is responsible for accessing an identity provider to
authenticate the user. This approach is common in enterprise scenarios where there's an on-premises STS and
directory.
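As a small illustration of the claims-based access control described above, the following sketch inspects the
claims on a ClaimsPrincipal that the hosting platform has already produced from a validated token. The role claim
value is a hypothetical example.

using System.Security.Claims;

public static class ClaimsAuthorization
{
    // Decides whether the authenticated user can approve orders, based only on
    // the claims issued by the identity provider or STS.
    public static bool CanApproveOrders(ClaimsPrincipal user)
    {
        return user.Identity != null
            && user.Identity.IsAuthenticated
            && user.HasClaim(ClaimTypes.Role, "OrderApprover");
    }
}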

Federated authentication provides a standards-based solution to the issue of trusting identities across diverse
domains, and can support single sign-on. It's becoming more common across all types of applications, especially
cloud-hosted applications, because it supports single sign-on without requiring a direct network connection to
identity providers. The user doesn't have to enter credentials for every application. This increases security because
it prevents the creation of credentials required to access many different applications, and it also hides the user’s
credentials from all but the original identity provider. Applications see just the authenticated identity information
contained within the token.
Federated identity also has the major advantage that management of the identity and credentials is the
responsibility of the identity provider. The application or service doesn't need to provide identity management
features. In addition, in corporate scenarios, the corporate directory doesn't need to know about the user if it trusts
the identity provider. This removes all the administrative overhead of managing the user identity within the
directory.

Issues and considerations


Consider the following when designing applications that implement federated authentication:
Authentication can be a single point of failure. If you deploy your application to multiple datacenters,
consider deploying your identity management mechanism to the same datacenters to maintain application
reliability and availability.
Authentication tools make it possible to configure access control based on role claims contained in the
authentication token. This is often referred to as role-based access control (RBAC), and it can allow a more
granular level of control over access to features and resources.
Unlike a corporate directory, claims-based authentication using social identity providers doesn't usually
provide information about the authenticated user other than an email address, and perhaps a name. Some
social identity providers, such as a Microsoft account, provide only a unique identifier. The application
usually needs to maintain some information on registered users, and be able to match this information to
the identifier contained in the claims in the token. Typically this is done through registration when the user
first accesses the application, and information is then injected into the token as additional claims after each
authentication.
If there's more than one identity provider configured for the STS, it must detect which identity provider the
user should be redirected to for authentication. This process is called home realm discovery. The STS might
be able to do this automatically based on an email address or user name that the user provides, a
subdomain of the application that the user is accessing, the user’s IP address scope, or on the contents of a
cookie stored in the user’s browser. For example, if the user entered an email address in the Microsoft
domain, such as [email protected], the STS will redirect the user to the Microsoft account sign-in page. On later
visits, the STS could use a cookie to indicate that the last sign in was with a Microsoft account. If automatic
discovery can't determine the home realm, the STS will display a home realm discovery page that lists the
trusted identity providers, and the user must select the one they want to use.

When to use this pattern


This pattern is useful for scenarios such as:
Single sign-on in the enterprise. In this scenario you need to authenticate employees for corporate
applications that are hosted in the cloud outside the corporate security boundary, without requiring them to
sign in every time they visit an application. The user experience is the same as when using on-premises
applications where they're authenticated when signing in to a corporate network, and from then on have
access to all relevant applications without needing to sign in again.
Federated identity with multiple partners. In this scenario you need to authenticate both corporate
employees and business partners who don't have accounts in the corporate directory. This is common in
business-to-business applications, applications that integrate with third-party services, and where
companies with different IT systems have merged or shared resources.
Federated identity in SaaS applications. In this scenario independent software vendors provide a
ready-to-use service for multiple clients or tenants. Each tenant authenticates using a suitable identity
provider. For example, business users will use their corporate credentials, while consumers and clients of the
tenant will use their social identity credentials.
This pattern might not be useful in the following situations:
All users of the application can be authenticated by one identity provider, and there's no requirement to
authenticate using any other identity provider. This is typical in business applications that use a corporate
directory (accessible within the application) for authentication, by using a VPN, or (in a cloud-hosted
scenario) through a virtual network connection between the on-premises directory and the application.
The application was originally built using a different authentication mechanism, perhaps with custom user
stores, or doesn't have the capability to handle the negotiation standards used by claims-based
technologies. Retrofitting claims-based authentication and access control into existing applications can be
complex, and probably not cost effective.

Example
An organization hosts a multi-tenant software as a service (SaaS) application in Microsoft Azure. The application
includes a website that tenants can use to manage the application for their own users. The application allows
tenants to access the website by using a federated identity that is generated by Active Directory Federation
Services (AD FS) when a user is authenticated by that organization’s own Active Directory.

The figure shows how tenants authenticate with their own identity provider (step 1), in this case AD FS. After
successfully authenticating a tenant, AD FS issues a token. The client browser forwards this token to the SaaS
application’s federation provider, which trusts tokens issued by the tenant’s AD FS, in order to get back a token that
is valid for the SaaS federation provider (step 2). If necessary, the SaaS federation provider performs a
transformation on the claims in the token into claims that the application recognizes (step 3) before returning the
new token to the client browser. The application trusts tokens issued by the SaaS federation provider and uses the
claims in the token to apply authorization rules (step 4).
Tenants won't need to remember separate credentials to access the application, and an administrator at the
tenant’s company can configure the list of users that can access the application in its own AD FS.
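The claims transformation in step 3 can be as simple as mapping tenant-specific claim types onto claims the application understands. The following C# sketch uses the standard System.Security.Claims types and is illustrative only; the claim type URI, group value, and role name are hypothetical, and the actual mapping depends on what each tenant's identity provider issues.

using System.Linq;
using System.Security.Claims;

public static class ClaimsTransformer
{
    // Map the claims issued by a tenant's identity provider onto the claim
    // types and values that the SaaS application expects (illustrative only).
    public static ClaimsPrincipal Transform(ClaimsPrincipal incomingPrincipal)
    {
        var incomingIdentity = incomingPrincipal.Identities.First();
        var outgoingIdentity = new ClaimsIdentity("Federation");

        // Pass through the user's stable identifier and display name.
        CopyClaim(incomingIdentity, outgoingIdentity, ClaimTypes.NameIdentifier);
        CopyClaim(incomingIdentity, outgoingIdentity, ClaimTypes.Name);

        // Translate a tenant-specific group claim into an application role.
        // The claim type and values below are hypothetical placeholders.
        if (incomingIdentity.HasClaim("http://schemas.contoso.com/claims/group", "AppAdmins"))
        {
            outgoingIdentity.AddClaim(new Claim(ClaimTypes.Role, "Administrator"));
        }

        return new ClaimsPrincipal(outgoingIdentity);
    }

    private static void CopyClaim(ClaimsIdentity from, ClaimsIdentity to, string claimType)
    {
        var claim = from.FindFirst(claimType);
        if (claim != null)
        {
            to.AddClaim(new Claim(claimType, claim.Value));
        }
    }
}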

Related guidance
Microsoft Azure Active Directory
Active Directory Domain Services
Active Directory Federation Services
Identity management for multitenant applications in Microsoft Azure
Multitenant Applications in Azure
Gatekeeper pattern

Protect applications and services by using a dedicated host instance that acts as a broker between clients and the
application or service, validates and sanitizes requests, and passes requests and data between them. This can
provide an additional layer of security, and limit the attack surface of the system.

Context and problem


Applications expose their functionality to clients by accepting and processing requests. In cloud-hosted scenarios,
applications expose endpoints clients connect to, and typically include the code to handle the requests from clients.
This code performs authentication and validation, some or all request processing, and is likely to access storage
and other services on behalf of the client.
If a malicious user is able to compromise the system and gain access to the application’s hosting environment, the
security mechanisms it uses such as credentials and storage keys, and the services and data it accesses, are
exposed. As a result, the malicious user can gain unrestrained access to sensitive information and other services.

Solution
To minimize the risk of clients gaining access to sensitive information and services, decouple hosts or tasks that
expose public endpoints from the code that processes requests and accesses storage. You can achieve this by using
a façade or a dedicated task that interacts with clients and then hands off the request—perhaps through a
decoupled interface—to the hosts or tasks that'll handle the request. The figure provides a high-level overview of
this pattern.

The gatekeeper pattern can be used to simply protect storage, or it can be used as a more comprehensive façade
to protect all of the functions of the application. The important factors are:
Controlled validation. The gatekeeper validates all requests, and rejects those that don't meet validation
requirements.
Limited risk and exposure. The gatekeeper doesn't have access to the credentials or keys used by the trusted
host to access storage and services. If the gatekeeper is compromised, the attacker doesn't get access to these
credentials or keys.
Appropriate security. The gatekeeper runs in a limited privilege mode, while the rest of the application runs
in the full trust mode required to access storage and services. If the gatekeeper is compromised, it can't directly
access the application services or data.
This pattern acts like a firewall in a typical network topography. It allows the gatekeeper to examine requests and
make a decision about whether to pass the request on to the trusted host that performs the required tasks. This
decision typically requires the gatekeeper to validate and sanitize the request content before passing it on to the
trusted host.

Issues and considerations


Consider the following points when deciding how to implement this pattern:
Ensure that the trusted hosts the gatekeeper passes requests to expose only internal or protected endpoints,
and connect only to the gatekeeper. The trusted hosts shouldn't expose any external endpoints or interfaces.
The gatekeeper must run in a limited privilege mode. Typically this means running the gatekeeper and the
trusted host in separate hosted services or virtual machines.
The gatekeeper shouldn't perform any processing related to the application or services, or access any data. Its
function is purely to validate and sanitize requests. The trusted hosts might need to perform additional
validation of requests, but the core validation should be performed by the gatekeeper.
Use a secure communication channel (HTTPS, SSL, or TLS) between the gatekeeper and the trusted hosts or
tasks where this is possible. However, some hosting environments don't support HTTPS on internal endpoints.
Adding the extra layer to the application to implement the gatekeeper pattern is likely to have some impact on
performance due to the additional processing and network communication it requires.
The gatekeeper instance could be a single point of failure. To minimize the impact of a failure, consider
deploying additional instances and using an autoscaling mechanism to ensure capacity to maintain availability.

When to use this pattern


This pattern is useful for:
Applications that handle sensitive information, expose services that must have a high degree of protection from
malicious attacks, or perform mission-critical operations that shouldn't be disrupted.
Distributed applications where it's necessary to perform request validation separately from the main tasks, or
to centralize this validation to simplify maintenance and administration.

Example
In a cloud-hosted scenario, this pattern can be implemented by decoupling the gatekeeper role or virtual machine
from the trusted roles and services in an application. Do this by using an internal endpoint, a queue, or storage as
an intermediate communication mechanism. The figure illustrates using an internal endpoint.
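As an illustration only, the following C# sketch shows a gatekeeper-style endpoint that validates and sanitizes an incoming request and then hands it off to a trusted host over a decoupled interface, rather than processing it or touching storage itself. The queue abstraction, message type, and validation rules are hypothetical placeholders; a real implementation would use whichever internal endpoint, queue, or storage mechanism fits the hosting environment.

using System;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public class GatekeeperEndpoint
{
    // Hypothetical decoupled interface to the trusted host (for example, a queue).
    private readonly IRequestQueue trustedHostQueue;

    public GatekeeperEndpoint(IRequestQueue trustedHostQueue)
    {
        this.trustedHostQueue = trustedHostQueue;
    }

    // The gatekeeper only validates and sanitizes; it holds no storage keys or
    // service credentials, so compromising it exposes nothing sensitive.
    public async Task<bool> HandleAsync(string customerId, string payload)
    {
        // Reject anything that doesn't meet the validation rules.
        if (string.IsNullOrWhiteSpace(customerId) ||
            !Regex.IsMatch(customerId, "^[A-Za-z0-9]{1,32}$") ||
            payload == null || payload.Length > 64 * 1024)
        {
            return false;
        }

        // Sanitize the payload before handing it to the trusted host.
        var sanitized = payload.Replace("\0", string.Empty).Trim();

        // Hand off to the trusted host through the decoupled interface.
        await trustedHostQueue.EnqueueAsync(new WorkRequest(customerId, sanitized));
        return true;
    }
}

// Hypothetical abstractions standing in for the real queue and message type.
public interface IRequestQueue
{
    Task EnqueueAsync(WorkRequest request);
}

public class WorkRequest
{
    public WorkRequest(string customerId, string payload)
    {
        CustomerId = customerId;
        Payload = payload;
    }

    public string CustomerId { get; }
    public string Payload { get; }
}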

Related patterns
The Valet Key pattern might also be relevant when implementing the Gatekeeper pattern. When communicating
between the gatekeeper and trusted roles, it's good practice to enhance security by using keys or tokens that limit
permissions for accessing resources. The Valet Key pattern describes how to use a token or key that provides
clients with restricted direct access to a specific resource or service.
Gateway Aggregation pattern

Use a gateway to aggregate multiple individual requests into a single request. This pattern is useful when a client
must make multiple calls to different backend systems to perform an operation.

Context and problem


To perform a single task, a client may have to make multiple calls to various backend services. An application that
relies on many services to perform a task must expend resources on each request. When any new feature or
service is added to the application, additional requests are needed, further increasing resource requirements and
network calls. This chattiness between a client and a backend can adversely impact the performance and scale of
the application. Microservice architectures have made this problem more common, as applications built around
many smaller services naturally have a higher amount of cross-service calls.
In the following diagram, the client sends requests to each service (1,2,3). Each service processes the request and
sends the response back to the application (4,5,6). Over a cellular network with typically high latency, using
individual requests in this manner is inefficient and could result in broken connectivity or incomplete requests.
While each request may be done in parallel, the application must send, wait, and process data for each request, all
on separate connections, increasing the chance of failure.

Solution
Use a gateway to reduce chattiness between the client and the services. The gateway receives client requests,
dispatches requests to the various backend systems, and then aggregates the results and sends them back to the
requesting client.
This pattern can reduce the number of requests that the application makes to backend services, and improve
application performance over high-latency networks.
In the following diagram, the application sends a request to the gateway (1). The request contains a package of
additional requests. The gateway decomposes these and processes each request by sending it to the relevant
service (2). Each service returns a response to the gateway (3). The gateway combines the responses from each
service and sends the response to the application (4). The application makes a single request and receives only a
single response from the gateway.
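A minimal sketch of the gateway's aggregation step might look like the following C# fragment, which fans out the decomposed requests in parallel with HttpClient and combines the responses into a single result. The backend URLs and response shape are assumptions for illustration; a production gateway would add timeouts, retries, and partial-failure handling as discussed in the considerations below.

using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

public class AggregationGateway
{
    private static readonly HttpClient client = new HttpClient();

    // Fan out one request per backend service and combine the raw responses,
    // keyed by the URL that produced each one.
    public async Task<IDictionary<string, string>> GetAggregateAsync(IEnumerable<string> backendUrls)
    {
        var tasks = backendUrls.ToDictionary(
            url => url,
            url => client.GetStringAsync(url));

        await Task.WhenAll(tasks.Values);

        // All tasks have completed at this point, so reading .Result doesn't block.
        return tasks.ToDictionary(pair => pair.Key, pair => pair.Value.Result);
    }
}
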
Issues and considerations
The gateway should not introduce service coupling across the backend services.
The gateway should be located near the backend services to reduce latency as much as possible.
The gateway service may introduce a single point of failure. Ensure the gateway is properly designed to meet
your application's availability requirements.
The gateway may introduce a bottleneck. Ensure the gateway has adequate performance to handle load and
can be scaled to meet your anticipated growth.
Perform load testing against the gateway to ensure you don't introduce cascading failures for services.
Implement a resilient design, using techniques such as bulkheads, circuit breaking, retry, and timeouts.
If one or more of the service calls takes too long, it may be acceptable to time out and return a partial set of data.
Consider how your application will handle this scenario.
Use asynchronous I/O to ensure that a delay at the backend doesn't cause performance issues in the
application.
Implement distributed tracing using correlation IDs to track each individual call.
Monitor request metrics and response sizes.
Consider returning cached data as a failover strategy to handle failures.
Instead of building aggregation into the gateway, consider placing an aggregation service behind the gateway.
Request aggregation will likely have different resource requirements than other services in the gateway and
may impact the gateway's routing and offloading functionality.

When to use this pattern


Use this pattern when:
A client needs to communicate with multiple backend services to perform an operation.
The client may use networks with significant latency, such as cellular networks.
This pattern may not be suitable when:
You want to reduce the number of calls between a client and a single service across multiple operations. In
that scenario, it may be better to add a batch operation to the service.
The client or application is located near the backend services and latency is not a significant factor.

Example
The following example illustrates how to create a simple gateway aggregation NGINX service using Lua.

worker_processes 4;

events {
  worker_connections 1024;
}

http {
  server {
    listen 80;

    location = /batch {
      content_by_lua '
        ngx.req.read_body()

        -- read json body content
        local cjson = require "cjson"
        local batch = cjson.decode(ngx.req.get_body_data())["batch"]

        -- create capture_multi table
        local requests = {}
        for i, item in ipairs(batch) do
          table.insert(requests, {item.relative_url, { method = ngx.HTTP_GET}})
        end

        -- execute batch requests in parallel
        local results = {}
        local resps = { ngx.location.capture_multi(requests) }
        for i, res in ipairs(resps) do
          table.insert(results, {status = res.status, body = cjson.decode(res.body), header = res.header})
        end

        ngx.say(cjson.encode({results = results}))
      ';
    }

    location = /service1 {
      default_type application/json;
      echo '{"attr1":"val1"}';
    }

    location = /service2 {
      default_type application/json;
      echo '{"attr2":"val2"}';
    }
  }
}
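
As an illustration, a client could call the /batch endpoint above by posting a JSON body that lists the relative URLs to aggregate. The following C# snippet sketches one such call; the gateway host name is a placeholder.

using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class BatchClient
{
    private static readonly HttpClient client = new HttpClient();

    public static async Task<string> CallBatchAsync()
    {
        // The Lua handler expects a "batch" array of objects with a relative_url field.
        var body = "{\"batch\":[{\"relative_url\":\"/service1\"},{\"relative_url\":\"/service2\"}]}";

        var response = await client.PostAsync(
            "http://gateway.example.com/batch",   // placeholder gateway address
            new StringContent(body, Encoding.UTF8, "application/json"));

        return await response.Content.ReadAsStringAsync();
    }
}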

Related guidance
Backends for Frontends pattern
Gateway Offloading pattern
Gateway Routing pattern
Gateway Offloading pattern

Offload shared or specialized service functionality to a gateway proxy. This pattern can simplify application
development by moving shared service functionality, such as the use of SSL certificates, from other parts of the
application into the gateway.

Context and problem


Some features are commonly used across multiple services, and these features require configuration,
management, and maintenance. A shared or specialized service that is distributed with every application
deployment increases the administrative overhead and increases the likelihood of deployment error. Any updates
to a shared feature must be deployed across all services that share that feature.
Properly handling security issues (token validation, encryption, SSL certificate management) and other complex
tasks can require team members to have highly specialized skills. For example, a certificate needed by an
application must be configured and deployed on all application instances. With each new deployment, the
certificate must be managed to ensure that it does not expire. Any common certificate that is due to expire must
be updated, tested, and verified on every application deployment.
Other common services such as authentication, authorization, logging, monitoring, or throttling can be difficult
to implement and manage across a large number of deployments. It may be better to consolidate this type of
functionality, in order to reduce overhead and the chance of errors.

Solution
Offload some features into an API gateway, particularly cross-cutting concerns such as certificate management,
authentication, SSL termination, monitoring, protocol translation, or throttling.
The following diagram shows an API gateway that terminates inbound SSL connections. It requests data on
behalf of the original requestor from any HTTP server upstream of the API gateway.

Benefits of this pattern include:


Simplify the development of services by removing the need to distribute and maintain supporting
resources, such as web server certificates and configuration for secure websites. Simpler configuration
results in easier management and scalability and makes service upgrades simpler.
Allow dedicated teams to implement features that require specialized expertise, such as security. This
allows your core team to focus on the application functionality, leaving these specialized but cross-cutting
concerns to the relevant experts.
Provide some consistency for request and response logging and monitoring. Even if a service is not
correctly instrumented, the gateway can be configured to ensure a minimum level of monitoring and
logging.

Issues and considerations


Ensure the API gateway is highly available and resilient to failure. Avoid single points of failure by running
multiple instances of your API gateway.
Ensure the gateway is designed for the capacity and scaling requirements of your application and endpoints.
Make sure the gateway does not become a bottleneck for the application and is sufficiently scalable.
Only offload features that are used by the entire application, such as security or data transfer.
Business logic should never be offloaded to the API gateway.
If you need to track transactions, consider generating correlation IDs for logging purposes.

When to use this pattern


Use this pattern when:
An application deployment has a shared concern such as SSL certificates or encryption.
A feature is common across application deployments but may have different resource requirements, such as
memory, storage capacity, or network connections.
You wish to move the responsibility for issues such as network security, throttling, or other network boundary
concerns to a more specialized team.
This pattern may not be suitable if it introduces coupling across services.

Example
Using Nginx as the SSL offload appliance, the following configuration terminates an inbound SSL connection and
distributes the connection to one of three upstream HTTP servers.

upstream iis {
  server 10.3.0.10 max_fails=3 fail_timeout=15s;
  server 10.3.0.20 max_fails=3 fail_timeout=15s;
  server 10.3.0.30 max_fails=3 fail_timeout=15s;
}

server {
  listen 443;
  ssl on;
  ssl_certificate /etc/nginx/ssl/domain.cer;
  ssl_certificate_key /etc/nginx/ssl/domain.key;

  location / {
    set $targ iis;
    proxy_pass http://$targ;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto https;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header Host $host;
  }
}

On Azure, this can be achieved by setting up SSL termination on Application Gateway.

Related guidance
Backends for Frontends pattern
Gateway Aggregation pattern
Gateway Routing pattern
Gateway Routing pattern

Route requests to multiple services using a single endpoint. This pattern is useful when you wish to expose
multiple services on a single endpoint and route to the appropriate service based on the request.

Context and problem


When a client needs to consume multiple services, setting up a separate endpoint for each service and having
the client manage each endpoint can be challenging. For example, an e-commerce application might provide
services such as search, reviews, cart, checkout, and order history. Each service has a different API that the client
must interact with, and the client must know about each endpoint in order to connect to the services. If an API
changes, the client must be updated as well. If you refactor a service into two or more separate services, the code
must change in both the service and the client.

Solution
Place a gateway in front of a set of applications, services, or deployments. Use application Layer 7 routing to
route the request to the appropriate instances.
With this pattern, the client application only needs to know about and communicate with a single endpoint. If a
service is consolidated or decomposed, the client does not necessarily require updating. It can continue making
requests to the gateway, and only the routing changes.
A gateway also lets you abstract backend services from the clients, allowing you to keep client calls simple while
enabling changes in the backend services behind the gateway. Client calls can be routed to whatever service or
services need to handle the expected client behavior, allowing you to add, split, and reorganize services behind
the gateway without changing the client.
This pattern can also help with deployment, by allowing you to manage how updates are rolled out to users.
When a new version of your service is deployed, it can be deployed in parallel with the existing version. Routing
lets you control what version of the service is presented to the clients, giving you the flexibility to use various
release strategies, whether incremental, parallel, or complete rollouts of updates. Any issues discovered after the
new service is deployed can be quickly reverted by making a configuration change at the gateway, without
affecting clients.

Issues and considerations


The gateway service may introduce a single point of failure. Ensure it is properly designed to meet your
availability requirements. Consider resiliency and fault tolerance capabilities when implementing.
The gateway service may introduce a bottleneck. Ensure the gateway has adequate performance to handle
load and can easily scale in line with your growth expectations.
Perform load testing against the gateway to ensure you don't introduce cascading failures for services.
Gateway routing operates at layer 7. It can be based on IP, port, header, or URL.

When to use this pattern


Use this pattern when:
A client needs to consume multiple services that can be accessed behind a gateway.
You wish to simplify client applications by using a single endpoint.
You need to route requests from externally addressable endpoints to internal virtual endpoints, such as
exposing ports on a VM to cluster virtual IP addresses.
This pattern may not be suitable when you have a simple application that uses only one or two services.

Example
Using Nginx as the router, the following is a simple example configuration file for a server that routes requests
for applications residing on different virtual directories to different machines at the back end.

server {
  listen 80;
  server_name domain.com;

  location /app1 {
    proxy_pass http://10.0.3.10:80;
  }

  location /app2 {
    proxy_pass http://10.0.3.20:80;
  }

  location /app3 {
    proxy_pass http://10.0.3.30:80;
  }
}

On Azure, multiple services can be set up behind an Application Gateway instance, which provides layer-7
routing.

Related guidance
Backends for Frontends pattern
Gateway Aggregation pattern
Gateway Offloading pattern
Geodes

The geode pattern involves deploying a collection of backend services into a set of geographical nodes, each of
which can service any request for any client in any region. This pattern allows serving requests in an active-active
style, improving latency and increasing availability by distributing request processing around the globe.

Context and problem


Many large-scale services have specific challenges around geo-availability and scale. Classic designs often bring
the data to the compute by storing data in a remote SQL server that serves as the compute tier for that data,
relying on scale-up for growth.
The classic approach may present a number of challenges:
Network latency issues for users coming from the other side of the globe to connect to the hosting endpoint
Traffic management for demand bursts that can overwhelm the services in a single region
Cost-prohibitive complexity of deploying copies of app infrastructure into multiple regions for a 24x7 service
Modern cloud infrastructure has evolved to enable geographic load balancing of front-end services, while
allowing for geographic replication of backend services. For availability and performance, getting data closer to
the user is good. When data is geo-distributed across a far-flung user base, the geo-distributed datastores should
also be colocated with the compute resources that process the data. The geode pattern brings the compute to the
data.

Solution
Deploy the service into a number of satellite deployments spread around the globe, each of which is called a
geode. The geode pattern harnesses key features of Azure to route traffic via the shortest path to a nearby geode,
which improves latency and performance. Each geode is behind a global load balancer, and uses a geo-replicated
read-write service like Azure Cosmos DB to host the data plane, ensuring cross-geode data consistency. Data
replication services ensure that data stores are identical across geodes, so all requests can be served from all
geodes.
The key difference between a deployment stamp and a geode is that geodes never exist in isolation. There should
always be more than one geode in a production platform.
Geodes have the following characteristics:
Consist of a collection of disparate types of resources, often defined in a template.
Have no dependencies outside of the geode footprint and are self-contained. No geode is dependent on
another to operate, and if one dies, the others continue to operate.
Are loosely coupled via an edge network and replication backplane. For example, you can use Azure Traffic
Manager or Azure Front Door for fronting the geodes, while Azure Cosmos DB can act as the replication
backplane. Geodes are not the same as clusters because they share a replication backplane, so the platform
takes care of quorum issues.
The geode pattern occurs in big data architectures that use commodity hardware to process data colocated on the
same machine, and MapReduce to consolidate results across machines. Another usage is near-edge compute,
which brings compute closer to the intelligent edge of the network to reduce response time.
Services can use this pattern over dozens or hundreds of geodes. Furthermore, the resiliency of the whole
solution increases with each added geode, since any geodes can take over if a regional outage takes one or more
geodes offline.
It's also possible to augment local availability techniques, such as availability zones or paired regions, with the
geode pattern for global availability. This increases complexity, but is useful if your architecture is underpinned by
a storage engine such as blob storage that can only replicate to a paired region. You can deploy geodes into an
intra-zone, zonal, or regional footprint, with a mind to regulatory or latency constraints on location.

Issues and considerations


Use the following techniques and technologies to implement this pattern:
Modern DevOps practices and tools to produce and rapidly deploy identical geodes across a large number of
regions or instances.
Autoscaling to scale out compute and database throughput instances within a geode. Each geode individually
scales out, within the common backplane constraints.
A front-end service like Azure Front Door that does dynamic content acceleration, split TCP, and Anycast
routing.
A replicating data store like Azure Cosmos DB to control data consistency.
Serverless technologies where possible, to reduce always-on deployment cost, especially when load is
frequently rebalanced around the globe. This strategy allows for many geodes to be deployed with minimal
additional investment. Serverless and consumption-based billing technologies reduce waste and cost from
duplicate geo-distributed deployments.
Consider the following points when deciding how to implement this pattern:
Choose whether to process data locally in each region, or to distribute aggregations in a single geode and
replicate the result across the globe. The Azure Cosmos DB change feed processor offers this granular control
using its lease container concept, and the LeaseCollectionPrefix setting in the corresponding Azure Functions
binding (a minimal trigger sketch follows this list). Each approach has distinct advantages and drawbacks.
Geodes can work in tandem, using the Azure Cosmos DB change feed and a real-time communication
platform like SignalR. Geodes can communicate with remote users via other geodes in a mesh pattern, without
knowing or caring where the remote user is located.
This design pattern implicitly decouples everything, resulting in an ultra-highly distributed and decoupled
architecture. Decoupling is a good thing, but consider how to track different components of the same request
as they might execute asynchronously on different instances. Get a good monitoring strategy in place.
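As a sketch of the change feed consideration above, the following Azure Functions trigger (assuming the Microsoft.Azure.WebJobs.Extensions.CosmosDB binding) uses a lease collection prefix so that each geode tracks its own change feed progress. The database, container, connection setting, and prefix values are placeholders, not definitive names.

using System.Collections.Generic;
using Microsoft.Azure.Documents;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class GeodeChangeFeedProcessor
{
    // Each geode deployment uses its own LeaseCollectionPrefix (for example,
    // its region name) so the geodes consume the change feed independently.
    [FunctionName("ProcessChanges")]
    public static void Run(
        [CosmosDBTrigger(
            databaseName: "appdb",                      // placeholder
            collectionName: "orders",                   // placeholder
            ConnectionStringSetting = "CosmosDBConnection",
            LeaseCollectionName = "leases",
            LeaseCollectionPrefix = "westus2-",         // placeholder, one per geode
            CreateLeaseCollectionIfNotExists = true)]
        IReadOnlyList<Document> changes,
        ILogger log)
    {
        foreach (var document in changes)
        {
            // Apply local aggregation or forward the change to downstream services.
            log.LogInformation("Processing document {Id}", document.Id);
        }
    }
}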

When to use this pattern


Use this pattern:
To implement a high-scale platform that has users distributed over a wide area.
For any service that requires extreme availability and resilience characteristics, because services based on the
geode pattern can survive the loss of multiple service regions at the same time.
This pattern might not be suitable for:
Architectures that have constraints so that all geodes can't be equal for data storage. For example, there may
be data residency requirements, an application that needs to maintain temporary state for a particular session,
or a heavy weighting of requests towards a single region. In this case, consider using deployment stamps in
combination with a global routing plane that is aware of where a user’s data sits, such as the traffic routing
component described within the deployment stamps pattern.
Situations where there's no geographical distribution required. Instead, consider availability zones and paired
regions for clustering.
Situations where a legacy platform needs to be retrofitted. This pattern works for cloud-native development
only, and can be difficult to retrofit.
Simple architectures and requirements, where geo-redundancy and geo-distribution aren't required or
advantageous.

Examples
Windows Active Directory implements an early variant of this pattern. Multi-primary replication means all
updates and requests can in theory be served from all serviceable nodes, but FSMO roles mean that all geodes
aren't equal.
A QnA sample application on GitHub showcases this design pattern in practice.
Health Endpoint Monitoring pattern

Implement functional checks in an application that external tools can access through exposed endpoints at
regular intervals. This can help to verify that applications and services are performing correctly.

Context and problem


It's a good practice, and often a business requirement, to monitor web applications and back-end services, to
ensure they're available and performing correctly. However, it's more difficult to monitor services running in
the cloud than it is to monitor on-premises services. For example, you don't have full control of the hosting
environment, and the services typically depend on other services provided by platform vendors and others.
There are many factors that affect cloud-hosted applications such as network latency, the performance and
availability of the underlying compute and storage systems, and the network bandwidth between them. The
service can fail entirely or partially due to any of these factors. Therefore, you must verify at regular intervals
that the service is performing correctly to ensure the required level of availability, which might be part of your
service level agreement (SLA).

Solution
Implement health monitoring by sending requests to an endpoint on the application. The application should
perform the necessary checks, and return an indication of its status.
A health monitoring check typically combines two factors:
The checks (if any) performed by the application or service in response to the request to the health
verification endpoint.
Analysis of the results by the tool or framework that performs the health verification check.
The response code indicates the status of the application and, optionally, any components or services it uses.
The latency or response time check is performed by the monitoring tool or framework. The figure provides an
overview of the pattern.
Other checks that might be carried out by the health monitoring code in the application include:
Checking cloud storage or a database for availability and response time.
Checking other resources or services located in the application, or located elsewhere but used by the
application.
Services and tools are available that monitor web applications by submitting a request to a configurable set of
endpoints, and evaluating the results against a set of configurable rules. It's relatively easy to create a service
endpoint whose sole purpose is to perform some functional tests on the system.
Typical checks that can be performed by the monitoring tools include:
Validating the response code. For example, an HTTP response of 200 (OK) indicates that the application
responded without error. The monitoring system might also check for other response codes to give more
comprehensive results.
Checking the content of the response to detect errors, even when a 200 (OK) status code is returned. This
can detect errors that affect only a section of the returned web page or service response. For example,
checking the title of a page or looking for a specific phrase that indicates the correct page was returned.
Measuring the response time, which indicates a combination of the network latency and the time that the
application took to execute the request. An increasing value can indicate an emerging problem with the
application or network.
Checking resources or services located outside the application, such as a content delivery network used by
the application to deliver content from global caches.
Checking for expiration of SSL certificates.
Measuring the response time of a DNS lookup for the URL of the application to measure DNS latency and
DNS failures.
Validating the URL returned by the DNS lookup to ensure correct entries. This can help to avoid malicious
request redirection through a successful attack on the DNS server.
It's also useful, where possible, to run these checks from different on-premises or hosted locations to measure
and compare response times. Ideally you should monitor applications from locations that are close to
customers to get an accurate view of the performance from each location. In addition to providing a more
robust checking mechanism, the results can help you decide on the deployment location for the application—
and whether to deploy it in more than one datacenter.
Tests should also be run against all the service instances that customers use to ensure the application is
working correctly for all customers. For example, if customer storage is spread across more than one storage
account, the monitoring process should check all of these.
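As a rough illustration of the checks listed above, a custom monitoring agent might validate the status code, scan the content for an expected phrase, and record the response time, along the lines of the following C# sketch. The URL, expected phrase, and latency threshold are placeholders.

using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

public static class EndpointProbe
{
    private static readonly HttpClient client =
        new HttpClient { Timeout = TimeSpan.FromSeconds(10) };

    public static async Task<bool> CheckAsync()
    {
        var stopwatch = Stopwatch.StartNew();
        var response = await client.GetAsync("https://app.example.com/healthcheck"); // placeholder URL
        var content = await response.Content.ReadAsStringAsync();
        stopwatch.Stop();

        // Validate the response code, check the content, and measure the response time.
        bool healthy = response.IsSuccessStatusCode
            && content.Contains("Status: OK")                   // placeholder phrase
            && stopwatch.ElapsedMilliseconds < 2000;            // placeholder threshold

        Console.WriteLine($"Healthy: {healthy}, latency: {stopwatch.ElapsedMilliseconds} ms");
        return healthy;
    }
}
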

Issues and considerations


Consider the following points when deciding how to implement this pattern:
How to validate the response. For example, is just a single 200 (OK) status code sufficient to verify the
application is working correctly? While this provides the most basic measure of application availability, and is
the minimum implementation of this pattern, it provides little information about the operations, trends, and
possible upcoming issues in the application.

Make sure that the application correctly returns a 200 (OK) only when the target resource is found and
processed. In some scenarios, such as when using a master page to host the target web page, the server
sends back a 200 (OK) status code instead of a 404 (Not Found) code, even when the target content page
was not found.

The number of endpoints to expose for an application. One approach is to expose at least one endpoint for the
core services that the application uses and another for lower priority services, allowing different levels of
importance to be assigned to each monitoring result. Also consider exposing more endpoints, such as one for
each core service, for additional monitoring granularity. For example, a health verification check might check
the database, storage, and an external geocoding service that an application uses, with each requiring a
different level of uptime and response time. The application could still be healthy if the geocoding service, or
some other background task, is unavailable for a few minutes.
Whether to use the same endpoint for monitoring as is used for general access, but to a specific path designed
for health verification checks, for example, /HealthCheck/{GUID}/ on the general access endpoint. This allows
some functional tests in the application to be run by the monitoring tools, such as adding a new user
registration, signing in, and placing a test order, while also verifying that the general access endpoint is
available.
The type of information to collect in the service in response to monitoring requests, and how to return this
information. Most existing tools and frameworks look only at the HTTP status code that the endpoint returns. To
return and validate additional information, you might have to create a custom monitoring utility or service.
How much information to collect. Performing excessive processing during the check can overload the
application and impact other users. The time it takes might exceed the timeout of the monitoring system so it
marks the application as unavailable. Most applications include instrumentation such as error handlers and
performance counters that log performance and detailed error information; this might be sufficient instead of
returning additional information from a health verification check.
Caching the endpoint status. It could be expensive to run the health check too frequently. If the health status is
reported through a dashboard, for example, you don't want every request from the dashboard to trigger a
health check. Instead, periodically check the system health and cache the status. Expose an endpoint that
returns the cached status (a minimal caching sketch follows this list).
How to configure security for the monitoring endpoints to protect them from public access, which might
expose the application to malicious attacks, risk the exposure of sensitive information, or attract denial of
service (DoS) attacks. Typically this should be done in the application configuration so that it can be updated
easily without restarting the application. Consider using one or more of the following techniques:
Secure the endpoint by requiring authentication. You can do this by using an authentication security key
in the request header or by passing credentials with the request, provided that the monitoring service or
tool supports authentication.
Use an obscure or hidden endpoint. For example, expose the endpoint on a different IP address to
that used by the default application URL, configure the endpoint on a nonstandard HTTP port,
and/or use a complex path to the test page. You can usually specify additional endpoint addresses
and ports in the application configuration, and add entries for these endpoints to the DNS server
if required to avoid having to specify the IP address directly.
Expose a method on an endpoint that accepts a parameter such as a key value or an operation
mode value. Depending on the value supplied for this parameter, when a request is received the
code can perform a specific test or set of tests, or return a 404 (Not Found) error if the parameter
value isn't recognized. The recognized parameter values could be set in the application
configuration.

DoS attacks are likely to have less impact on a separate endpoint that performs basic
functional tests without compromising the operation of the application. Ideally, avoid using a
test that might expose sensitive information. If you must return information that might be
useful to an attacker, consider how you'll protect the endpoint and the data from
unauthorized access. In this case just relying on obscurity isn't enough. You should also
consider using an HTTPS connection and encrypting any sensitive data, although this will
increase the load on the server.
How to access an endpoint that's secured using authentication. Not all tools and frameworks can be
configured to include credentials with the health verification request. For example, Microsoft Azure built-
in health verification features can't provide authentication credentials. Some third-party alternatives are
Pingdom, Panopta, NewRelic, and Statuscake.
How to ensure that the monitoring agent is performing correctly. One approach is to expose an endpoint
that simply returns a value from the application configuration or a random value that can be used to test
the agent.

Also ensure that the monitoring system performs checks on itself, such as a self-test and built-in test,
to avoid it issuing false positive results.
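The following sketch shows one way to cache the endpoint status mentioned earlier: the health checks run at most once per configurable interval, and callers in between receive the cached result. The interval and the check delegate are assumptions for illustration; a real service would also guard against concurrent refreshes or refresh the cache from a background timer.

using System;
using System.Threading.Tasks;

public class CachedHealthStatus
{
    private readonly Func<Task<bool>> runChecks;   // delegate that runs the real health checks
    private readonly TimeSpan maxAge;
    private bool lastResult;
    private DateTimeOffset lastChecked = DateTimeOffset.MinValue;

    public CachedHealthStatus(Func<Task<bool>> runChecks, TimeSpan maxAge)
    {
        this.runChecks = runChecks;
        this.maxAge = maxAge;
    }

    // Returns the cached status unless it is older than maxAge, in which case
    // the checks are run again and the cache is refreshed.
    public async Task<bool> GetStatusAsync()
    {
        if (DateTimeOffset.UtcNow - lastChecked > maxAge)
        {
            lastResult = await runChecks();
            lastChecked = DateTimeOffset.UtcNow;
        }

        return lastResult;
    }
}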

When to use this pattern


This pattern is useful for:
Monitoring websites and web applications to verify availability.
Monitoring websites and web applications to check for correct operation.
Monitoring middle-tier or shared services to detect and isolate a failure that could disrupt other
applications.
Complementing existing instrumentation in the application, such as performance counters and error
handlers. Health verification checking doesn't replace the requirement for logging and auditing in the
application. Instrumentation can provide valuable information for an existing framework that monitors
counters and error logs to detect failures or other issues. However, it can't provide information if the
application is unavailable.

Example
The following code examples, taken from the HealthCheckController class (a sample that demonstrates this
pattern is available on GitHub), demonstrate exposing an endpoint for performing a range of health checks.
The CoreServices method, shown below in C#, performs a series of checks on services used in the application.
If all of the tests run without error, the method returns a 200 (OK) status code. If any of the tests raises an
exception, the method returns a 500 (Internal Error) status code. The method could optionally return additional
information when an error occurs, if the monitoring tool or framework is able to use it.
public ActionResult CoreServices()
{
    try
    {
        // Run a simple check to ensure the database is available.
        DataStore.Instance.CoreHealthCheck();

        // Run a simple check on our external service.
        MyExternalService.Instance.CoreHealthCheck();
    }
    catch (Exception ex)
    {
        Trace.TraceError("Exception in basic health check: {0}", ex.Message);

        // This can optionally return different status codes based on the exception.
        // Optionally it could return more details about the exception.
        // The additional information could be used by administrators who access the
        // endpoint with a browser, or using a ping utility that can display the
        // additional information.
        return new HttpStatusCodeResult((int)HttpStatusCode.InternalServerError);
    }
    return new HttpStatusCodeResult((int)HttpStatusCode.OK);
}

The ObscurePath method shows how you can read a path from the application configuration and use it as the
endpoint for tests. This example, in C#, also shows how you can accept an ID as a parameter and use it to check
for valid requests.

public ActionResult ObscurePath(string id)
{
    // The id could be used as a simple way to obscure or hide the endpoint.
    // The id to match could be retrieved from configuration and, if matched,
    // perform a specific set of tests and return the result. If not matched it
    // could return a 404 (Not Found) status.

    // The obscure path can be set through configuration to hide the endpoint.
    var hiddenPathKey = CloudConfigurationManager.GetSetting("Test.ObscurePath");

    // If the value passed does not match that in configuration, return 404 (Not Found).
    if (!string.Equals(id, hiddenPathKey))
    {
        return new HttpStatusCodeResult((int)HttpStatusCode.NotFound);
    }

    // Else continue and run the tests...

    // Return results from the core services test.
    return this.CoreServices();
}

The TestResponseFromConfig method shows how you can expose an endpoint that performs a check for a
specified configuration setting value.
public ActionResult TestResponseFromConfig()
{
    // Health check that returns a response code set in configuration for testing.
    var returnStatusCodeSetting = CloudConfigurationManager.GetSetting(
        "Test.ReturnStatusCode");

    int returnStatusCode;

    if (!int.TryParse(returnStatusCodeSetting, out returnStatusCode))
    {
        returnStatusCode = (int)HttpStatusCode.OK;
    }

    return new HttpStatusCodeResult(returnStatusCode);
}
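
For ASP.NET Core applications, the framework's built-in health checks middleware offers an alternative to hand-rolled controllers like those above. The following minimal sketch registers a custom check and exposes it on a /health endpoint; the check name, endpoint path, and check logic are placeholders.

using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Register a simple custom check; real checks would probe the database,
        // storage, or downstream services that the application depends on.
        services.AddHealthChecks()
            .AddCheck("core-services", () =>
                CoreServicesAreReachable()              // placeholder check
                    ? HealthCheckResult.Healthy()
                    : HealthCheckResult.Unhealthy());
    }

    public void Configure(IApplicationBuilder app)
    {
        app.UseRouting();

        // Expose the checks on a dedicated endpoint for monitoring tools to probe.
        app.UseEndpoints(endpoints => endpoints.MapHealthChecks("/health"));
    }

    private static bool CoreServicesAreReachable() => true;   // placeholder
}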

Monitoring endpoints in Azure hosted applications


Some options for monitoring endpoints in Azure applications are:
Use the built-in monitoring features of Azure.
Use a third-party service or a framework such as Microsoft System Center Operations Manager.
Create a custom utility or a service that runs on your own or on a hosted server.

Even though Azure provides a reasonably comprehensive set of monitoring options, you can use
additional services and tools to provide extra information. Azure Management Services provides a
built-in monitoring mechanism for alert rules. The alerts section of the management services page
in the Azure portal allows you to configure up to ten alert rules per subscription for your services.
These rules specify a condition and a threshold value for a service such as CPU load, or the number
of requests or errors per second, and the service can automatically send email notifications to
addresses you define in each rule.

The conditions you can monitor vary depending on the hosting mechanism you choose for your application
(such as Web Sites, Cloud Services, Virtual Machines, or Mobile Services), but all of these include the ability to
create an alert rule that uses a web endpoint you specify in the settings for your service. This endpoint should
respond in a timely way so that the alert system can detect that the application is operating correctly.

Read more information about creating alert notifications.

If you host your application in Azure Cloud Services web and worker roles or Virtual Machines, you can take
advantage of one of the built-in services in Azure called Traffic Manager. Traffic Manager is a routing and load-
balancing service that can distribute requests to specific instances of your Cloud Services hosted application
based on a range of rules and settings.
In addition to routing requests, Traffic Manager pings a URL, port, and relative path that you specify on a
regular basis to determine which instances of the application defined in its rules are active and are responding
to requests. If it detects a status code 200 (OK), it marks the application as available. Any other status code
causes Traffic Manager to mark the application as offline. You can view the status in the Traffic Manager
console, and configure the rule to reroute requests to other instances of the application that are responding.
However, Traffic Manager will only wait ten seconds to receive a response from the monitoring URL. Therefore,
you should ensure that your health verification code executes in this time, allowing for network latency for the
round trip from Traffic Manager to your application and back again.
Read more information about using Traffic Manager to monitor your applications. Traffic Manager is also
discussed in Multiple Datacenter Deployment Guidance.

Related guidance
The following guidance can be useful when implementing this pattern:
Instrumentation and Telemetry Guidance. Checking the health of services and components is typically done
by probing, but it's also useful to have information in place to monitor application performance and detect
events that occur at runtime. This data can be transmitted back to monitoring tools as additional
information for health monitoring. Instrumentation and Telemetry Guidance explores gathering remote
diagnostics information that's collected by instrumentation in applications.
Receiving alert notifications.
This pattern includes a downloadable sample application.
Index Table pattern

Create indexes over the fields in data stores that are frequently referenced by queries. This pattern can improve
query performance by allowing applications to more quickly locate the data to retrieve from a data store.

Context and problem


Many data stores organize the data for a collection of entities using the primary key. An application can use this
key to locate and retrieve data. The figure shows an example of a data store holding customer information,
organized by the primary key (Customer ID).

While the primary key is valuable for queries that fetch data based on the value of this key, an application might
not be able to use the primary key if it needs to retrieve data based on some other field. In the customers
example, an application can't use the Customer ID primary key to retrieve customers if it queries data solely by
referencing the value of some other attribute, such as the town in which the customer is located. To perform a
query such as this, the application might have to fetch and examine every customer record, which could be a slow
process.
Many relational database management systems support secondary indexes. A secondary index is a separate data
structure that's organized by one or more nonprimary (secondary) key fields, and it indicates where the data for
each indexed value is stored. The items in a secondary index are typically sorted by the value of the secondary
keys to enable fast lookup of data. These indexes are usually maintained automatically by the database
management system.
You can create as many secondary indexes as you need to support the different queries that your application
performs. For example, in a Customers table in a relational database where the Customer ID is the primary key,
it's beneficial to add a secondary index over the town field if the application frequently looks up customers by the
town where they reside.
However, although secondary indexes are common in relational systems, most NoSQL data stores used by cloud
applications don't provide an equivalent feature.

Solution
If the data store doesn't support secondary indexes, you can emulate them manually by creating your own index
tables. An index table organizes the data by a specified key. Three strategies are commonly used for structuring
an index table, depending on the number of secondary indexes that are required and the nature of the queries
that an application performs.
The first strategy is to duplicate the data in each index table but organize it by different keys (complete
denormalization). The next figure shows index tables that organize the same customer information by Town and
LastName.

This strategy is appropriate if the data is relatively static compared to the number of times it's queried using each
key. If the data is more dynamic, the processing overhead of maintaining each index table becomes too large for
this approach to be useful. Also, if the volume of data is very large, the amount of space required to store the
duplicate data is significant.
The second strategy is to create normalized index tables organized by different keys and reference the original
data by using the primary key rather than duplicating it, as shown in the following figure. The original data is
called a fact table.

This technique saves space and reduces the overhead of maintaining duplicate data. The disadvantage is that an
application has to perform two lookup operations to find data using a secondary key. It has to find the primary
key for the data in the index table, and then use the primary key to look up the data in the fact table.
The third strategy is to create partially normalized index tables organized by different keys that duplicate
frequently retrieved fields. Reference the fact table to access less frequently accessed fields. The next figure shows
how commonly accessed data is duplicated in each index table.
With this strategy, you can strike a balance between the first two approaches. The data for common queries can
be retrieved quickly by using a single lookup, while the space and maintenance overhead isn't as significant as
duplicating the entire data set.
If an application frequently queries data by specifying a combination of values (for example, “Find all customers
that live in Redmond and that have a last name of Smith”), you could implement the keys to the items in the index
table as a concatenation of the Town attribute and the LastName attribute. The next figure shows an index table
based on composite keys. The keys are sorted by Town, and then by LastName for records that have the same
value for Town.
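
A minimal sketch of such a composite key, assuming the fact table is keyed by Customer ID, might look like the following C#. The delimiter and property names are illustrative; choose a scheme that preserves the sort order your queries need.

public class CustomerIndexEntry
{
    public CustomerIndexEntry(string town, string lastName, string customerId)
    {
        // Composite secondary key: entries sort by Town, then by LastName.
        IndexKey = $"{town}:{lastName}";

        // Reference back to the fact table (or duplicate frequently read fields here).
        CustomerId = customerId;
    }

    public string IndexKey { get; }
    public string CustomerId { get; }
}

// Example: an index entry that would match the query
// "customers in Redmond with a last name of Smith".
// var entry = new CustomerIndexEntry("Redmond", "Smith", "customer-056");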

Index tables can speed up query operations over sharded data, and are especially useful where the shard key is
hashed. The next figure shows an example where the shard key is a hash of the Customer ID. The index table can
organize data by the nonhashed value (Town and LastName), and provide the hashed shard key as the lookup
data. This can save the application from repeatedly calculating hash keys (an expensive operation) if it needs to
retrieve data that falls within a range, or it needs to fetch data in order of the nonhashed key. For example, a
query such as “Find all customers that live in Redmond” can be quickly resolved by locating the matching items in
the index table, where they're all stored in a contiguous block. Then, follow the references to the customer data
using the shard keys stored in the index table.
Issues and considerations
Consider the following points when deciding how to implement this pattern:
The overhead of maintaining secondary indexes can be significant. You must analyze and understand the
queries that your application uses. Only create index tables when they're likely to be used regularly. Don't
create speculative index tables to support queries that an application doesn't perform, or performs only
occasionally.
Duplicating data in an index table can add significant overhead in storage costs and the effort required to
maintain multiple copies of data.
Implementing an index table as a normalized structure that references the original data requires an
application to perform two lookup operations to find data. The first operation searches the index table to
retrieve the primary key, and the second uses the primary key to fetch the data.
If a system incorporates a number of index tables over very large data sets, it can be difficult to maintain
consistency between index tables and the original data. It might be possible to design the application
around the eventual consistency model. For example, to insert, update, or delete data, an application could
post a message to a queue and let a separate task perform the operation and maintain the index tables
that reference this data asynchronously. For more information about implementing eventual consistency,
see the Data Consistency Primer.

Microsoft Azure storage tables support transactional updates for changes made to data held in the
same partition (referred to as entity group transactions). If you can store the data for a fact table and
one or more index tables in the same partition, you can use this feature to help ensure consistency.

Index tables might themselves be partitioned or sharded.

When to use this pattern


Use this pattern to improve query performance when an application frequently needs to retrieve data by using a
key other than the primary (or shard) key.
This pattern might not be useful when:
Data is volatile. An index table can become out of date very quickly, making it ineffective or making the
overhead of maintaining the index table greater than any savings made by using it.
A field selected as the secondary key for an index table is nondiscriminating and can only have a small set of
values (for example, gender).
The balance of the data values for a field selected as the secondary key for an index table are highly skewed.
For example, if 90% of the records contain the same value in a field, then creating and maintaining an index
table to look up data based on this field might create more overhead than scanning sequentially through the
data. However, if queries very frequently target values that lie in the remaining 10%, this index can be useful.
You should understand the queries that your application is performing, and how frequently they're performed.

Example
Azure storage tables provide a highly scalable key/value data store for applications running in the cloud.
Applications store and retrieve data values by specifying a key. The data values can contain multiple fields, but the
structure of a data item is opaque to table storage, which simply handles a data item as an array of bytes.
Azure storage tables also support sharding. The sharding key includes two elements, a partition key and a row
key. Items that have the same partition key are stored in the same partition (shard), and the items are stored in
row key order within a shard. Table storage is optimized for performing queries that fetch data falling within a
contiguous range of row key values within a partition. If you're building cloud applications that store information
in Azure tables, you should structure your data with this feature in mind.
For example, consider an application that stores information about movies. The application frequently queries
movies by genre (action, documentary, historical, comedy, drama, and so on). You could create an Azure table
with partitions for each genre by using the genre as the partition key, and specifying the movie name as the row
key, as shown in the next figure.

This approach is less effective if the application also needs to query movies by starring actor. In this case, you can
create a separate Azure table that acts as an index table. The partition key is the actor and the row key is the
movie name. The data for each actor will be stored in separate partitions. If a movie stars more than one actor, the
same movie will occur in multiple partitions.
You can duplicate the movie data in the values held by each partition by adopting the first approach described in
the Solution section above. However, it's likely that each movie will be replicated several times (once for each
actor), so it might be more efficient to partially denormalize the data to support the most common queries (such
as the names of the other actors) and enable an application to retrieve any remaining details by including the
partition key necessary to find the complete information in the genre partitions. This approach is described by
the third option in the Solution section. The next figure shows this approach.
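The following C# sketch illustrates the shape of such a partially denormalized index entry for the actor index table, with the partition and row keys described above plus duplicated fields and the genre partition key needed to reach the full record. The property names are illustrative only.

public class MovieByActorIndexEntry
{
    // Index table sharding key: one partition per actor, rows ordered by movie name.
    public string PartitionKey { get; set; }   // actor name
    public string RowKey { get; set; }         // movie name

    // Frequently read fields duplicated into the index (partial denormalization).
    public string OtherActors { get; set; }
    public int ReleaseYear { get; set; }

    // Key needed to look up the complete record in the genre-partitioned fact table.
    public string GenrePartitionKey { get; set; }
}
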
Related patterns and guidance
The following patterns and guidance might also be relevant when implementing this pattern:
Data Consistency Primer. An index table must be maintained as the data that it indexes changes. In the cloud, it
might not be possible or appropriate to perform operations that update an index as part of the same
transaction that modifies the data. In that case, an eventually consistent approach is more suitable. Provides
information on the issues surrounding eventual consistency.
Sharding pattern. The Index Table pattern is frequently used in conjunction with data partitioned by using
shards. The Sharding pattern provides more information on how to divide a data store into a set of shards.
Materialized View pattern. Instead of indexing data to support queries that summarize data, it might be more
appropriate to create a materialized view of the data. Describes how to support efficient summary queries by
generating prepopulated views over data.
Leader Election pattern
11/2/2020 • 11 minutes to read

Coordinate the actions performed by a collection of collaborating instances in a distributed application by
electing one instance as the leader that assumes responsibility for managing the others. This can help to ensure
that instances don't conflict with each other, cause contention for shared resources, or inadvertently interfere with
the work that other instances are performing.

Context and problem


A typical cloud application has many tasks acting in a coordinated manner. These tasks could all be instances
running the same code and requiring access to the same resources, or they might be working together in parallel
to perform the individual parts of a complex calculation.
The task instances might run separately for much of the time, but it might also be necessary to coordinate the
actions of each instance to ensure that they don’t conflict, cause contention for shared resources, or accidentally
interfere with the work that other task instances are performing.
For example:
In a cloud-based system that implements horizontal scaling, multiple instances of the same task could be
running at the same time with each instance serving a different user. If these instances write to a shared
resource, it's necessary to coordinate their actions to prevent each instance from overwriting the changes
made by the others.
If the tasks are performing individual elements of a complex calculation in parallel, the results need to be
aggregated when they all complete.
The task instances are all peers, so there isn't a natural leader that can act as the coordinator or aggregator.

Solution
A single task instance should be elected to act as the leader, and this instance should coordinate the actions of the
other subordinate task instances. If all of the task instances are running the same code, they are each capable of
acting as the leader. Therefore, the election process must be managed carefully to prevent two or more instances
taking over the leader role at the same time.
The system must provide a robust mechanism for selecting the leader. This method has to cope with events such
as network outages or process failures. In many solutions, the subordinate task instances monitor the leader
through some type of heartbeat method, or by polling. If the designated leader terminates unexpectedly, or a
network failure makes the leader unavailable to the subordinate task instances, it's necessary for them to elect a
new leader.
There are several strategies for electing a leader among a set of tasks in a distributed environment, including:
Selecting the task instance with the lowest-ranked instance or process ID (a minimal sketch of this approach appears after this list).
Racing to acquire a shared, distributed mutex. The first task instance that acquires the mutex is the leader.
However, the system must ensure that, if the leader terminates or becomes disconnected from the rest of the
system, the mutex is released to allow another task instance to become the leader.
Implementing one of the common leader election algorithms such as the Bully Algorithm or the Ring
Algorithm. These algorithms assume that each candidate in the election has a unique ID, and that it can
communicate with the other candidates reliably.
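As a minimal illustration of the first strategy, the sketch below elects the instance whose ID sorts lowest. It assumes that every instance can obtain the full, current list of participating instance IDs (for example, from a shared registry); that lookup is represented by a hypothetical GetCurrentInstanceIds method in the usage comment.

using System;
using System.Collections.Generic;
using System.Linq;

public static class LowestIdElection
{
    // Returns true if this instance should act as the leader.
    // The instance with the lowest ID (ordinal comparison) wins the election.
    public static bool IsLeader(string myInstanceId, IEnumerable<string> allInstanceIds)
    {
        var lowest = allInstanceIds
            .Concat(new[] { myInstanceId })
            .Distinct()
            .OrderBy(id => id, StringComparer.Ordinal)
            .First();

        return string.Equals(lowest, myInstanceId, StringComparison.Ordinal);
    }
}

// Usage (illustrative):
// bool isLeader = LowestIdElection.IsLeader(myId, GetCurrentInstanceIds());
// Re-run the election whenever the set of instances changes or the current leader becomes unreachable.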
Issues and considerations
Consider the following points when deciding how to implement this pattern:
The process of electing a leader should be resilient to transient and persistent failures.
It must be possible to detect when the leader has failed or has become otherwise unavailable (such as due to a
communications failure). How quickly detection is needed is system dependent. Some systems might be able
to function for a short time without a leader, during which a transient fault might be fixed. In other cases, it
might be necessary to detect leader failure immediately and trigger a new election.
In a system that implements horizontal autoscaling, the leader could be terminated if the system scales back
and shuts down some of the computing resources.
Using a shared, distributed mutex introduces a dependency on the external service that provides the mutex.
The service constitutes a single point of failure. If it becomes unavailable for any reason, the system won't be
able to elect a leader.
Using a single dedicated process as the leader is a straightforward approach. However, if the process fails,
there could be a significant delay while it's restarted. The resulting latency can affect the performance and
response times of other processes if they're waiting for the leader to coordinate an operation.
Implementing one of the leader election algorithms manually provides the greatest flexibility for tuning and
optimizing the code.

When to use this pattern


Use this pattern when the tasks in a distributed application, such as a cloud-hosted solution, need careful
coordination and there's no natural leader.

Avoid making the leader a bottleneck in the system. The purpose of the leader is to coordinate the work of the
subordinate tasks, and it doesn't necessarily have to participate in this work itself, although it should be able
to do so if it isn't elected as the leader.

This pattern might not be useful if:


There's a natural leader or dedicated process that can always act as the leader. For example, it might be
possible to implement a singleton process that coordinates the task instances. If this process fails or becomes
unhealthy, the system can shut it down and restart it.
The coordination between tasks can be achieved using a more lightweight method. For example, if several task
instances simply need coordinated access to a shared resource, a better solution is to use optimistic or
pessimistic locking to control access.
A third-party solution is more appropriate. For example, the Microsoft Azure HDInsight service (based on
Apache Hadoop) uses the services provided by Apache Zookeeper to coordinate the map and reduce tasks
that collect and summarize data.

Example
The DistributedMutex project in the LeaderElection solution (a sample that demonstrates this pattern is available
on GitHub) shows how to use a lease on an Azure Storage blob to provide a mechanism for implementing a
shared, distributed mutex. This mutex can be used to elect a leader among a group of role instances in an Azure
cloud service. The first role instance to acquire the lease is elected the leader, and remains the leader until it
releases the lease or isn't able to renew the lease. Other role instances can continue to monitor the blob lease in
case the leader is no longer available.

A blob lease is an exclusive write lock over a blob. A single blob can be the subject of only one lease at any
point in time. A role instance can request a lease over a specified blob, and it'll be granted the lease if no other
role instance holds a lease over the same blob. Otherwise the request will throw an exception.
To avoid a faulted role instance retaining the lease indefinitely, specify a lifetime for the lease. When this
expires, the lease becomes available. However, while a role instance holds the lease it can request that the
lease is renewed, and it'll be granted the lease for a further period of time. The role instance can continually
repeat this process if it wants to retain the lease. For more information on how to lease a blob, see Lease Blob
(REST API).
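For comparison, the following sketch shows the same acquire, renew, and release cycle using the current Azure.Storage.Blobs client library; the sample application itself uses an older storage SDK. The container and blob names are illustrative, and error handling is kept to a minimum.

using System;
using System.IO;
using System.Threading.Tasks;
using Azure;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Specialized;

public static class BlobLeaseElection
{
    public static async Task<bool> TryBecomeLeaderAsync(string connectionString)
    {
        var container = new BlobContainerClient(connectionString, "leases");
        await container.CreateIfNotExistsAsync();

        var blob = container.GetBlobClient("leader");
        if (!await blob.ExistsAsync())
        {
            await blob.UploadAsync(new MemoryStream());   // Create an empty blob to lease.
        }

        BlobLeaseClient lease = blob.GetBlobLeaseClient();
        try
        {
            // Acquire a time-limited lease; if another instance already holds it, this call fails.
            await lease.AcquireAsync(TimeSpan.FromSeconds(60));
        }
        catch (RequestFailedException)
        {
            return false;   // Another instance is the leader.
        }

        // This instance is now the leader. While it works, it should call RenewAsync
        // periodically, and call ReleaseAsync (or let the lease expire) when it's done.
        await lease.RenewAsync();
        await lease.ReleaseAsync();
        return true;
    }
}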

The BlobDistributedMutex class in the C# example below contains the RunTaskWhenMutexAcquired method that
enables a role instance to attempt to acquire a lease over a specified blob. The details of the blob (the name,
container, and storage account) are passed to the constructor in a BlobSettings object when the
BlobDistributedMutex object is created (this object is a simple struct that is included in the sample code). The
constructor also accepts a Task that references the code that the role instance should run if it successfully
acquires the lease over the blob and is elected the leader. Note that the code that handles the low-level details of
acquiring the lease is implemented in a separate helper class named BlobLeaseManager .

public class BlobDistributedMutex
{
...
private readonly BlobSettings blobSettings;
private readonly Func<CancellationToken, Task> taskToRunWhenLeaseAcquired;
...

public BlobDistributedMutex(BlobSettings blobSettings,
Func<CancellationToken, Task> taskToRunWhenLeaseAcquired)
{
this.blobSettings = blobSettings;
this.taskToRunWhenLeaseAcquired = taskToRunWhenLeaseAcquired;
}

public async Task RunTaskWhenMutexAcquired(CancellationToken token)
{
var leaseManager = new BlobLeaseManager(blobSettings);
await this.RunTaskWhenBlobLeaseAcquired(leaseManager, token);
}
...

The RunTaskWhenMutexAcquired method in the code sample above invokes the RunTaskWhenBlobLeaseAcquired
method shown in the following code sample to actually acquire the lease. The RunTaskWhenBlobLeaseAcquired
method runs asynchronously. If the lease is successfully acquired, the role instance has been elected the leader.
The purpose of the taskToRunWhenLeaseAcquired delegate is to perform the work that coordinates the other role
instances. If the lease isn't acquired, another role instance has been elected as the leader and the current role
instance remains a subordinate. Note that the TryAcquireLeaseOrWait method is a helper method that uses the
BlobLeaseManager object to acquire the lease.
private async Task RunTaskWhenBlobLeaseAcquired(
BlobLeaseManager leaseManager, CancellationToken token)
{
while (!token.IsCancellationRequested)
{
// Try to acquire the blob lease.
// Otherwise wait for a short time before trying again.
string leaseId = await this.TryAcquireLeaseOrWait(leaseManager, token);

if (!string.IsNullOrEmpty(leaseId))
{
// Create a new linked cancellation token source so that if either the
// original token is canceled or the lease can't be renewed, the
// leader task can be canceled.
using (var leaseCts =
CancellationTokenSource.CreateLinkedTokenSource(new[] { token }))
{
// Run the leader task.
var leaderTask = this.taskToRunWhenLeaseAcquired.Invoke(leaseCts.Token);
...
}
}
}
...
}

The task started by the leader also runs asynchronously. While this task is running, the
RunTaskWhenBlobLeaseAcquired method shown in the following code sample periodically attempts to renew the
lease. This helps to ensure that the role instance remains the leader. In the sample solution, the delay between
renewal requests is less than the time specified for the duration of the lease in order to prevent another role
instance from being elected the leader. If the renewal fails for any reason, the task is canceled.
If the lease fails to be renewed or the task is canceled (possibly as a result of the role instance shutting down), the
lease is released. At this point, this or another role instance might be elected as the leader. The code extract below
shows this part of the process.

private async Task RunTaskWhenBlobLeaseAcquired(
BlobLeaseManager leaseManager, CancellationToken token)
{
while (...)
{
...
if (...)
{
...
using (var leaseCts = ...)
{
...
// Keep renewing the lease in regular intervals.
// If the lease can't be renewed, then the task completes.
var renewLeaseTask =
this.KeepRenewingLease(leaseManager, leaseId, leaseCts.Token);

// When any task completes (either the leader task itself or when it
// couldn't renew the lease) then cancel the other task.
await CancelAllWhenAnyCompletes(leaderTask, renewLeaseTask, leaseCts);
}
}
}
}
...
}
The KeepRenewingLease method is another helper method that uses the BlobLeaseManager object to renew the
lease. The CancelAllWhenAnyCompletes method cancels the tasks specified as the first two parameters. The
following diagram illustrates using the BlobDistributedMutex class to elect a leader and run a task that
coordinates operations.

The following code example shows how to use the BlobDistributedMutex class in a worker role. This code
acquires a lease over a blob named MyLeaderCoordinatorTask in the leases container in development storage, and
specifies that the code defined in the MyLeaderCoordinatorTask method should run if the role instance is elected
the leader.
var settings = new BlobSettings(CloudStorageAccount.DevelopmentStorageAccount,
"leases", "MyLeaderCoordinatorTask");
var cts = new CancellationTokenSource();
var mutex = new BlobDistributedMutex(settings, MyLeaderCoordinatorTask);
mutex.RunTaskWhenMutexAcquired(cts.Token);
...

// Method that runs if the role instance is elected the leader
private static async Task MyLeaderCoordinatorTask(CancellationToken token)
{
...
}

Note the following points about the sample solution:


The blob is a potential single point of failure. If the blob service becomes unavailable, or is inaccessible, the
leader won't be able to renew the lease and no other role instance will be able to acquire the lease. In this case,
no role instance will be able to act as the leader. However, the blob service is designed to be resilient, so
complete failure of the blob service is considered to be extremely unlikely.
If the task being performed by the leader stalls, the leader might continue to renew the lease, preventing any
other role instance from acquiring the lease and taking over the leader role in order to coordinate tasks. In the
real world, the health of the leader should be checked at frequent intervals.
The election process is nondeterministic. You can't make any assumptions about which role instance will
acquire the blob lease and become the leader.
The blob used as the target of the blob lease shouldn't be used for any other purpose. If a role instance
attempts to store data in this blob, this data won't be accessible unless the role instance is the leader and holds
the blob lease.

Related patterns and guidance


The following guidance might also be relevant when implementing this pattern:
This pattern has a downloadable sample application.
Autoscaling Guidance. It's possible to start and stop instances of the task hosts as the load on the application
varies. Autoscaling can help to maintain throughput and performance during times of peak processing.
Compute Partitioning Guidance. This guidance describes how to allocate tasks to hosts in a cloud service in a
way that helps to minimize running costs while maintaining the scalability, performance, availability, and
security of the service.
The Task-based Asynchronous Pattern.
An example illustrating the Bully Algorithm.
An example illustrating the Ring Algorithm.
Apache Curator, a client library for Apache ZooKeeper.
The article Lease Blob (REST API) on MSDN.
Materialized View pattern
11/2/2020 • 7 minutes to read

Generate prepopulated views over the data in one or more data stores when the data isn't ideally formatted for
required query operations. This can help support efficient querying and data extraction, and improve application
performance.

Context and problem


When storing data, the priority for developers and data administrators is often focused on how the data is stored,
as opposed to how it's read. The chosen storage format is usually closely related to the format of the data,
requirements for managing data size and data integrity, and the kind of store in use. For example, when using a
NoSQL document store, the data is often represented as a series of aggregates, each containing all of the
information for that entity.
However, this can have a negative effect on queries. When a query only needs a subset of the data from some
entities, such as a summary of orders for several customers without all of the order details, it must extract all of
the data for the relevant entities in order to obtain the required information.

Solution
To support efficient querying, a common solution is to generate, in advance, a view that materializes the data in a
format suited to the required results set. The Materialized View pattern describes generating prepopulated views
of data in environments where the source data isn't in a suitable format for querying, where generating a
suitable query is difficult, or where query performance is poor due to the nature of the data or the data store.
These materialized views, which only contain data required by a query, allow applications to quickly obtain the
information they need. In addition to joining tables or combining data entities, materialized views can include the
current values of calculated columns or data items, the results of combining values or executing transformations
on the data items, and values specified as part of the query. A materialized view can even be optimized for just a
single query.
A key point is that a materialized view and the data it contains are completely disposable because they can be
entirely rebuilt from the source data stores. A materialized view is never updated directly by an application, and
so it's a specialized cache.
When the source data for the view changes, the view must be updated to include the new information. You can
schedule this to happen automatically, or when the system detects a change to the original data. In some cases it
might be necessary to regenerate the view manually. The figure shows an example of how the Materialized View
pattern might be used.
Issues and considerations
Consider the following points when deciding how to implement this pattern:
How and when the view will be updated. Ideally it'll regenerate in response to an event indicating a change to the
source data, although this can lead to excessive overhead if the source data changes rapidly. Alternatively,
consider using a scheduled task, an external trigger, or a manual action to regenerate the view.
In some systems, such as when using the Event Sourcing pattern to maintain a store of only the events that
modified the data, materialized views are necessary. Prepopulating views by examining all events to determine
the current state might be the only way to obtain information from the event store. If you're not using Event
Sourcing, you need to consider whether a materialized view is helpful or not. Materialized views tend to be
specifically tailored to one, or a small number of queries. If many queries are used, materialized views can result
in unacceptable storage capacity requirements and storage cost.
Consider the impact on data consistency when generating the view, and when updating the view if this occurs on
a schedule. If the source data is changing at the point when the view is generated, the copy of the data in the
view won't be fully consistent with the original data.
Consider where you'll store the view. The view doesn't have to be located in the same store or partition as the
original data. It can be a subset from a few different partitions combined.
A view can be rebuilt if lost. Because of that, if the view is transient and is only used to improve query
performance by reflecting the current state of the data, or to improve scalability, it can be stored in a cache or in a
less reliable location.
When defining a materialized view, maximize its value by adding data items or columns to it based on
computation or transformation of existing data items, on values passed in the query, or on combinations of these
values when appropriate.
Where the storage mechanism supports it, consider indexing the materialized view to further increase
performance. Most relational databases support indexing for views, as do big data solutions based on Apache
Hadoop.

When to use this pattern


This pattern is useful when:
Creating materialized views over data that's difficult to query directly, or where queries must be very complex
to extract data that's stored in a normalized, semi-structured, or unstructured way.
Creating temporary views that can dramatically improve query performance, or can act directly as source
views or data transfer objects for the UI, for reporting, or for display.
Supporting occasionally connected or disconnected scenarios where connection to the data store isn't always
available. The view can be cached locally in this case.
Simplifying queries and exposing data for experimentation in a way that doesn't require knowledge of the
source data format. For example, by joining different tables in one or more databases, or one or more
domains in NoSQL stores, and then formatting the data to fit its eventual use.
Providing access to specific subsets of the source data that, for security or privacy reasons, shouldn't be
generally accessible, open to modification, or fully exposed to users.
Bridging different data stores, to take advantage of their individual capabilities. For example, using a cloud
store that's efficient for writing as the reference data store, and a relational database that offers good query
and read performance to hold the materialized views.
This pattern isn't useful in the following situations:
The source data is simple and easy to query.
The source data changes very quickly, or can be accessed without using a view. In these cases, you should
avoid the processing overhead of creating views.
Consistency is a high priority. The views might not always be fully consistent with the original data.
When using microservices. Microservices typically have well-defined boundaries aligning to domain-driven
design (DDD).

Example
The following figure shows an example of using the Materialized View pattern to generate a summary of sales.
Data in the Order, OrderItem, and Customer tables in separate partitions in an Azure storage account are
combined to generate a view containing the total sales value for each product in the Electronics category, along
with a count of the number of customers who made purchases of each item.

Creating this materialized view requires complex queries. However, by exposing the query result as a
materialized view, users can easily obtain the results and use them directly or incorporate them in another query.
The view is likely to be used in a reporting system or dashboard, and can be updated on a scheduled basis such
as weekly.

Although this example uses Azure table storage, many relational database management systems also
provide native support for materialized views.
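As an illustration of how such a view might be regenerated in application code, the sketch below groups order items by product and writes a summary row per product. The entity names and the IOrderItemStore and ISummaryViewStore abstractions are hypothetical; they stand in for whatever order store and view store the solution actually uses.

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public class OrderItem
{
    public string ProductId { get; set; }
    public string Category { get; set; }
    public string CustomerId { get; set; }
    public decimal LineTotal { get; set; }
}

// One row of the materialized view: total sales and customer count per product.
public class ProductSalesSummary
{
    public string ProductId { get; set; }
    public decimal TotalSales { get; set; }
    public int CustomerCount { get; set; }
}

// Assumed abstractions over the order partitions and the store holding the view.
public interface IOrderItemStore
{
    Task<IEnumerable<OrderItem>> GetItemsByCategoryAsync(string category);
}

public interface ISummaryViewStore
{
    Task ReplaceAllAsync(IEnumerable<ProductSalesSummary> summaries);
}

public class SalesSummaryBuilder
{
    // Because the view is disposable, it can simply be rebuilt in full whenever
    // a schedule or a change event requires it.
    public async Task RebuildAsync(IOrderItemStore source, ISummaryViewStore view)
    {
        IEnumerable<OrderItem> items = await source.GetItemsByCategoryAsync("Electronics");

        var summaries = items
            .GroupBy(i => i.ProductId)
            .Select(g => new ProductSalesSummary
            {
                ProductId = g.Key,
                TotalSales = g.Sum(i => i.LineTotal),
                CustomerCount = g.Select(i => i.CustomerId).Distinct().Count()
            });

        await view.ReplaceAllAsync(summaries);
    }
}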

Related patterns and guidance


The following patterns and guidance might also be relevant when implementing this pattern:
Data Consistency Primer. The summary information in a materialized view has to be maintained so that it
reflects the underlying data values. As the data values change, it might not be practical to update the
summary data in real time, and instead you'll have to adopt an eventually consistent approach. Summarizes
the issues surrounding maintaining consistency over distributed data, and describes the benefits and
tradeoffs of different consistency models.
Command and Query Responsibility Segregation (CQRS) pattern. Use to update the information in a
materialized view by responding to events that occur when the underlying data values change.
Event Sourcing pattern. Use in conjunction with the CQRS pattern to maintain the information in a
materialized view. When the data values a materialized view is based on are changed, the system can raise
events that describe these changes and save them in an event store.
Index Table pattern. The data in a materialized view is typically organized by a primary key, but queries might
need to retrieve information from this view by examining data in other fields. Use to create secondary indexes
over data sets for data stores that don't support native secondary indexes.
Pipes and Filters pattern
11/2/2020 • 12 minutes to read

Decompose a task that performs complex processing into a series of separate elements that can be reused. This
can improve performance, scalability, and reusability by allowing task elements that perform the processing to be
deployed and scaled independently.

Context and problem


An application is required to perform a variety of tasks of varying complexity on the information that it processes.
A straightforward but inflexible approach to implementing an application is to perform this processing as a
monolithic module. However, this approach is likely to reduce the opportunities for refactoring the code,
optimizing it, or reusing it if parts of the same processing are required elsewhere within the application.
The figure illustrates the issues with processing data using the monolithic approach. An application receives and
processes data from two sources. The data from each source is processed by a separate module that performs a
series of tasks to transform this data, before passing the result to the business logic of the application.

Some of the tasks that the monolithic modules perform are functionally very similar, but the modules have been
designed separately. The code that implements the tasks is closely coupled in a module, and has been developed
with little or no thought given to reuse or scalability.
However, the processing tasks performed by each module, or the deployment requirements for each task, could
change as business requirements are updated. Some tasks might be compute intensive and could benefit from
running on powerful hardware, while others might not require such expensive resources. Also, additional
processing might be required in the future, or the order in which the tasks are performed by the processing could
change. A solution is required that addresses these issues and increases the possibilities for code reuse.

Solution
Break down the processing required for each stream into a set of separate components (or filters), each
performing a single task. By standardizing the format of the data that each component receives and sends, these
filters can be combined together into a pipeline. This helps to avoid duplicating code, and makes it easy to
remove, replace, or integrate additional components if the processing requirements change. The next figure
shows a solution implemented using pipes and filters.

The time it takes to process a single request depends on the speed of the slowest filter in the pipeline. One or
more filters could be a bottleneck, especially if a large number of requests appear in a stream from a particular
data source. A key advantage of the pipeline structure is that it provides opportunities for running parallel
instances of slow filters, enabling the system to spread the load and improve throughput.
The filters that make up a pipeline can run on different machines, enabling them to be scaled independently and
take advantage of the elasticity that many cloud environments provide. A filter that is computationally intensive
can run on high-performance hardware, while other less demanding filters can be hosted on less expensive
commodity hardware. The filters don't even have to be in the same datacenter or geographic location, which
allows each element in a pipeline to run in an environment close to the resources it requires. The next figure
shows an example applied to the pipeline for the data from Source 1.

If the input and output of a filter are structured as a stream, it's possible to perform the processing for each filter
in parallel. The first filter in the pipeline can start its work and output its results, which are passed directly on to
the next filter in the sequence before the first filter has completed its work.
Another benefit is the resiliency that this model can provide. If a filter fails or the machine it's running on is no
longer available, the pipeline can reschedule the work that the filter was performing and direct this work to
another instance of the component. Failure of a single filter doesn't necessarily result in failure of the entire
pipeline.
Using the Pipes and Filters pattern in conjunction with the Compensating Transaction pattern is an alternative
approach to implementing distributed transactions. A distributed transaction can be broken down into separate,
compensable tasks, each of which can be implemented by using a filter that also implements the Compensating
Transaction pattern. The filters in a pipeline can be implemented as separate hosted tasks running close to the
data that they maintain.
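At its core, the pattern is a chain of single-purpose transformations over a standardized message format. The following in-process sketch illustrates the idea before the distributed, queue-based example later in this article; the filter steps shown are illustrative.

using System;
using System.Linq;

public static class PipelineDemo
{
    // Each filter accepts a message in the standard format and returns a transformed message.
    public static Func<string, string> BuildPipeline(params Func<string, string>[] filters)
    {
        return input => filters.Aggregate(input, (current, filter) => filter(current));
    }

    public static void Main()
    {
        // Filters can be reordered, removed, or replaced without touching the others.
        var pipeline = BuildPipeline(
            msg => msg.Trim(),                          // normalize
            msg => msg.ToUpperInvariant(),              // transform
            msg => $"[{DateTime.UtcNow:o}] {msg}");     // enrich

        Console.WriteLine(pipeline("  hello pipes and filters  "));
    }
}

In a cloud deployment, the direct function calls are replaced by messages flowing through queues so that each filter can be hosted, scaled, and retried independently, as shown in the Example section below.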

Issues and considerations


You should consider the following points when deciding how to implement this pattern:
Complexity. The increased flexibility that this pattern provides can also introduce complexity, especially if
the filters in a pipeline are distributed across different servers.
Reliability. Use an infrastructure that ensures that data flowing between filters in a pipeline won't be lost.
Idempotency. If a filter in a pipeline fails after receiving a message and the work is rescheduled to another
instance of the filter, part of the work might have already been completed. If this work updates some aspect
of the global state (such as information stored in a database), the same update could be repeated. A similar
issue might occur if a filter fails after posting its results to the next filter in the pipeline, but before
indicating that it's completed its work successfully. In these cases, the same work could be repeated by
another instance of the filter, causing the same results to be posted twice. This could result in subsequent
filters in the pipeline processing the same data twice. Therefore filters in a pipeline should be designed to
be idempotent. For more information, see Idempotency Patterns on Jonathan Oliver’s blog. A minimal sketch of
duplicate detection appears after this list.
Repeated messages. If a filter in a pipeline fails after posting a message to the next stage of the pipeline,
another instance of the filter might be run, and it'll post a copy of the same message to the pipeline. This
could cause two instances of the same message to be passed to the next filter. To avoid this, the pipeline
should detect and eliminate duplicate messages.

If you're implementing the pipeline by using message queues (such as Microsoft Azure Service Bus
queues), the message queuing infrastructure might provide automatic duplicate message detection
and removal.

Context and state. In a pipeline, each filter essentially runs in isolation and shouldn't make any
assumptions about how it was invoked. This means that each filter should be provided with sufficient
context to perform its work. This context could include a large amount of state information.
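As a minimal sketch of the idempotency and repeated-message considerations above, the following filter records the IDs of messages it has already processed and skips repeats. The IProcessedMessageStore abstraction is an assumption; in practice it would be backed by a durable store, or replaced entirely by the duplicate detection built into the messaging infrastructure.

using System.Threading.Tasks;

public interface IProcessedMessageStore
{
    // Returns true if the ID was newly recorded, or false if it was already present.
    Task<bool> TryRecordAsync(string messageId);
}

public class IdempotentFilter
{
    private readonly IProcessedMessageStore processed;

    public IdempotentFilter(IProcessedMessageStore processed)
    {
        this.processed = processed;
    }

    public async Task HandleAsync(string messageId, string body)
    {
        // If the same message is delivered twice, the second delivery is ignored,
        // so the work (and its side effects) isn't repeated.
        if (!await this.processed.TryRecordAsync(messageId))
        {
            return;
        }

        await ProcessAsync(body);
    }

    private Task ProcessAsync(string body)
    {
        // The actual filter work goes here.
        return Task.CompletedTask;
    }
}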

When to use this pattern


Use this pattern when:
The processing required by an application can easily be broken down into a set of independent steps.
The processing steps performed by an application have different scalability requirements.

It's possible to group filters that should scale together in the same process. For more information, see
the Compute Resource Consolidation pattern.

Flexibility is required to allow reordering of the processing steps performed by an application, or the
capability to add and remove steps.
The system can benefit from distributing the processing for steps across different servers.
A reliable solution is required that minimizes the effects of failure in a step while data is being processed.
This pattern might not be useful when:
The processing steps performed by an application aren't independent, or they have to be performed
together as part of the same transaction.
The amount of context or state information required by a step makes this approach inefficient. It might be
possible to persist state information to a database instead, but don't use this strategy if the additional load
on the database causes excessive contention.

Example
You can use a sequence of message queues to provide the infrastructure required to implement a pipeline. An
initial message queue receives unprocessed messages. A component implemented as a filter task listens for a
message on this queue, performs its work, and then posts the transformed message to the next queue in the
sequence. Another filter task can listen for messages on this queue, process them, post the results to another
queue, and so on until the fully transformed data appears in the final message in the queue. The next figure
illustrates implementing a pipeline using message queues.

If you're building a solution on Azure you can use Service Bus queues to provide a reliable and scalable queuing
mechanism. The ServiceBusPipeFilter class shown below in C# demonstrates how you can implement a filter
that receives input messages from a queue, processes these messages, and posts the results to another queue.

The ServiceBusPipeFilter class is defined in the PipesAndFilters.Shared project available from GitHub.

public class ServiceBusPipeFilter
{
...
private readonly string inQueuePath;
private readonly string outQueuePath;
...
private QueueClient inQueue;
private QueueClient outQueue;
...

public ServiceBusPipeFilter(..., string inQueuePath, string outQueuePath = null)
{
...
this.inQueuePath = inQueuePath;
this.outQueuePath = outQueuePath;
}

public void Start()
{
...
// Create the outbound filter queue if it doesn't exist.
...
this.outQueue = QueueClient.CreateFromConnectionString(...);

...
// Create the inbound and outbound queue clients.
this.inQueue = QueueClient.CreateFromConnectionString(...);
}
public void OnPipeFilterMessageAsync(
Func<BrokeredMessage, Task<BrokeredMessage>> asyncFilterTask, ...)
{
...

this.inQueue.OnMessageAsync(
async (msg) =>
{
...
// Process the filter and send the output to the
// next queue in the pipeline.
var outMessage = await asyncFilterTask(msg);

// Send the message from the filter processor
// to the next queue in the pipeline.
if (outQueue != null)
{
await outQueue.SendAsync(outMessage);
}

// Note: There's a chance that the same message could be sent twice
// or that a message gets processed by an upstream or downstream
// filter at the same time.
// This would happen in a situation where processing of a message was
// completed, it was sent to the next pipe/queue, and then failed
// to complete when using the PeekLock method.
// Idempotent message processing and concurrency should be considered
// in a real-world implementation.
},
options);
}

public async Task Close(TimeSpan timespan)
{
// Pause the processing threads.
this.pauseProcessingEvent.Reset();

// There's no clean approach for waiting for the threads to complete
// the processing. This example simply stops any new processing, waits
// for the existing thread to complete, then closes the message pump
// and finally returns.
Thread.Sleep(timespan);

this.inQueue.Close();
...
}

...
}

The Start method in the ServiceBusPipeFilter class connects to a pair of input and output queues, and the
Close method disconnects from the input queue. The OnPipeFilterMessageAsync method performs the actual
processing of messages; the asyncFilterTask parameter to this method specifies the processing to be performed.
The OnPipeFilterMessageAsync method waits for incoming messages on the input queue, runs the code specified
by the asyncFilterTask parameter over each message as it arrives, and posts the results to the output queue. The
queues themselves are specified by the constructor.
The sample solution implements filters in a set of worker roles. Each worker role can be scaled independently,
depending on the complexity of the business processing that it performs or the resources required for processing.
Additionally, multiple instances of each worker role can be run in parallel to improve throughput.
The following code shows an Azure worker role named PipeFilterARoleEntry , defined in the PipeFilterA project in
the sample solution.
public class PipeFilterARoleEntry : RoleEntryPoint
{
...
private ServiceBusPipeFilter pipeFilterA;

public override bool OnStart()
{
...
this.pipeFilterA = new ServiceBusPipeFilter(
...,
Constants.QueueAPath,
Constants.QueueBPath);

this.pipeFilterA.Start();
...
}

public override void Run()
{
this.pipeFilterA.OnPipeFilterMessageAsync(async (msg) =>
{
// Clone the message and update it.
// Properties set by the broker (Deliver count, enqueue time, ...)
// aren't cloned and must be copied over if required.
var newMsg = msg.Clone();

await Task.Delay(500); // DOING WORK

Trace.TraceInformation("Filter A processed message:{0} at {1}",


msg.MessageId, DateTime.UtcNow);

newMsg.Properties.Add(Constants.FilterAMessageKey, "Complete");

return newMsg;
});

...
}

...
}

This role contains a ServiceBusPipeFilter object. The OnStart method in the role connects to the queues for
receiving input messages and posting output messages (the names of the queues are defined in the Constants
class). The Run method invokes the OnPipeFilterMessageAsync method to perform some processing on each
message that's received (in this example, the processing is simulated by waiting for a short period of time). When
processing is complete, a new message is constructed containing the results (in this case, the input message has a
custom property added), and this message is posted to the output queue.
The sample code contains another worker role named PipeFilterBRoleEntry in the PipeFilterB project. This role is
similar to PipeFilterARoleEntry except that it performs different processing in the Run method. In the example
solution, these two roles are combined to construct a pipeline; the output queue for the PipeFilterARoleEntry
role is the input queue for the PipeFilterBRoleEntry role.
The sample solution also provides two additional roles named InitialSenderRoleEntry (in the InitialSender
project) and FinalReceiverRoleEntry (in the FinalReceiver project). The InitialSenderRoleEntry role provides the
initial message in the pipeline. The OnStart method connects to a single queue and the Run method posts a
message to this queue. This queue is the input queue used by the PipeFilterARoleEntry role, so posting a message
to it causes the message to be received and processed by the PipeFilterARoleEntry role. The processed message
then passes through the PipeFilterBRoleEntry role.
The input queue for the FinalReceiverRoleEntry role is the output queue for the PipeFilterBRoleEntry role. The
Run method in the FinalReceiverRoleEntry role, shown below, receives the message and performs some final
processing. Then it writes the values of the custom properties added by the filters in the pipeline to the trace
output.

public class FinalReceiverRoleEntry : RoleEntryPoint
{
...
// Final queue/pipe in the pipeline to process data from.
private ServiceBusPipeFilter queueFinal;

public override bool OnStart()
{
...
// Set up the queue.
this.queueFinal = new ServiceBusPipeFilter(...,Constants.QueueFinalPath);
this.queueFinal.Start();
...
}

public override void Run()
{
this.queueFinal.OnPipeFilterMessageAsync(
async (msg) =>
{
await Task.Delay(500); // DOING WORK

// The pipeline message was received.
Trace.TraceInformation(
"Pipeline Message Complete - FilterA:{0} FilterB:{1}",
msg.Properties[Constants.FilterAMessageKey],
msg.Properties[Constants.FilterBMessageKey]);

return null;
});
...
}

...
}

Related patterns and guidance


The following patterns and guidance might also be relevant when implementing this pattern:
A sample that demonstrates this pattern is available on GitHub.
Competing Consumers pattern. A pipeline can contain multiple instances of one or more filters. This approach
is useful for running parallel instances of slow filters, enabling the system to spread the load and improve
throughput. Each instance of a filter will compete for input with the other instances, two instances of a filter
shouldn't be able to process the same data. Provides an explanation of this approach.
Compute Resource Consolidation pattern. It might be possible to group filters that should scale together into
the same process. Provides more information about the benefits and tradeoffs of this strategy.
Compensating Transaction pattern. A filter can be implemented as an operation that can be reversed, or that
has a compensating operation that restores the state to a previous version in the event of a failure. Explains
how this can be implemented to maintain or achieve eventual consistency.
Idempotency Patterns on Jonathan Oliver’s blog.
Priority Queue pattern
11/2/2020 • 9 minutes to read

Prioritize requests sent to services so that requests with a higher priority are received and processed more
quickly than those with a lower priority. This pattern is useful in applications that offer different service level
guarantees to individual clients.

Context and problem


Applications can delegate specific tasks to other services, for example, to perform background processing or to
integrate with other applications or services. In the cloud, a message queue is typically used to delegate tasks to
background processing. In many cases the order in which requests are received by a service isn't important. In some
cases, though, it's necessary to prioritize specific requests. These requests should be processed earlier than lower
priority requests that were sent previously by the application.

Solution
A queue is usually a first-in, first-out (FIFO) structure, and consumers typically receive messages in the same
order that they were posted to the queue. However, some message queues support priority messaging. The
application posting a message can assign a priority and the messages in the queue are automatically reordered
so that those with a higher priority will be received before those with a lower priority. The figure illustrates a
queue with priority messaging.

Most message queue implementations support multiple consumers (following the Competing Consumers
pattern), and the number of consumer processes can be scaled up or down depending on demand.

In systems that don't support priority-based message queues, an alternative solution is to maintain a separate
queue for each priority. The application is responsible for posting messages to the appropriate queue. Each
queue can have a separate pool of consumers. Higher priority queues can have a larger pool of consumers
running on faster hardware than lower priority queues. The next figure illustrates using separate message
queues for each priority.
A variation on this strategy is to have a single pool of consumers that check for messages on high priority
queues first, and only then start to fetch messages from lower priority queues. There are some semantic
differences between a solution that uses a single pool of consumer processes (either with a single queue that
supports messages with different priorities or with multiple queues that each handle messages of a single
priority), and a solution that uses multiple queues with a separate pool for each queue.
In the single pool approach, higher priority messages are always received and processed before lower priority
messages. In theory, messages that have a very low priority could be continually superseded and might never be
processed. In the multiple pool approach, lower priority messages will always be processed, just not as quickly as
those of a higher priority (depending on the relative size of the pools and the resources that they have available).
Using a priority queuing mechanism can provide the following advantages:
It allows applications to meet business requirements that require prioritization of availability or
performance, such as offering different levels of service to specific groups of customers.
It can help to minimize operational costs. In the single queue approach, you can scale back the number of
consumers if necessary. High priority messages will still be processed first (although possibly more
slowly), and lower priority messages might be delayed for longer. If you've implemented the multiple
message queue approach with separate pools of consumers for each queue, you can reduce the pool of
consumers for lower priority queues, or even suspend processing for some very low priority queues by
stopping all the consumers that listen for messages on those queues.
The multiple message queue approach can help maximize application performance and scalability by
partitioning messages based on processing requirements. For example, vital tasks can be prioritized to be
handled by receivers that run immediately while less important background tasks can be handled by
receivers that are scheduled to run during less busy periods.

Issues and considerations


Consider the following points when deciding how to implement this pattern:
Define the priorities in the context of the solution. For example, high priority could mean that messages should
be processed within ten seconds. Identify the requirements for handling high priority items, and the other
resources that should be allocated to meet these criteria.
Decide if all high priority items must be processed before any lower priority items. If the messages are being
processed by a single pool of consumers, you have to provide a mechanism that can preempt and suspend a task
that's handling a low priority message if a higher priority message becomes available.
In the multiple queue approach, when using a single pool of consumer processes that listen on all queues rather
than a dedicated consumer pool for each queue, the consumer must apply an algorithm that ensures it always
services messages from higher priority queues before those from lower priority queues.
Monitor the processing speed on high and low priority queues to ensure that messages in these queues are
processed at the expected rates.
If you need to guarantee that low priority messages will be processed, it's necessary to implement the multiple
message queue approach with multiple pools of consumers. Alternatively, in a queue that supports message
prioritization, it's possible to dynamically increase the priority of a queued message as it ages. However, this
approach depends on the message queue providing this feature.
Using a separate queue for each message priority works best for systems that have a small number of well-defined priorities.
Message priorities can be determined logically by the system. For example, rather than having explicit high and
low priority messages, they could be designated as “fee paying customer,” or “non-fee paying customer.”
Depending on your business model, your system can allocate more resources to processing messages from fee
paying customers than non-fee paying ones.
There might be a financial and processing cost associated with checking a queue for a message (some
commercial messaging systems charge a small fee each time a message is posted or retrieved, and each time a
queue is queried for messages). This cost increases when checking multiple queues.
It's possible to dynamically adjust the size of a pool of consumers based on the length of the queue that the pool
is servicing. For more information, see the Autoscaling Guidance.

When to use this pattern


This pattern is useful in scenarios where:
The system must handle multiple tasks that have different priorities.
Different users or tenants should be served with different priority.

Example
Microsoft Azure doesn't provide a queuing mechanism that natively supports automatic prioritization of
messages through sorting. However, it does provide Azure Service Bus topics and subscriptions that support a
queuing mechanism that provides message filtering, together with a wide range of flexible capabilities that make
it ideal for use in most priority queue implementations.
An Azure solution can implement a Service Bus topic that an application can post messages to, in the same way as a
queue. Messages can contain metadata in the form of application-defined custom properties. Service Bus
subscriptions can be associated with the topic, and these subscriptions can filter messages based on their
properties. When an application sends a message to a topic, the message is directed to the appropriate
subscription where it can be read by a consumer. Consumer processes can retrieve messages from a subscription
using the same semantics as a message queue (a subscription is a logical queue). The following figure illustrates
implementing a priority queue with Azure Service Bus topics and subscriptions.
In the figure above, the application creates several messages and assigns a custom property called Priority in
each message with a value, either High or Low . The application posts these messages to a topic. The topic has
two associated subscriptions that both filter messages by examining the Priority property. One subscription
accepts messages where the Priority property is set to High , and the other accepts messages where the
Priority property is set to Low . A pool of consumers reads messages from each subscription. The high priority
subscription has a larger pool, and these consumers might be running on more powerful computers with more
resources available than the consumers in the low priority pool.
Note that there's nothing special about the designation of high and low priority messages in this example.
They're simply labels specified as properties in each message, and are used to direct messages to a specific
subscription. If additional priorities are required, it's relatively easy to create further subscriptions and pools of
consumer processes to handle these priorities.
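For reference, the following sketch shows the same idea using the newer Azure.Messaging.ServiceBus library instead of the worker-role sample described next. The topic, subscription, and rule names are illustrative, it assumes the topic and its two subscriptions already exist (with their default catch-all rules removed), and error handling is omitted.

using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;
using Azure.Messaging.ServiceBus.Administration;

public static class PriorityTopicSketch
{
    public static async Task ConfigureAndSendAsync(string connectionString)
    {
        // Route messages to subscriptions based on the Priority application property.
        var admin = new ServiceBusAdministrationClient(connectionString);
        await admin.CreateRuleAsync("orders", "HighPriority",
            new CreateRuleOptions("HighOnly", new SqlRuleFilter("Priority = 'High'")));
        await admin.CreateRuleAsync("orders", "LowPriority",
            new CreateRuleOptions("LowOnly", new SqlRuleFilter("Priority = 'Low'")));

        // Post a message to the topic and tag it with a priority.
        await using var client = new ServiceBusClient(connectionString);
        ServiceBusSender sender = client.CreateSender("orders");

        var message = new ServiceBusMessage("order payload");
        message.ApplicationProperties["Priority"] = "High";
        await sender.SendMessageAsync(message);
    }
}

A larger pool of receivers (or faster hardware) can then be attached to the HighPriority subscription, exactly as the figure describes.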
The PriorityQueue solution available on GitHub contains an implementation of this approach. This solution
contains two worker role projects named PriorityQueue.High and PriorityQueue.Low . These worker roles inherit
from the PriorityWorkerRole class that contains the functionality for connecting to a specified subscription in the
OnStart method.

The PriorityQueue.High and PriorityQueue.Low worker roles connect to different subscriptions, defined by their
configuration settings. An administrator can configure different numbers of each role to be run. Typically there'll
be more instances of the PriorityQueue.High worker role than the PriorityQueue.Low worker role.
The Run method in the PriorityWorkerRole class arranges for the virtual ProcessMessage method (also defined
in the PriorityWorkerRole class) to be run for each message received on the queue. The following code shows
the Run and ProcessMessage methods. The QueueManager class, defined in the PriorityQueue.Shared project,
provides helper methods for using Azure Service Bus queues.
public class PriorityWorkerRole : RoleEntryPoint
{
private QueueManager queueManager;
...

public override void Run()
{
// Start listening for messages on the subscription.
var subscriptionName = CloudConfigurationManager.GetSetting("SubscriptionName");
this.queueManager.ReceiveMessages(subscriptionName, this.ProcessMessage);
...;
}
...

protected virtual async Task ProcessMessage(BrokeredMessage message)
{
// Simulating processing.
await Task.Delay(TimeSpan.FromSeconds(2));
}
}

The PriorityQueue.High and PriorityQueue.Low worker roles both override the default functionality of the
ProcessMessage method. The code below shows the ProcessMessage method for the PriorityQueue.High worker
role.

protected override async Task ProcessMessage(BrokeredMessage message)
{
// Simulate message processing for High priority messages.
await base.ProcessMessage(message);
Trace.TraceInformation("High priority message processed by " +
RoleEnvironment.CurrentRoleInstance.Id + " MessageId: " + message.MessageId);
}

When an application posts messages to the topic associated with the subscriptions used by the
PriorityQueue.High and PriorityQueue.Low worker roles, it specifies the priority by using the Priority custom
property, as shown in the following code example. This code (implemented in the WorkerRole class in the
PriorityQueue.Sender project), uses the SendBatchAsync helper method of the QueueManager class to post
messages to a topic in batches.
// Send a low priority batch.
var lowMessages = new List<BrokeredMessage>();

for (int i = 0; i < 10; i++)
{
var message = new BrokeredMessage() { MessageId = Guid.NewGuid().ToString() };
message.Properties["Priority"] = Priority.Low;
lowMessages.Add(message);
}

this.queueManager.SendBatchAsync(lowMessages).Wait();
...

// Send a high priority batch.
var highMessages = new List<BrokeredMessage>();

for (int i = 0; i < 10; i++)
{
var message = new BrokeredMessage() { MessageId = Guid.NewGuid().ToString() };
message.Properties["Priority"] = Priority.High;
highMessages.Add(message);
}

this.queueManager.SendBatchAsync(highMessages).Wait();

Related patterns and guidance


The following patterns and guidance might also be relevant when implementing this pattern:
A sample that demonstrates this pattern is available on GitHub.
Asynchronous Messaging Primer. A consumer service that processes a request might need to send a reply
to the instance of the application that posted the request. Provides information on the strategies that you
can use to implement request/response messaging.
Competing Consumers pattern. To increase the throughput of the queues, it’s possible to have multiple
consumers that listen on the same queue, and process the tasks in parallel. These consumers will compete
for messages, but only one should be able to process each message. Provides more information on the
benefits and tradeoffs of implementing this approach.
Throttling pattern. You can implement throttling by using queues. Priority messaging can be used to
ensure that requests from critical applications, or applications being run by high-value customers, are
given priority over requests from less important applications.
Autoscaling Guidance. It might be possible to scale the size of the pool of consumer processes handling a
queue depending on the length of the queue. This strategy can help to improve performance, especially
for pools handling high priority messages.
Enterprise Integration Patterns with Service Bus on Abhishek Lal’s blog.
Publisher-Subscriber pattern
11/2/2020 • 6 minutes to read

Enable an application to announce events to multiple interested consumers asynchronously, without coupling the
senders to the receivers.
Also called: Pub/sub messaging

Context and problem


In cloud-based and distributed applications, components of the system often need to provide information to other
components as events happen.
Asynchronous messaging is an effective way to decouple senders from consumers, and avoid blocking the sender
to wait for a response. However, using a dedicated message queue for each consumer does not effectively scale to
many consumers. Also, some of the consumers might be interested in only a subset of the information. How can
the sender announce events to all interested consumers without knowing their identities?

Solution
Introduce an asynchronous messaging subsystem that includes the following:
An input messaging channel used by the sender. The sender packages events into messages, using a known
message format, and sends these messages via the input channel. The sender in this pattern is also called
the publisher.

NOTE
A message is a packet of data. An event is a message that notifies other components about a change or an action
that has taken place.

One output messaging channel per consumer. The consumers are known as subscribers.
A mechanism for copying each message from the input channel to the output channels for all subscribers
interested in that message. This operation is typically handled by an intermediary such as a message
broker or event bus.
The following diagram shows the logical components of this pattern:
Pub/sub messaging has the following benefits:
It decouples subsystems that still need to communicate. Subsystems can be managed independently, and
messages can be properly managed even if one or more receivers are offline.
It increases scalability and improves responsiveness of the sender. The sender can quickly send a single
message to the input channel, then return to its core processing responsibilities. The messaging
infrastructure is responsible for ensuring messages are delivered to interested subscribers.
It improves reliability. Asynchronous messaging helps applications continue to run smoothly under
increased loads and handle intermittent failures more effectively.
It allows for deferred or scheduled processing. Subscribers can wait to pick up messages until off-peak
hours, or messages can be routed or processed according to a specific schedule.
It enables simpler integration between systems using different platforms, programming languages, or
communication protocols, as well as between on-premises systems and applications running in the cloud.
It facilitates asynchronous workflows across an enterprise.
It improves testability. Channels can be monitored and messages can be inspected or logged as part of an
overall integration test strategy.
It provides separation of concerns for your applications. Each application can focus on its core capabilities,
while the messaging infrastructure handles everything required to reliably route messages to multiple
consumers.
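
To make the publisher and subscriber roles concrete, the following sketch shows one way to publish an event to an Azure Service Bus topic and process it from a subscription using the Azure.Messaging.ServiceBus SDK. The connection string, topic name, subscription name, and payload are hypothetical placeholders, and the snippet is illustrative rather than a complete implementation.

    using System;
    using System.Threading.Tasks;
    using Azure.Messaging.ServiceBus;

    // Publisher: package an event into a message and send it to the input channel (a topic).
    await using var client = new ServiceBusClient("<service-bus-connection-string>");
    ServiceBusSender sender = client.CreateSender("orders-topic");
    await sender.SendMessageAsync(new ServiceBusMessage("{\"orderId\":42,\"status\":\"Created\"}"));

    // Subscriber: process messages delivered to this subscription's output channel.
    ServiceBusProcessor processor = client.CreateProcessor("orders-topic", "billing-subscription");
    processor.ProcessMessageAsync += async args =>
    {
        Console.WriteLine($"Received: {args.Message.Body}");
        await args.CompleteMessageAsync(args.Message);
    };
    processor.ProcessErrorAsync += args =>
    {
        Console.WriteLine(args.Exception.Message);
        return Task.CompletedTask;
    };
    await processor.StartProcessingAsync();

The broker (here, the Service Bus topic) is responsible for copying each published message to every subscription, so the publisher never needs to know which subscribers exist.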

Issues and considerations


Consider the following points when deciding how to implement this pattern:
Existing technologies. It is strongly recommended to use available messaging products and services that
support a publish-subscribe model, rather than building your own. In Azure, consider using Service Bus or
Event Grid. Other technologies that can be used for pub/sub messaging include Redis, RabbitMQ, and
Apache Kafka.
Subscription handling. The messaging infrastructure must provide mechanisms that consumers can use
to subscribe to or unsubscribe from available channels.
Security. Connecting to any message channel must be restricted by security policy to prevent
eavesdropping by unauthorized users or applications.
Subsets of messages. Subscribers are usually only interested in a subset of the messages distributed by a
publisher. Messaging services often allow subscribers to narrow the set of messages received by:
Topics. Each topic has a dedicated output channel, and each consumer can subscribe to all relevant
topics.
Content filtering. Messages are inspected and distributed based on the content of each message. Each
subscriber can specify the content it is interested in, as shown in the filter sketch after this list.
Wildcard subscribers. Consider allowing subscribers to subscribe to multiple topics via wildcards.
Bi-directional communication. The channels in a publish-subscribe system are treated as unidirectional.
If a specific subscriber needs to send acknowledgment or communicate status back to the publisher,
consider using the Request/Reply Pattern. This pattern uses one channel to send a message to the
subscriber, and a separate reply channel for communicating back to the publisher.
Message ordering. The order in which consumer instances receive messages isn't guaranteed, and
doesn't necessarily reflect the order in which the messages were created. Design the system to ensure that
message processing is idempotent to help eliminate any dependency on the order of message handling.
Message priority. Some solutions may require that messages are processed in a specific order. The
Priority Queue pattern provides a mechanism for ensuring specific messages are delivered before others.
Poison messages. A malformed message, or a task that requires access to resources that aren't available,
can cause a service instance to fail. The system should prevent such messages being returned to the queue.
Instead, capture and store the details of these messages elsewhere so that they can be analyzed if
necessary.
Repeated messages. The same message might be sent more than once. For example, the sender might
fail after posting a message. Then a new instance of the sender might start up and repeat the message. The
messaging infrastructure should implement duplicate message detection and removal (also known as de-
duping) based on message IDs in order to provide at-most-once delivery of messages.
Message expiration. A message might have a limited lifetime. If it isn't processed within this period, it
might no longer be relevant and should be discarded. A sender can specify an expiration time as part of the
data in the message. A receiver can examine this information before deciding whether to perform the
business logic associated with the message.
Message scheduling. A message might be temporarily embargoed and should not be processed until a
specific date and time. The message should not be available to a receiver until this time.
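
As an illustration of the content-filtering option above, the following sketch adds a SQL filter rule to a Service Bus subscription so that it only receives messages carrying a matching application property. The topic, subscription, rule name, and the Priority property are hypothetical.

    using Azure.Messaging.ServiceBus.Administration;

    // Restrict the "premium-subscription" to messages whose Priority property is "High".
    // Publishers would set this application property on each message they send.
    var adminClient = new ServiceBusAdministrationClient("<service-bus-connection-string>");
    await adminClient.CreateRuleAsync(
        "orders-topic",
        "premium-subscription",
        new CreateRuleOptions("HighPriorityOnly", new SqlRuleFilter("Priority = 'High'")));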

When to use this pattern


Use this pattern when:
An application needs to broadcast information to a significant number of consumers.
An application needs to communicate with one or more independently-developed applications or services,
which may use different platforms, programming languages, and communication protocols.
An application can send information to consumers without requiring real-time responses from the
consumers.
The systems being integrated are designed to support an eventual consistency model for their data.
An application needs to communicate information to multiple consumers, which may have different
availability requirements or uptime schedules than the sender.
This pattern might not be useful when:
An application has only a few consumers who need significantly different information from the producing
application.
An application requires near real-time interaction with consumers.

Example
The following diagram shows an enterprise integration architecture that uses Service Bus to coordinate
workflows, and Event Grid to notify subsystems of events that occur. For more information, see Enterprise
integration on Azure using message queues and events.
Related patterns and guidance
The following patterns and guidance might be relevant when implementing this pattern:
Choose between Azure services that deliver messages.
The Event-driven architecture style is an architecture style that uses pub/sub messaging.
Asynchronous Messaging Primer. Message queues are an asynchronous communications mechanism. If a
consumer service needs to send a reply to an application, it might be necessary to implement some form
of response messaging. The Asynchronous Messaging Primer provides information on how to implement
request/reply messaging using message queues.
Observer Pattern. The Publish-Subscribe pattern builds on the Observer pattern by decoupling subjects
from observers via asynchronous messaging.
Message Broker Pattern. Many messaging subsystems that support a publish-subscribe model are
implemented via a message broker.
Queue-Based Load Leveling pattern

Use a queue that acts as a buffer between a task and a service it invokes in order to smooth intermittent heavy
loads that can cause the service to fail or the task to time out. This can help to minimize the impact of peaks in
demand on availability and responsiveness for both the task and the service.

Context and problem


Many solutions in the cloud involve running tasks that invoke services. In this environment, if a service is
subjected to intermittent heavy loads, it can cause performance or reliability issues.
A service could be part of the same solution as the tasks that use it, or it could be a third-party service
providing access to frequently used resources such as a cache or a storage service. If the same service is used
by a number of tasks running concurrently, it can be difficult to predict the volume of requests to the service at
any time.
A service might experience peaks in demand that cause it to overload and be unable to respond to requests in
a timely manner. Flooding a service with a large number of concurrent requests can also result in the service
failing if it's unable to handle the contention these requests cause.

Solution
Refactor the solution and introduce a queue between the task and the service. The task and the service run
asynchronously. The task posts a message containing the data required by the service to a queue. The queue
acts as a buffer, storing the message until it's retrieved by the service. The service retrieves the messages from
the queue and processes them. Requests from a number of tasks, which can be generated at a highly variable
rate, can be passed to the service through the same message queue. This figure shows using a queue to level
the load on a service.

The queue decouples the tasks from the service, and the service can handle the messages at its own pace
regardless of the volume of requests from concurrent tasks. Additionally, there's no delay to a task if the
service isn't available at the time it posts a message to the queue.
This pattern provides the following benefits:
It can help to maximize availability because delays arising in services won't have an immediate and
direct impact on the application, which can continue to post messages to the queue even when the
service isn't available or isn't currently processing messages.
It can help to maximize scalability because both the number of queues and the number of services can
be varied to meet demand.
It can help to control costs because the number of service instances deployed only has to be adequate
to meet average load rather than the peak load.

Some services implement throttling when demand reaches a threshold beyond which the system
could fail. Throttling can reduce the functionality available. You can implement load leveling with
these services to ensure that this threshold isn't reached.

Issues and considerations


Consider the following points when deciding how to implement this pattern:
It's necessary to implement application logic that controls the rate at which services handle messages to
avoid overwhelming the target resource. Avoid passing spikes in demand to the next stage of the system.
Test the system under load to ensure that it provides the required leveling, and adjust the number of
queues and the number of service instances that handle messages to achieve this.
Message queues are a one-way communication mechanism. If a task expects a reply from a service, it
might be necessary to implement a mechanism that the service can use to send a response. For more
information, see the Asynchronous Messaging Primer.
Be careful if you apply autoscaling to services that are listening for requests on the queue. This can result in
increased contention for any resources that these services share and diminish the effectiveness of using
the queue to level the load.

When to use this pattern


This pattern is useful to any application that uses services that are subject to overloading.
This pattern isn't useful if the application expects a response from the service with minimal latency.

Example
A web app writes data to an external data store. If a large number of instances of the web app run
concurrently, the data store might be unable to respond to requests quickly enough, causing requests to time
out, be throttled, or otherwise fail. The following diagram shows a data store being overwhelmed by a large
number of concurrent requests from instances of an application.
To resolve this, you can use a queue to level the load between the application instances and the data store. An
Azure Functions app reads the messages from the queue and performs the read/write requests to the data
store. The application logic in the function app can control the rate at which it passes requests to the data
store, to prevent the store from being overwhelmed. (Otherwise the function app will just re-introduce the
same problem at the back end.)
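
A sketch of the consumer side of this example might look like the following, assuming an in-process Azure Functions app with a Service Bus queue trigger. The queue name, connection setting, and DataStore helper are hypothetical placeholders; the host's concurrency settings determine how quickly messages are drained, which is what levels the load on the data store.

    using System.Threading.Tasks;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Extensions.Logging;

    public static class WorkItemProcessor
    {
        [FunctionName("ProcessWorkItem")]
        public static async Task Run(
            [ServiceBusTrigger("work-items", Connection = "ServiceBusConnection")] string message,
            ILogger log)
        {
            log.LogInformation("Processing queued request: {message}", message);

            // Hypothetical helper that performs the read/write against the data store.
            await DataStore.SaveAsync(message);
        }
    }

    public static class DataStore
    {
        // Stand-in for the real data store client.
        public static Task SaveAsync(string payload) => Task.CompletedTask;
    }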

Related patterns and guidance


The following patterns and guidance might also be relevant when implementing this pattern:
Asynchronous Messaging Primer. Message queues are inherently asynchronous. It might be necessary
to redesign the application logic in a task if it's adapted from communicating directly with a service to
using a message queue. Similarly, it might be necessary to refactor a service to accept requests from a
message queue. Alternatively, it might be possible to implement a proxy service, as described in the
example.
Competing Consumers pattern. It might be possible to run multiple instances of a service, each acting
as a message consumer from the load-leveling queue. You can use this approach to adjust the rate at
which messages are received and passed to a service.
Throttling pattern. A simple way to implement throttling with a service is to use queue-based load
leveling and route all requests to a service through a message queue. The service can process requests
at a rate that ensures that resources required by the service aren't exhausted, and to reduce the amount
of contention that could occur.
Choose between Azure messaging services. Information about choosing a messaging and queuing
mechanism in Azure applications.
Improve scalability in an Azure web application. This reference architecture includes queue-based load
leveling as part of the architecture.
Retry pattern

Enable an application to handle transient failures when it tries to connect to a service or network resource, by
transparently retrying a failed operation. This can improve the stability of the application.

Context and problem


An application that communicates with elements running in the cloud has to be sensitive to the transient faults
that can occur in this environment. Faults include the momentary loss of network connectivity to components
and services, the temporary unavailability of a service, or timeouts that occur when a service is busy.
These faults are typically self-correcting, and if the action that triggered a fault is repeated after a suitable delay
it's likely to be successful. For example, a database service that's processing a large number of concurrent
requests can implement a throttling strategy that temporarily rejects any further requests until its workload has
eased. An application trying to access the database might fail to connect, but if it tries again after a delay it
might succeed.

Solution
In the cloud, transient faults aren't uncommon and an application should be designed to handle them elegantly
and transparently. This minimizes the effects faults can have on the business tasks the application is
performing.
If an application detects a failure when it tries to send a request to a remote service, it can handle the failure
using the following strategies:
Cancel. If the fault indicates that the failure isn't transient or is unlikely to be successful if repeated, the
application should cancel the operation and report an exception. For example, an authentication failure
caused by providing invalid credentials is not likely to succeed no matter how many times it's attempted.
Retry. If the specific fault reported is unusual or rare, it might have been caused by unusual
circumstances such as a network packet becoming corrupted while it was being transmitted. In this case,
the application could retry the failing request immediately, because the same failure is unlikely to
be repeated and the request will probably be successful.
Retry after delay. If the fault is caused by one of the more commonplace connectivity or busy failures,
the network or service might need a short period while the connectivity issues are corrected or the
backlog of work is cleared. The application should wait for a suitable time before retrying the request.
For the more common transient failures, the period between retries should be chosen to spread requests from
multiple instances of the application as evenly as possible. This reduces the chance of a busy service continuing
to be overloaded. If many instances of an application are continually overwhelming a service with retry
requests, it'll take the service longer to recover.
If the request still fails, the application can wait and make another attempt. If necessary, this process can be
repeated with increasing delays between retry attempts, until some maximum number of requests have been
attempted. The delay can be increased incrementally or exponentially, depending on the type of failure and the
probability that it'll be corrected during this time.
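
For example, an exponentially increasing delay with a small amount of random jitter might be calculated as in the following sketch. The base, cap, and jitter values are arbitrary illustrations, not prescribed settings.

    using System;

    static TimeSpan GetRetryDelay(int attempt)
    {
        // Exponential backoff: 1s, 2s, 4s, 8s, ... capped at 30 seconds, plus up to
        // 500 ms of jitter to spread retries from multiple application instances.
        double backoffSeconds = Math.Min(Math.Pow(2, attempt), 30);
        int jitterMilliseconds = Random.Shared.Next(0, 500);
        return TimeSpan.FromSeconds(backoffSeconds) + TimeSpan.FromMilliseconds(jitterMilliseconds);
    }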
The following diagram illustrates invoking an operation in a hosted service using this pattern. If the request is
unsuccessful after a predefined number of attempts, the application should treat the fault as an exception and
handle it accordingly.

The application should wrap all attempts to access a remote service in code that implements a retry policy
matching one of the strategies listed above. Requests sent to different services can be subject to different
policies. Some vendors provide libraries that implement retry policies, where the application can specify the
maximum number of retries, the time between retry attempts, and other parameters.
An application should log the details of faults and failing operations. This information is useful to operators. If a
service is frequently unavailable or busy, it's often because the service has exhausted its resources. You can
reduce the frequency of these faults by scaling out the service. For example, if a database service is continually
overloaded, it might be beneficial to partition the database and spread the load across multiple servers.

Microsoft Entity Framework provides facilities for retrying database operations. Also, most Azure services
and client SDKs include a retry mechanism. For more information, see Retry guidance for specific services.

Issues and considerations


You should consider the following points when deciding how to implement this pattern.
The retry policy should be tuned to match the business requirements of the application and the nature of the
failure. For some noncritical operations, it's better to fail fast rather than retry several times and impact the
throughput of the application. For example, in an interactive web application accessing a remote service, it's
better to fail after a smaller number of retries with only a short delay between retry attempts, and display a
suitable message to the user (for example, “please try again later”). For a batch application, it might be more
appropriate to increase the number of retry attempts with an exponentially increasing delay between attempts.
An aggressive retry policy with minimal delay between attempts, and a large number of retries, could further
degrade a busy service that's running close to or at capacity. This retry policy could also affect the
responsiveness of the application if it's continually trying to perform a failing operation.
If a request still fails after a significant number of retries, it's better for the application to prevent further
requests going to the same resource for a period and simply report a failure immediately. When that period
expires, the application can tentatively allow one or more requests through to see whether they're successful.
For more details of this strategy, see the Circuit Breaker pattern.
Consider whether the operation is idempotent. If so, it's inherently safe to retry. Otherwise, retries could cause
the operation to be executed more than once, with unintended side effects. For example, a service might receive
the request, process the request successfully, but fail to send a response. At that point, the retry logic might re-
send the request, assuming that the first request wasn't received.
A request to a service can fail for a variety of reasons raising different exceptions depending on the nature of
the failure. Some exceptions indicate a failure that can be resolved quickly, while others indicate that the failure
is longer lasting. It's useful for the retry policy to adjust the time between retry attempts based on the type of
the exception.
Consider how retrying an operation that's part of a transaction will affect the overall transaction consistency.
Fine tune the retry policy for transactional operations to maximize the chance of success and reduce the need
to undo all the transaction steps.
Ensure that all retry code is fully tested against a variety of failure conditions. Check that it doesn't severely
impact the performance or reliability of the application, cause excessive load on services and resources, or
generate race conditions or bottlenecks.
Implement retry logic only where the full context of a failing operation is understood. For example, if a task that
contains a retry policy invokes another task that also contains a retry policy, this extra layer of retries can add
long delays to the processing. It might be better to configure the lower-level task to fail fast and report the
reason for the failure back to the task that invoked it. This higher-level task can then handle the failure based on
its own policy.
It's important to log all connectivity failures that cause a retry so that underlying problems with the application,
services, or resources can be identified.
Investigate the faults that are most likely to occur for a service or a resource to discover if they're likely to be
long lasting or terminal. If they are, it's better to handle the fault as an exception. The application can report or
log the exception, and then try to continue either by invoking an alternative service (if one is available), or by
offering degraded functionality. For more information on how to detect and handle long-lasting faults, see the
Circuit Breaker pattern.

When to use this pattern


Use this pattern when an application could experience transient faults as it interacts with a remote service or
accesses a remote resource. These faults are expected to be short lived, and repeating a request that has
previously failed could succeed on a subsequent attempt.
This pattern might not be useful:
When a fault is likely to be long lasting, because this can affect the responsiveness of an application. The
application might be wasting time and resources trying to repeat a request that's likely to fail.
For handling failures that aren't due to transient faults, such as internal exceptions caused by errors in the
business logic of an application.
As an alternative to addressing scalability issues in a system. If an application experiences frequent busy
faults, it's often a sign that the service or resource being accessed should be scaled up.

Example
This example in C# illustrates an implementation of the Retry pattern. The OperationWithBasicRetryAsync
method, shown below, invokes an external service asynchronously through the TransientOperationAsync
method. The details of the TransientOperationAsync method will be specific to the service and are omitted from
the sample code.
private int retryCount = 3;
private readonly TimeSpan delay = TimeSpan.FromSeconds(5);

public async Task OperationWithBasicRetryAsync()
{
    int currentRetry = 0;

    for (;;)
    {
        try
        {
            // Call external service.
            await TransientOperationAsync();

            // Return or break.
            break;
        }
        catch (Exception ex)
        {
            Trace.TraceError("Operation Exception");

            currentRetry++;

            // Check if the exception thrown was a transient exception
            // based on the logic in the error detection strategy.
            // Determine whether to retry the operation, as well as how
            // long to wait, based on the retry strategy.
            if (currentRetry > this.retryCount || !IsTransient(ex))
            {
                // If this isn't a transient error or we shouldn't retry,
                // rethrow the exception.
                throw;
            }
        }

        // Wait to retry the operation.
        // Consider calculating an exponential delay here and
        // using a strategy best suited for the operation and fault.
        await Task.Delay(delay);
    }
}

// Async method that wraps a call to a remote service (details not shown).
private async Task TransientOperationAsync()
{
    ...
}

The statement that invokes this method is contained in a try/catch block wrapped in a for loop. The for loop
exits if the call to the TransientOperationAsync method succeeds without throwing an exception. If the
TransientOperationAsync method fails, the catch block examines the reason for the failure. If it's believed to be a
transient error the code waits for a short delay before retrying the operation.
The for loop also tracks the number of times that the operation has been attempted, and if the code fails three
times the exception is assumed to be more long lasting. If the exception isn't transient or it's long lasting, the
catch handler throws an exception. This exception exits the for loop and should be caught by the code that
invokes the OperationWithBasicRetryAsync method.
The IsTransient method, shown below, checks for a specific set of exceptions that are relevant to the
environment the code is run in. The definition of a transient exception will vary according to the resources
being accessed and the environment the operation is being performed in.
private bool IsTransient(Exception ex)
{
    // Determine if the exception is transient.
    // In some cases this is as simple as checking the exception type, in other
    // cases it might be necessary to inspect other properties of the exception.
    if (ex is OperationTransientException)
        return true;

    var webException = ex as WebException;
    if (webException != null)
    {
        // If the web exception contains one of the following status values
        // it might be transient.
        return new[] {WebExceptionStatus.ConnectionClosed,
                      WebExceptionStatus.Timeout,
                      WebExceptionStatus.RequestCanceled }.
                     Contains(webException.Status);
    }

    // Additional exception checking logic goes here.
    return false;
}

Related patterns and guidance


Circuit Breaker pattern. If a failure is expected to be more long lasting, it might be more appropriate to
implement the Circuit Breaker pattern. Combining the Retry and Circuit Breaker patterns provides a
comprehensive approach to handling faults.
For most Azure services, the client SDKs include built-in retry logic. For more information, see Retry
guidance for Azure services.
Before writing custom retry logic, consider using a general framework such as Polly for .NET or
Resilience4j for Java (see the sketch after this list).
When processing commands that change business data, be aware that retries can result in the action
being performed twice, which could be problematic if that action is something like charging a customer's
credit card. Using the Idempotence pattern described in this blog post can help deal with these
situations.
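
As referenced in the list above, a framework such as Polly can express a retry policy declaratively instead of hand-writing the loop shown earlier. The following is an illustrative sketch; the endpoint URL is a placeholder and the policy values are examples only.

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;
    using Polly;

    var httpClient = new HttpClient();

    // Retry up to three times on HTTP failures, waiting 2, 4, and then 8 seconds.
    var retryPolicy = Policy
        .Handle<HttpRequestException>()
        .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

    HttpResponseMessage response = await retryPolicy.ExecuteAsync(
        () => httpClient.GetAsync("https://example.com/api/orders"));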
Saga distributed transactions pattern

The saga design pattern is a way to manage data consistency across microservices in distributed transaction
scenarios. A saga is a sequence of transactions that updates each service and publishes a message or event to
trigger the next transaction step. If a step fails, the saga executes compensating transactions that counteract the
preceding transactions.

Context and problem


A transaction is a single unit of logic or work, sometimes made up of multiple operations. Within a transaction, an
event is a state change that occurs to an entity, and a command encapsulates all information needed to perform an
action or trigger a later event.
Transactions must be atomic, consistent, isolated, and durable (ACID). Transactions within a single service are ACID,
but cross-service data consistency requires a cross-service transaction management strategy.
In multiservice architectures:
Atomicity is an indivisible and irreducible set of operations that must all occur or none occur.
Consistency means the transaction brings the data only from one valid state to another valid state.
Isolation guarantees that concurrent transactions produce the same data state that sequentially executed
transactions would have produced.
Durability ensures that committed transactions remain committed even in case of system failure or power
outage.
A database-per-microservice model provides many benefits for microservices architectures. Encapsulating domain
data lets each service use its best data store type and schema, scale its own data store as necessary, and be
insulated from other services' failures. However, ensuring data consistency across service-specific databases poses
challenges.
Distributed transactions like the two-phase commit (2PC) protocol require all participants in a transaction to
commit or roll back before the transaction can proceed. However, some participant implementations, such as
NoSQL databases and message brokers, don't support this model.
Another distributed transaction limitation is interprocess communication (IPC) synchronicity and availability.
Operating system-provided IPC allows separate processes to share data. For distributed transactions to commit, all
participating services must be available, potentially reducing overall system availability. Architectural
implementations with IPC or transaction limitations are candidates for the saga pattern.

Solution
The saga pattern provides transaction management using a sequence of local transactions. A local transaction is
the atomic work effort performed by a saga participant. Each local transaction updates the database and publishes
a message or event to trigger the next local transaction in the saga. If a local transaction fails, the saga executes a
series of compensating transactions that undo the changes that were made by the preceding local transactions.
In saga patterns:
Compensable transactions are transactions that can potentially be reversed by processing another transaction
with the opposite effect.
A pivot transaction is the go/no-go point in a saga. If the pivot transaction commits, the saga runs until
completion. A pivot transaction can be a transaction that is neither compensable nor retryable, or it can be the
last compensable transaction or the first retryable transaction in the saga.
Retryable transactions are transactions that follow the pivot transaction and are guaranteed to succeed.
There are two common saga implementation approaches, choreography and orchestration. Each approach has its
own set of challenges and technologies to coordinate the workflow.
Choreography
Choreography is a way to coordinate sagas where participants exchange events without a centralized point of
control. With choreography, each local transaction publishes domain events that trigger local transactions in other
services.

Benefits
Good for simple workflows that require few participants and don't need a coordination logic.
Doesn't require additional service implementation and maintenance.
Doesn't introduce a single point of failure, since the responsibilities are distributed across the saga participants.
Drawbacks
Workflow can become confusing when adding new steps, as it's difficult to track which saga participants listen
to which commands.
There's a risk of cyclic dependency between saga participants because they have to consume each other's
commands.
Integration testing is difficult because all services must be running to simulate a transaction.
Orchestration
Orchestration is a way to coordinate sagas where a centralized controller tells the saga participants what local
transactions to execute. The saga orchestrator handles all the transactions and tells the participants which
operation to perform based on events. The orchestrator executes saga requests, stores and interprets the states of
each task, and handles failure recovery with compensating transactions.

Benefits
Good for complex workflows involving many participants or new participants added over time.
Suitable when there is control over every participant in the process, and control over the flow of activities.
Doesn't introduce cyclical dependencies, because the orchestrator unilaterally depends on the saga participants.
Saga participants don't need to know about commands for other participants. Clear separation of concerns
simplifies business logic.
Drawbacks
Additional design complexity requires an implementation of a coordination logic.
There's an additional point of failure, because the orchestrator manages the complete workflow.
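
To make the orchestration approach more concrete, the following sketch shows a minimal orchestrator that runs local transactions in sequence and, if one fails, executes the compensating transactions in reverse order. The SagaStep type and the orchestration logic are illustrative assumptions, not a specific Azure service or the reference implementation linked below.

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    // A saga step pairs a local transaction with its compensating transaction.
    public record SagaStep(Func<Task> Execute, Func<Task> Compensate);

    public static class SagaOrchestrator
    {
        public static async Task RunAsync(IEnumerable<SagaStep> steps)
        {
            var completed = new Stack<SagaStep>();

            foreach (SagaStep step in steps)
            {
                try
                {
                    await step.Execute();
                    completed.Push(step);
                }
                catch
                {
                    // A local transaction failed: undo the preceding steps in reverse order.
                    while (completed.Count > 0)
                    {
                        await completed.Pop().Compensate();
                    }
                    throw;
                }
            }
        }
    }

A production orchestrator would also persist the saga state durably so that compensation can resume after a process failure.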

Issues and considerations


Consider the following points when implementing the saga pattern:
The saga pattern may initially be challenging, as it requires a new way of thinking on how to coordinate a
transaction and maintain data consistency for a business process spanning multiple microservices.
The saga pattern is particularly hard to debug, and the complexity grows as participants increase.
Data can't be rolled back, because saga participants commit changes to their local databases.
The implementation must be capable of handling a set of potential transient failures, and provide idempotence
for reducing side-effects and ensuring data consistency. Idempotence means that the same operation can be
repeated multiple times without changing the initial result.
It's best to implement observability to monitor and track the saga workflow.
The lack of participant data isolation imposes durability challenges. The saga implementation must include
countermeasures to reduce anomalies.
The following anomalies can happen without proper measures:
Lost updates, when one saga writes without reading changes made by another saga.
Dirty reads, when a transaction or a saga reads updates made by a saga that has not yet completed those
updates.
Fuzzy/nonrepeatable reads, when different saga steps read different data because a data update occurs
between the reads.
Suggested countermeasures to reduce or prevent anomalies include:
Semantic lock, an application-level lock where a saga's compensable transaction uses a semaphore to indicate
an update is in progress.
Commutative updates that can be executed in any order and produce the same result.
Pessimistic view: It's possible for one saga to read dirty data, while another saga is running a compensable
transaction to roll back the operation. Pessimistic view reorders the saga so the underlying data updates in a
retryable transaction, which eliminates the possibility of a dirty read.
Reread value verifies that data is unchanged, and then updates the record. If the record has changed, the steps
abort and the saga may restart.
A version file records the operations on a record as they arrive, and then executes them in the correct order.
By value uses each request's business risk to dynamically select the concurrency mechanism. Low-risk requests
favor sagas, while high-risk requests favor distributed transactions.

When to use this pattern


Use the saga pattern when you need to:
Ensure data consistency in a distributed system without tight coupling.
Roll back or compensate if one of the operations in the sequence fails.
The saga pattern is less suitable for:
Tightly coupled transactions.
Compensating transactions that occur in earlier participants.
Cyclic dependencies.

Example
Orchestration-based Saga on Serverless is a saga implementation reference using the orchestration approach that
simulates a money transfer scenario with successful and failed workflows.

Related patterns
The following patterns might also be useful when implementing this pattern:
Choreography has each component of the system participate in the decision-making process about the
workflow of a business transaction, instead of relying on a central point of control.
Compensating transactions undo work performed by a series of steps, and eventually define a consistent
operation if one or more steps fail. Cloud-hosted applications that implement complex business processes and
workflows often follow this eventual consistency model.
Retry lets an application handle transient failures when it tries to connect to a service or network resource, by
transparently retrying the failed operation. Retry can improve the stability of the application.
Circuit breaker handles faults that take a variable amount of time to recover from, when connecting to a remote
service or resource. Circuit breaker can improve the stability and resiliency of an application.
Health endpoint monitoring implements functional checks in an application that external tools can access
through exposed endpoints at regular intervals. Health endpoint monitoring can help verify that applications
and services are performing correctly.

Related resources
Distributed data
Richardson, Chris. 2018: Microservices Patterns. Manning Publications.
Scheduler Agent Supervisor pattern

Coordinate a set of distributed actions as a single operation. If any of the actions fail, try to handle the failures
transparently, or else undo the work that was performed, so the entire operation succeeds or fails as a whole.
This can add resiliency to a distributed system, by enabling it to recover and retry actions that fail due to
transient exceptions, long-lasting faults, and process failures.

Context and problem


An application performs tasks that include a number of steps, some of which might invoke remote services or
access remote resources. The individual steps might be independent of each other, but they are orchestrated by
the application logic that implements the task.
Whenever possible, the application should ensure that the task runs to completion and resolve any failures that
might occur when accessing remote services or resources. Failures can occur for many reasons. For example, the
network might be down, communications could be interrupted, a remote service might be unresponsive or in an
unstable state, or a remote resource might be temporarily inaccessible, perhaps due to resource constraints. In
many cases the failures will be transient and can be handled by using the Retry pattern.
If the application detects a more permanent fault it can't easily recover from, it must be able to restore the
system to a consistent state and ensure integrity of the entire operation.

Solution
The Scheduler Agent Supervisor pattern defines the following actors. These actors orchestrate the steps to be
performed as part of the overall task.
The Scheduler arranges for the steps that make up the task to be executed and orchestrates their
operation. These steps can be combined into a pipeline or workflow. The Scheduler is responsible for
ensuring that the steps in this workflow are performed in the right order. As each step is performed, the
Scheduler records the state of the workflow, such as "step not yet started," "step running," or "step
completed." The state information should also include an upper limit of the time allowed for the step to
finish, called the complete-by time. If a step requires access to a remote service or resource, the Scheduler
invokes the appropriate Agent, passing it the details of the work to be performed. The Scheduler typically
communicates with an Agent using asynchronous request/response messaging. This can be implemented
using queues, although other distributed messaging technologies could be used instead.

The Scheduler performs a similar function to the Process Manager in the Process Manager pattern.
The actual workflow is typically defined and implemented by a workflow engine that's controlled by
the Scheduler. This approach decouples the business logic in the workflow from the Scheduler.

The Agent contains logic that encapsulates a call to a remote service, or access to a remote resource
referenced by a step in a task. Each Agent typically wraps calls to a single service or resource,
implementing the appropriate error handling and retry logic (subject to a timeout constraint, described
later). If the steps in the workflow being run by the Scheduler use several services and resources across
different steps, each step might reference a different Agent (this is an implementation detail of the
pattern).
The Supervisor monitors the status of the steps in the task being performed by the Scheduler. It runs
periodically (the frequency will be system-specific), and examines the status of steps maintained by the
Scheduler. If it detects any that have timed out or failed, it arranges for the appropriate Agent to recover
the step or execute the appropriate remedial action (this might involve modifying the status of a step).
Note that the recovery or remedial actions are implemented by the Scheduler and Agents. The Supervisor
should simply request that these actions be performed.
The Scheduler, Agent, and Supervisor are logical components and their physical implementation depends on the
technology being used. For example, several logical agents might be implemented as part of a single web
service.
The Scheduler maintains information about the progress of the task and the state of each step in a durable data
store, called the state store. The Supervisor can use this information to help determine whether a step has failed.
The figure illustrates the relationship between the Scheduler, the Agents, the Supervisor, and the state store.

NOTE
This diagram shows a simplified version of the pattern. In a real implementation, there might be many instances of the
Scheduler running concurrently, each handling a subset of tasks. Similarly, the system could run multiple instances of each Agent, or
even multiple Supervisors. In this case, Supervisors must coordinate their work with each other carefully to ensure that
they don’t compete to recover the same failed steps and tasks. The Leader Election pattern provides one possible solution
to this problem.

When the application is ready to run a task, it submits a request to the Scheduler. The Scheduler records initial
state information about the task and its steps (for example, step not yet started) in the state store and then starts
performing the operations defined by the workflow. As the Scheduler starts each step, it updates the information
about the state of that step in the state store (for example, step running).
If a step references a remote service or resource, the Scheduler sends a message to the appropriate Agent. The
message contains the information that the Agent needs to pass to the service or access the resource, in addition
to the complete-by time for the operation. If the Agent completes its operation successfully, it returns a response
to the Scheduler. The Scheduler can then update the state information in the state store (for example, step
completed) and perform the next step. This process continues until the entire task is complete.
An Agent can implement any retry logic that's necessary to perform its work. However, if the Agent doesn't
complete its work before the complete-by period expires, the Scheduler will assume that the operation has
failed. In this case, the Agent should stop its work and not try to return anything to the Scheduler (not even an
error message), or try any form of recovery. The reason for this restriction is that, after a step has timed out or
failed, another instance of the Agent might be scheduled to run the failing step (this process is described later).
If the Agent fails, the Scheduler won't receive a response. The pattern doesn't make a distinction between a step
that has timed out and one that has genuinely failed.
If a step times out or fails, the state store will contain a record that indicates that the step is running, but the
complete-by time will have passed. The Supervisor looks for steps like this and tries to recover them. One
possible strategy is for the Supervisor to update the complete-by value to extend the time available to complete
the step, and then send a message to the Scheduler identifying the step that has timed out. The Scheduler can
then try to repeat this step. However, this design requires the tasks to be idempotent.
The Supervisor might need to prevent the same step from being retried if it continually fails or times out. To do
this, the Supervisor could maintain a retry count for each step, along with the state information, in the state
store. If this count exceeds a predefined threshold the Supervisor can adopt a strategy of waiting for an extended
period before notifying the Scheduler that it should retry the step, in the expectation that the fault will be
resolved during this period. Alternatively, the Supervisor can send a message to the Scheduler to request the
entire task be undone by implementing a Compensating Transaction pattern. This approach will depend on the
Scheduler and Agents providing the information necessary to implement the compensating operations for each
step that completed successfully.

It isn't the purpose of the Supervisor to monitor the Scheduler and Agents, and restart them if they fail. This
aspect of the system should be handled by the infrastructure these components are running in. Similarly, the
Supervisor shouldn't have knowledge of the actual business operations that the tasks being performed by
the Scheduler are running (including how to compensate should these tasks fail). This is the purpose of the
workflow logic implemented by the Scheduler. The sole responsibility of the Supervisor is to determine
whether a step has failed and arrange either for it to be repeated or for the entire task containing the failed
step to be undone.

If the Scheduler is restarted after a failure, or the workflow being performed by the Scheduler terminates
unexpectedly, the Scheduler should be able to determine the status of any inflight task that it was handling when
it failed, and be prepared to resume this task from that point. The implementation details of this process are
likely to be system-specific. If the task can't be recovered, it might be necessary to undo the work already
performed by the task. This might also require implementing a compensating transaction.
The key advantage of this pattern is that the system is resilient in the event of unexpected temporary or
unrecoverable failures. The system can be constructed to be self-healing. For example, if an Agent or the
Scheduler fails, a new one can be started and the Supervisor can arrange for a task to be resumed. If the
Supervisor fails, another instance can be started and can take over from where the failure occurred. If the
Supervisor is scheduled to run periodically, a new instance can be automatically started after a predefined
interval. The state store can be replicated to reach an even greater degree of resiliency.

Issues and considerations


You should consider the following points when deciding how to implement this pattern:
This pattern can be difficult to implement and requires thorough testing of each possible failure mode of
the system.
The recovery/retry logic implemented by the Scheduler is complex and dependent on state information
held in the state store. It might also be necessary to record the information required to implement a
compensating transaction in a durable data store.
How often the Supervisor runs will be important. It should run often enough to prevent any failed steps
from blocking an application for an extended period, but it shouldn't run so often that it becomes an
overhead.
The steps performed by an Agent could be run more than once. The logic that implements these steps
should be idempotent.

When to use this pattern


Use this pattern when a process that runs in a distributed environment, such as the cloud, must be resilient to
communications failure and/or operational failure.
This pattern might not be suitable for tasks that don't invoke remote services or access remote resources.

Example
A web application that implements an ecommerce system has been deployed on Microsoft Azure. Users can run
this application to browse the available products and to place orders. The user interface runs as a web role, and
the order processing elements of the application are implemented as a set of worker roles. Part of the order
processing logic involves accessing a remote service, and this aspect of the system could be prone to transient
or more long-lasting faults. For this reason, the designers used the Scheduler Agent Supervisor pattern to
implement the order processing elements of the system.
When a customer places an order, the application constructs a message that describes the order and posts this
message to a queue. A separate submission process, running in a worker role, retrieves the message, inserts the
order details into the orders database, and creates a record for the order process in the state store. Note that the
inserts into the orders database and the state store are performed as part of the same operation. The
submission process is designed to ensure that both inserts complete together.
The state information that the submission process creates for the order includes:
OrderID. The ID of the order in the orders database.
LockedBy. The instance ID of the worker role handling the order. There might be multiple current
instances of the worker role running the Scheduler, but each order should only be handled by a single
instance.
CompleteBy. The time the order should be processed by.
ProcessState. The current state of the task handling the order. The possible states are:
Pending. The order has been created but processing hasn't yet been started.
Processing. The order is currently being processed.
Processed. The order has been processed successfully.
Error. The order processing has failed.
FailureCount. The number of times that processing has been tried for the order.
In this state information, the OrderID field is copied from the order ID of the new order. The LockedBy and
CompleteBy fields are set to null, the ProcessState field is set to Pending, and the FailureCount field is set to
0.
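
A sketch of how this state record might be modeled in code is shown below; the type and enum names are illustrative and not part of the original sample.

    using System;

    public enum OrderProcessState
    {
        Pending,
        Processing,
        Processed,
        Error
    }

    public class OrderStateRecord
    {
        public string OrderID { get; set; }             // ID of the order in the orders database.
        public string LockedBy { get; set; }            // Instance ID of the worker role handling the order.
        public DateTimeOffset? CompleteBy { get; set; } // Time by which the order should be processed.
        public OrderProcessState ProcessState { get; set; } = OrderProcessState.Pending;
        public int FailureCount { get; set; }           // Number of times processing has been tried.
    }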
NOTE
In this example, the order handling logic is relatively simple and only has a single step that invokes a remote service. In a
more complex multistep scenario, the submission process would likely involve several steps, and so several records would
be created in the state store — each one describing the state of an individual step.

The Scheduler also runs as part of a worker role and implements the business logic that handles the order. An
instance of the Scheduler polling for new orders examines the state store for records where the LockedBy field is
null and the ProcessState field is pending. When the Scheduler finds a new order, it immediately populates the
LockedBy field with its own instance ID, sets the CompleteBy field to an appropriate time, and sets the
ProcessState field to processing. The code is designed to be exclusive and atomic to ensure that two concurrent
instances of the Scheduler can't try to handle the same order simultaneously.
The Scheduler then runs the business workflow to process the order asynchronously, passing it the value in the
OrderID field from the state store. The workflow handling the order retrieves the details of the order from the
orders database and performs its work. When a step in the order processing workflow needs to invoke the
remote service, it uses an Agent. The workflow step communicates with the Agent using a pair of Azure Service
Bus message queues acting as a request/response channel. The figure shows a high-level view of the solution.
The message sent to the Agent from a workflow step describes the order and includes the complete-by time. If
the Agent receives a response from the remote service before the complete-by time expires, it posts a reply
message on the Service Bus queue on which the workflow is listening. When the workflow step receives the
valid reply message, it completes its processing and the Scheduler sets the ProcessState field of the order state
to processed. At this point, the order processing has completed successfully.
If the complete-by time expires before the Agent receives a response from the remote service, the Agent simply
halts its processing and terminates handling the order. Similarly, if the workflow handling the order exceeds the
complete-by time, it also terminates. In both cases, the state of the order in the state store remains set to
processing, but the complete-by time indicates that the time for processing the order has passed and the
process is deemed to have failed. Note that if the Agent that's accessing the remote service, or the workflow
that's handling the order (or both) terminate unexpectedly, the information in the state store will again remain
set to processing and eventually will have an expired complete-by value.
If the Agent detects an unrecoverable, nontransient fault while it's trying to contact the remote service, it can
send an error response back to the workflow. The Scheduler can set the status of the order to error and raise an
event that alerts an operator. The operator can then try to resolve the reason for the failure manually and
resubmit the failed processing step.
The Supervisor periodically examines the state store looking for orders with an expired complete-by value. If the
Supervisor finds a record, it increments the FailureCount field. If the failure count value is below a specified
threshold value, the Supervisor resets the LockedBy field to null, updates the CompleteBy field with a new
expiration time, and sets the ProcessState field to pending. An instance of the Scheduler can pick up this order
and perform its processing as before. If the failure count value exceeds a specified threshold, the reason for the
failure is assumed to be nontransient. The Supervisor sets the status of the order to error and raises an event
that alerts an operator.
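
The Supervisor's recovery pass might resemble the following sketch, which reuses the illustrative OrderStateRecord type shown earlier. The state store abstraction, the five-minute extension, and the failure threshold are assumptions for illustration only.

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    // Hypothetical abstraction over the durable state store.
    public interface IStateStore
    {
        Task<IReadOnlyList<OrderStateRecord>> GetExpiredProcessingRecordsAsync();
        Task UpdateAsync(OrderStateRecord record);
    }

    public class Supervisor
    {
        private readonly IStateStore stateStore;
        private readonly int maxFailures;

        public Supervisor(IStateStore stateStore, int maxFailures)
        {
            this.stateStore = stateStore;
            this.maxFailures = maxFailures;
        }

        // Run periodically, for example from a timer-triggered worker.
        public async Task RecoverExpiredOrdersAsync()
        {
            foreach (OrderStateRecord record in await stateStore.GetExpiredProcessingRecordsAsync())
            {
                record.FailureCount++;

                if (record.FailureCount <= maxFailures)
                {
                    // Release the order so an instance of the Scheduler can retry it.
                    record.LockedBy = null;
                    record.CompleteBy = DateTimeOffset.UtcNow.AddMinutes(5);
                    record.ProcessState = OrderProcessState.Pending;
                }
                else
                {
                    // The failure is assumed to be nontransient; flag the order for an operator.
                    record.ProcessState = OrderProcessState.Error;
                }

                await stateStore.UpdateAsync(record);
            }
        }
    }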

In this example, the Supervisor is implemented in a separate worker role. You can use a variety of strategies
to arrange for the Supervisor task to be run, including using the Azure Scheduler service (not to be confused
with the Scheduler component in this pattern). For more information about the Azure Scheduler service, visit
the Scheduler page.

Although it isn't shown in this example, the Scheduler might need to keep the application that submitted the
order informed about the progress and status of the order. The application and the Scheduler are isolated from
each other to eliminate any dependencies between them. The application has no knowledge of which instance of
the Scheduler is handling the order, and the Scheduler is unaware of which specific application instance posted
the order.
To allow the order status to be reported, the application could use its own private response queue. The details of
this response queue would be included as part of the request sent to the submission process, which would
include this information in the state store. The Scheduler would then post messages to this queue indicating the
status of the order (request received, order completed, order failed, and so on). It should include the order ID in
these messages so they can be correlated with the original request by the application.

Related patterns and guidance


The following patterns and guidance might also be relevant when implementing this pattern:
Retry pattern. An Agent can use this pattern to transparently retry an operation that accesses a remote
service or resource that has previously failed. Use when the expectation is that the cause of the failure is
transient and can be corrected.
Circuit Breaker pattern. An Agent can use this pattern to handle faults that take a variable amount of time to
correct when connecting to a remote service or resource.
Compensating Transaction pattern. If the workflow being performed by a Scheduler can't be completed
successfully, it might be necessary to undo any work it's previously performed. The Compensating
Transaction pattern describes how this can be achieved for operations that follow the eventual consistency
model. These types of operations are commonly implemented by a Scheduler that performs complex
business processes and workflows.
Asynchronous Messaging Primer. The components in the Scheduler Agent Supervisor pattern typically run
decoupled from each other and communicate asynchronously. Describes some of the approaches that can be
used to implement asynchronous communication based on message queues.
Leader Election pattern. It might be necessary to coordinate the actions of multiple instances of a Supervisor
to prevent them from attempting to recover the same failed process. The Leader Election pattern describes
how to do this.
Cloud Architecture: The Scheduler-Agent-Supervisor Pattern on Clemens Vasters' blog
Process Manager pattern
Reference 6: A Saga on Sagas. An example showing how the CQRS pattern uses a process manager (part of
the CQRS Journey guidance).
Microsoft Azure Scheduler
Sequential Convoy pattern

Process a set of related messages in a defined order, without blocking processing of other groups of messages.

Context and problem


Applications often need to process a sequence of messages in the order they arrive, while still being able to scale
out to handle increased load. In a distributed architecture, processing these messages in order is not
straightforward, because the workers can scale independently and often pull messages independently, using a
Competing Consumers pattern.
For example, an order tracking system receives a ledger containing orders and the relevant operations on those
orders. These operations could be to create an order, add a transaction to the order, modify a past transaction, or
delete an order. In this system, operations must be performed in a first-in-first-out (FIFO) manner, but only at the
order level. However, the initial queue receives a ledger containing transactions for many orders, which may be
interleaved.

Solution
Push related messages into categories within the queuing system, and have the queue listeners lock and pull only
from one category, one message at a time.
Here's what the general Sequential Convoy pattern looks like:

In the queue, messages for different categories may be interleaved, as shown in the following diagram:

Issues and considerations


Consider the following points when deciding how to implement this pattern:
Category/scale unit. What property of your incoming messages can you scale out on? In the order tracking
scenario, this property is the order ID.
Throughput. What is your target message throughput? If it is very high, you may need to reconsider your FIFO
requirements. For example, can you enforce a start/end message, sort by time, then send a batch for
processing?
Service capabilities. Does your choice of message bus allow for one-at-a-time processing of messages within a
queue or category of a queue?
Evolvability. How will you add a new category of message to the system? For example, suppose the ledger
system described above is specific to one customer. If you needed to onboard a new customer, could you have a
set of ledger processors that distribute work per customer ID?
It's possible that consumers might receive a message out of order, due to variable network latency when
sending messages. Consider using sequence numbers to verify ordering. You might also include a special "end
of sequence" flag in the last message of a transaction. Stream processing technologies such as Spark or Azure
Stream Analytics can process messages in order within a time window.

When to use this pattern


Use this pattern when:
You have messages that arrive in order and must be processed in the same order.
Arriving messages are or can be "categorized" in such a way that the category becomes a unit of scale for the
system.
This pattern might not be suitable for:
Extremely high throughput scenarios (millions of messages/minute or second), as the FIFO requirement limits
the scaling that can be done by the system.

Example
On Azure, this pattern can be implemented using Azure Service Bus message sessions. For the consumers, you can
use either Logic Apps with the Service Bus peek-lock connector or Azure Functions with the Service Bus trigger.
For the previous order-tracking example, process each ledger message in the order it's received, and send each
transaction to another queue where the category is set to the order ID. A transaction will never span multiple
orders in this scenario, so consumers process each category in parallel but FIFO within the category.
The ledger processor fans out the messages by de-batching the content of each message in the first queue:

The ledger processor takes care of:


1. Walking the ledger one transaction at a time.
2. Setting the session ID of the message to match the order ID.
3. Sending each ledger transaction to a secondary queue with the session ID set to the order ID.
The consumers listen to the secondary queue where they process all messages with matching order IDs in order
from the queue. Consumers use peek-lock mode.
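A minimal sketch of steps 2 and 3 above, assuming the Azure.Messaging.ServiceBus SDK and a hypothetical session-enabled secondary queue named order-transactions. The order ID is used as the session ID so that messages for the same order are delivered in order to a single consumer.

using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

public class LedgerFanOut
{
    private readonly ServiceBusSender _sender;

    public LedgerFanOut(ServiceBusClient client)
    {
        // Hypothetical name of the session-enabled secondary queue.
        _sender = client.CreateSender("order-transactions");
    }

    public async Task SendTransactionAsync(string orderId, string transactionJson)
    {
        var message = new ServiceBusMessage(transactionJson)
        {
            // Service Bus delivers messages that share a session ID in order
            // to a single session receiver, giving FIFO per order.
            SessionId = orderId
        };

        await _sender.SendMessageAsync(message);
    }
}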
When considering scalability, the ledger queue is a primary bottleneck. Different transactions posted to the ledger
could reference the same order ID. However, after the ledger, the messages can fan out to as many parallel
consumers as there are orders in a serverless environment.

Next steps
The following information may be relevant when implementing this pattern:
Message sessions: first in, first out (FIFO)
Peek-Lock Message (Non-Destructive Read)
In order delivery of correlated messages in Logic Apps by using Service Bus sessions (MSDN blog)
Sharding pattern
11/2/2020 • 19 minutes to read

Divide a data store into a set of horizontal partitions or shards. This can improve scalability when storing and
accessing large volumes of data.

Context and problem


A data store hosted by a single server might be subject to the following limitations:
Storage space. A data store for a large-scale cloud application is expected to contain a huge volume of
data that could increase significantly over time. A server typically provides only a finite amount of disk
storage, but you can replace existing disks with larger ones, or add further disks to a machine as data
volumes grow. However, the system will eventually reach a limit where it isn't possible to easily increase
the storage capacity on a given server.
Computing resources. A cloud application is required to support a large number of concurrent users,
each of which run queries that retrieve information from the data store. A single server hosting the data
store might not be able to provide the necessary computing power to support this load, resulting in
extended response times for users and frequent failures as applications attempting to store and retrieve
data time out. It might be possible to add memory or upgrade processors, but the system will reach a
limit when it isn't possible to increase the compute resources any further.
Network bandwidth. Ultimately, the performance of a data store running on a single server is governed
by the rate the server can receive requests and send replies. It's possible that the volume of network
traffic might exceed the capacity of the network used to connect to the server, resulting in failed requests.
Geography. It might be necessary to store data generated by specific users in the same region as those
users for legal, compliance, or performance reasons, or to reduce latency of data access. If the users are
dispersed across different countries or regions, it might not be possible to store the entire data for the
application in a single data store.
Scaling vertically by adding more disk capacity, processing power, memory, and network connections can
postpone the effects of some of these limitations, but it's likely to only be a temporary solution. A commercial
cloud application capable of supporting large numbers of users and high volumes of data must be able to scale
almost indefinitely, so vertical scaling isn't necessarily the best solution.

Solution
Divide the data store into horizontal partitions or shards. Each shard has the same schema, but holds its own
distinct subset of the data. A shard is a data store in its own right (it can contain the data for many entities of
different types), running on a server acting as a storage node.
This pattern has the following benefits:
You can scale the system out by adding further shards running on additional storage nodes.
A system can use off-the-shelf hardware rather than specialized and expensive computers for each
storage node.
You can reduce contention and improve performance by balancing the workload across shards.
In the cloud, shards can be located physically close to the users that'll access the data.
When dividing a data store up into shards, decide which data should be placed in each shard. A shard typically
contains items that fall within a specified range determined by one or more attributes of the data. These
attributes form the shard key (sometimes referred to as the partition key). The shard key should be static. It
shouldn't be based on data that might change.
Sharding physically organizes the data. When an application stores and retrieves data, the sharding logic directs
the application to the appropriate shard. This sharding logic can be implemented as part of the data access code
in the application, or it could be implemented by the data storage system if it transparently supports sharding.
Abstracting the physical location of the data in the sharding logic provides a high level of control over which
shards contain which data. It also enables data to migrate between shards without reworking the business logic
of an application if the data in the shards need to be redistributed later (for example, if the shards become
unbalanced). The tradeoff is the additional data access overhead required in determining the location of each
data item as it's retrieved.
To ensure optimal performance and scalability, it's important to split the data in a way that's appropriate for the
types of queries that the application performs. In many cases, it's unlikely that the sharding scheme will exactly
match the requirements of every query. For example, in a multi-tenant system an application might need to
retrieve tenant data using the tenant ID, but it might also need to look up this data based on some other
attribute such as the tenant’s name or location. To handle these situations, implement a sharding strategy with a
shard key that supports the most commonly performed queries.
If queries regularly retrieve data using a combination of attribute values, you can likely define a composite shard
key by linking attributes together. Alternatively, use a pattern such as Index Table to provide fast lookup to data
based on attributes that aren't covered by the shard key.

Sharding strategies
Three strategies are commonly used when selecting the shard key and deciding how to distribute data across
shards. Note that there doesn't have to be a one-to-one correspondence between shards and the servers that
host them—a single server can host multiple shards. The strategies are:
The Lookup strategy. In this strategy the sharding logic implements a map that routes a request for data to
the shard that contains that data using the shard key. In a multi-tenant application all the data for a tenant might
be stored together in a shard using the tenant ID as the shard key. Multiple tenants might share the same shard,
but the data for a single tenant won't be spread across multiple shards. The figure illustrates sharding tenant
data based on tenant IDs.
The mapping between the shard key and the physical storage can be based on physical shards where each shard
key maps to a physical partition. Alternatively, a more flexible technique for rebalancing shards is virtual
partitioning, where shard keys map to the same number of virtual shards, which in turn map to fewer physical
partitions. In this approach, an application locates data using a shard key that refers to a virtual shard, and the
system transparently maps virtual shards to physical partitions. The mapping between a virtual shard and a
physical partition can change without requiring the application code be modified to use a different set of shard
keys.
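As a simple illustration of the Lookup strategy, the sharding logic can be reduced to a map that it consults for each request; the tenant IDs and shard names below are hypothetical, and in practice the map would usually live in a root database rather than in code.

using System.Collections.Generic;

public class TenantShardLookup
{
    // Shard key (tenant ID) -> shard that holds the tenant's data.
    // Moving a tenant to another shard only requires updating this map;
    // the application code and shard keys stay the same.
    private readonly Dictionary<string, string> _tenantToShard = new Dictionary<string, string>
    {
        { "tenant-001", "shard-a-connection-string" },
        { "tenant-002", "shard-a-connection-string" },
        { "tenant-003", "shard-b-connection-string" }
    };

    public string GetShardConnectionString(string tenantId)
    {
        return _tenantToShard[tenantId];
    }
}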
The Range strategy. This strategy groups related items together in the same shard, and orders them by shard
key—the shard keys are sequential. It's useful for applications that frequently retrieve sets of items using range
queries (queries that return a set of data items for a shard key that falls within a given range). For example, if an
application regularly needs to find all orders placed in a given month, this data can be retrieved more quickly if
all orders for a month are stored in date and time order in the same shard. If each order was stored in a different
shard, they'd have to be fetched individually by performing a large number of point queries (queries that return
a single data item). The next figure illustrates storing sequential sets (ranges) of data in a shard.
In this example, the shard key is a composite key containing the order month as the most significant element,
followed by the order day and the time. The data for orders is naturally sorted when new orders are created and
added to a shard. Some data stores support two-part shard keys containing a partition key element that
identifies the shard and a row key that uniquely identifies an item in the shard. Data is usually held in row key
order in the shard. Items that are subject to range queries and need to be grouped together can use a shard key
that has the same value for the partition key but a unique value for the row key.
The Hash strategy. The purpose of this strategy is to reduce the chance of hotspots (shards that receive a
disproportionate amount of load). It distributes the data across the shards in a way that achieves a balance
between the size of each shard and the average load that each shard will encounter. The sharding logic
computes the shard to store an item in based on a hash of one or more attributes of the data. The chosen
hashing function should distribute data evenly across the shards, possibly by introducing some random element
into the computation. The next figure illustrates sharding tenant data based on a hash of tenant IDs.
To understand the advantage of the Hash strategy over other sharding strategies, consider how a multi-tenant
application that enrolls new tenants sequentially might assign the tenants to shards in the data store. When
using the Range strategy, the data for tenants 1 to n will all be stored in shard A, the data for tenants n+1 to m
will all be stored in shard B, and so on. If the most recently registered tenants are also the most active, most data
activity will occur in a small number of shards, which could cause hotspots. In contrast, the Hash strategy
allocates tenants to shards based on a hash of their tenant ID. This means that sequential tenants are most likely
to be allocated to different shards, which will distribute the load across them. The previous figure shows this for
tenants 55 and 56.
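A minimal sketch of the Hash strategy follows, assuming a fixed number of physical shards (four here, purely for illustration). A stable hash is computed explicitly because string.GetHashCode isn't guaranteed to be consistent across processes or runtime versions.

using System;
using System.Security.Cryptography;
using System.Text;

public static class HashShardMap
{
    // Hypothetical number of physical shards.
    private const int ShardCount = 4;

    // Returns a shard index in the range [0, ShardCount) for the given shard key.
    public static int GetShardIndex(string shardKey)
    {
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(shardKey));

            // Use the first four bytes of the hash as an unsigned integer.
            uint value = BitConverter.ToUInt32(hash, 0);
            return (int)(value % ShardCount);
        }
    }
}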
The three sharding strategies have the following advantages and considerations:
Lookup. This offers more control over the way that shards are configured and used. Using virtual shards
reduces the impact when rebalancing data because new physical partitions can be added to even out the
workload. The mapping between a virtual shard and the physical partitions that implement the shard can
be modified without affecting application code that uses a shard key to store and retrieve data. Looking
up shard locations can impose an additional overhead.
Range. This is easy to implement and works well with range queries because they can often fetch
multiple data items from a single shard in a single operation. This strategy offers easier data
management. For example, if users in the same region are in the same shard, updates can be scheduled
in each time zone based on the local load and demand pattern. However, this strategy doesn't provide
optimal balancing between shards. Rebalancing shards is difficult and might not resolve the problem of
uneven load if the majority of activity is for adjacent shard keys.
Hash. This strategy offers a better chance of more even data and load distribution. Request routing can
be accomplished directly by using the hash function. There's no need to maintain a map. Note that
computing the hash might impose an additional overhead. Also, rebalancing shards is difficult.
Most common sharding systems implement one of the approaches described above, but you should also
consider the business requirements of your applications and their patterns of data usage. For example, in a
multi-tenant application:
You can shard data based on workload. You could segregate the data for highly volatile tenants in
separate shards. The speed of data access for other tenants might be improved as a result.
You can shard data based on the location of tenants. You can take the data for tenants in a specific
geographic region offline for backup and maintenance during off-peak hours in that region, while the
data for tenants in other regions remains online and accessible during their business hours.
High-value tenants could be assigned their own private, high performing, lightly loaded shards, whereas
lower-value tenants might be expected to share more densely-packed, busy shards.
The data for tenants that need a high degree of data isolation and privacy can be stored on a completely
separate server.

Scaling and data movement operations


Each of the sharding strategies implies different capabilities and levels of complexity for managing scale in, scale
out, data movement, and maintaining state.
The Lookup strategy permits scaling and data movement operations to be carried out at the user level, either
online or offline. The technique is to suspend some or all user activity (perhaps during off-peak periods), move
the data to the new virtual partition or physical shard, change the mappings, invalidate or refresh any caches
that hold this data, and then allow user activity to resume. Often this type of operation can be centrally
managed. The Lookup strategy requires state to be highly cacheable and replica friendly.
The Range strategy imposes some limitations on scaling and data movement operations, which must typically
be carried out when a part or all of the data store is offline because the data must be split and merged across
the shards. Moving the data to rebalance shards might not resolve the problem of uneven load if the majority of
activity is for adjacent shard keys or data identifiers that are within the same range. The Range strategy might
also require some state to be maintained in order to map ranges to the physical partitions.
The Hash strategy makes scaling and data movement operations more complex because the partition keys are
hashes of the shard keys or data identifiers. The new location of each shard must be determined from the hash
function, or the function modified to provide the correct mappings. However, the Hash strategy doesn't require
maintenance of state.

Issues and considerations


Consider the following points when deciding how to implement this pattern:
Sharding is complementary to other forms of partitioning, such as vertical partitioning and functional
partitioning. For example, a single shard can contain entities that have been partitioned vertically, and a
functional partition can be implemented as multiple shards. For more information about partitioning, see
the Data Partitioning Guidance.
Keep shards balanced so they all handle a similar volume of I/O. As data is inserted and deleted, it's
necessary to periodically rebalance the shards to guarantee an even distribution and to reduce the
chance of hotspots. Rebalancing can be an expensive operation. To reduce the necessity of rebalancing,
plan for growth by ensuring that each shard contains sufficient free space to handle the expected volume
of changes. You should also develop strategies and scripts you can use to quickly rebalance shards if this
becomes necessary.
Use stable data for the shard key. If the shard key changes, the corresponding data item might have to
move between shards, increasing the amount of work performed by update operations. For this reason,
avoid basing the shard key on potentially volatile information. Instead, look for attributes that are
invariant or that naturally form a key.
Ensure that shard keys are unique. For example, avoid using autoincrementing fields as the shard key. In
some systems, autoincremented fields can't be coordinated across shards, possibly resulting in items in
different shards having the same shard key.

Autoincremented values in other fields that are not shard keys can also cause problems. For example,
if you use autoincremented fields to generate unique IDs, then two different items located in different
shards might be assigned the same ID.

It might not be possible to design a shard key that matches the requirements of every possible query
against the data. Shard the data to support the most frequently performed queries, and if necessary
create secondary index tables to support queries that retrieve data using criteria based on attributes that
aren't part of the shard key. For more information, see the Index Table pattern.
Queries that access only a single shard are more efficient than those that retrieve data from multiple
shards, so avoid implementing a sharding system that results in applications performing large numbers
of queries that join data held in different shards. Remember that a single shard can contain the data for
multiple types of entities. Consider denormalizing your data to keep related entities that are commonly
queried together (such as the details of customers and the orders that they have placed) in the same
shard to reduce the number of separate reads that an application performs.

If an entity in one shard references an entity stored in another shard, include the shard key for the
second entity as part of the schema for the first entity. This can help to improve the performance of
queries that reference related data across shards.
If an application must perform queries that retrieve data from multiple shards, it might be possible to
fetch this data by using parallel tasks. Examples include fan-out queries, where data from multiple shards
is retrieved in parallel and then aggregated into a single result. However, this approach inevitably adds
some complexity to the data access logic of a solution.
For many applications, creating a larger number of small shards can be more efficient than having a
small number of large shards because they can offer increased opportunities for load balancing. This can
also be useful if you anticipate the need to migrate shards from one physical location to another. Moving
a small shard is quicker than moving a large one.
Make sure the resources available to each shard storage node are sufficient to handle the scalability
requirements in terms of data size and throughput. For more information, see the section “Designing
Partitions for Scalability” in the Data Partitioning Guidance.
Consider replicating reference data to all shards. If an operation that retrieves data from a shard also
references static or slow-moving data as part of the same query, add this data to the shard. The
application can then fetch all of the data for the query easily, without having to make an additional round
trip to a separate data store.

If reference data held in multiple shards changes, the system must synchronize these changes across
all shards. The system can experience a degree of inconsistency while this synchronization occurs. If
you do this, you should design your applications to be able to handle it.

It can be difficult to maintain referential integrity and consistency between shards, so you should
minimize operations that affect data in multiple shards. If an application must modify data across shards,
evaluate whether complete data consistency is actually required. Instead, a common approach in the
cloud is to implement eventual consistency. The data in each partition is updated separately, and the
application logic must take responsibility for ensuring that the updates all complete successfully, as well
as handling the inconsistencies that can arise from querying data while an eventually consistent
operation is running. For more information about implementing eventual consistency, see the Data
Consistency Primer.
Configuring and managing a large number of shards can be a challenge. Tasks such as monitoring,
backing up, checking for consistency, and logging or auditing must be accomplished on multiple shards
and servers, possibly held in multiple locations. These tasks are likely to be implemented using scripts or
other automation solutions, but that might not completely eliminate the additional administrative
requirements.
Shards can be geolocated so that the data that they contain is close to the instances of an application that
use it. This approach can considerably improve performance, but requires additional consideration for
tasks that must access multiple shards in different locations.

When to use this pattern


Use this pattern when a data store is likely to need to scale beyond the resources available to a single storage
node, or to improve performance by reducing contention in a data store.

NOTE
The primary focus of sharding is to improve the performance and scalability of a system, but as a by-product it can also
improve availability due to how the data is divided into separate partitions. A failure in one partition doesn't necessarily
prevent an application from accessing data held in other partitions, and an operator can perform maintenance or
recovery of one or more partitions without making the entire data for an application inaccessible. For more information,
see the Data Partitioning Guidance.
Example
The following example in C# uses a set of SQL Server databases acting as shards. Each database holds a subset
of the data used by an application. The application retrieves data that's distributed across the shards using its
own sharding logic (this is an example of a fan-out query). The details of the data that's located in each shard are
returned by a method called GetShards. This method returns an enumerable list of ShardInformation objects,
where the ShardInformation type contains an identifier for each shard and the SQL Server connection string
that an application should use to connect to the shard (the connection strings aren't shown in the code example).

private IEnumerable<ShardInformation> GetShards()
{
    // This retrieves the connection information from a shard store
    // (commonly a root database).
    return new[]
    {
        new ShardInformation
        {
            Id = 1,
            ConnectionString = ...
        },
        new ShardInformation
        {
            Id = 2,
            ConnectionString = ...
        }
    };
}

The code below shows how the application uses the list of ShardInformation objects to perform a query that
fetches data from each shard in parallel. The details of the query aren't shown, but in this example the data that's
retrieved contains a string that could hold information such as the name of a customer if the shards contain the
details of customers. The results are aggregated into a ConcurrentBag collection for processing by the
application.
// Retrieve the shards as a ShardInformation[] instance.
var shards = GetShards();

var results = new ConcurrentBag<string>();

// Execute the query against each shard in the shard list.
// This list would typically be retrieved from configuration
// or from a root/primary shard store.
Parallel.ForEach(shards, shard =>
{
    // NOTE: Transient fault handling isn't included,
    // but should be incorporated when used in a real world application.
    using (var con = new SqlConnection(shard.ConnectionString))
    {
        con.Open();
        var cmd = new SqlCommand("SELECT ... FROM ...", con);

        Trace.TraceInformation("Executing command against shard: {0}", shard.Id);

        var reader = cmd.ExecuteReader();

        // Read the results into a thread-safe data structure.
        while (reader.Read())
        {
            results.Add(reader.GetString(0));
        }
    }
});

Trace.TraceInformation("Fanout query complete - Record Count: {0}", results.Count);

Related patterns and guidance


The following patterns and guidance might also be relevant when implementing this pattern:
Data Consistency Primer. It might be necessary to maintain consistency for data distributed across different
shards. Summarizes the issues surrounding maintaining consistency over distributed data, and describes the
benefits and tradeoffs of different consistency models.
Data Partitioning Guidance. Sharding a data store can introduce a range of additional issues. Describes these
issues in relation to partitioning data stores in the cloud to improve scalability, reduce contention, and
optimize performance.
Index Table pattern. Sometimes it isn't possible to completely support queries just through the design of the
shard key. Enables an application to quickly retrieve data from a large data store by specifying a key other
than the shard key.
Materialized View pattern. To maintain the performance of some query operations, it's useful to create
materialized views that aggregate and summarize data, especially if this summary data is based on
information that's distributed across shards. Describes how to generate and populate these views.
Sidecar pattern
11/2/2020 • 5 minutes to read

Deploy components of an application into a separate process or container to provide isolation and encapsulation.
This pattern can also enable applications to be composed of heterogeneous components and technologies.
This pattern is named Sidecar because it resembles a sidecar attached to a motorcycle. In the pattern, the sidecar
is attached to a parent application and provides supporting features for the application. The sidecar also shares
the same lifecycle as the parent application, being created and retired alongside the parent. The sidecar pattern is
sometimes referred to as the sidekick pattern and is a decomposition pattern.

Context and Problem


Applications and services often require related functionality, such as monitoring, logging, configuration, and
networking services. These peripheral tasks can be implemented as separate components or services.
If they are tightly integrated into the application, they can run in the same process as the application, making
efficient use of shared resources. However, this also means they are not well isolated, and an outage in one of
these components can affect other components or the entire application. Also, they usually need to be
implemented using the same language as the parent application. As a result, the component and the application
have close interdependence on each other.
If the application is decomposed into services, then each service can be built using different languages and
technologies. While this gives more flexibility, it means that each component has its own dependencies and
requires language-specific libraries to access the underlying platform and any resources shared with the parent
application. In addition, deploying these features as separate services can add latency to the application.
Managing the code and dependencies for these language-specific interfaces can also add considerable
complexity, especially for hosting, deployment, and management.

Solution
Co-locate a cohesive set of tasks with the primary application, but place them inside their own process or
container, providing a homogeneous interface for platform services across languages.

A sidecar service is not necessarily part of the application, but is connected to it. It goes wherever the parent
application goes. Sidecars are supporting processes or services that are deployed with the primary application.
On a motorcycle, the sidecar is attached to one motorcycle, and each motorcycle can have its own sidecar. In the
same way, a sidecar service shares the fate of its parent application. For each instance of the application, an
instance of the sidecar is deployed and hosted alongside it.
Advantages of using a sidecar pattern include:
A sidecar is independent from its primary application in terms of runtime environment and programming
language, so you don't need to develop one sidecar per language.
The sidecar can access the same resources as the primary application. For example, a sidecar can monitor
system resources used by both the sidecar and the primary application.
Because of its proximity to the primary application, there’s no significant latency when communicating
between them.
Even for applications that don’t provide an extensibility mechanism, you can use a sidecar to extend
functionality by attaching it as its own process in the same host or sub-container as the primary
application.
The sidecar pattern is often used with containers and referred to as a sidecar container or sidekick container.

Issues and Considerations


Consider the deployment and packaging format you will use to deploy services, processes, or containers.
Containers are particularly well suited to the sidecar pattern.
When designing a sidecar service, carefully decide on the interprocess communication mechanism. Try to use
language- or framework-agnostic technologies unless performance requirements make that impractical.
Before putting functionality into a sidecar, consider whether it would work better as a separate service or a
more traditional daemon.
Also consider whether the functionality could be implemented as a library or using a traditional extension
mechanism. Language-specific libraries may have a deeper level of integration and less network overhead.

When to Use this Pattern


Use this pattern when:
Your primary application uses a heterogeneous set of languages and frameworks. A component located in a
sidecar service can be consumed by applications written in different languages using different frameworks.
A component is owned by a remote team or a different organization.
A component or feature must be co-located on the same host as the application.
You need a service that shares the overall lifecycle of your main application, but can be independently updated.
You need fine-grained control over resource limits for a particular resource or component. For example, you
may want to restrict the amount of memory a specific component uses. You can deploy the component as a
sidecar and manage memory usage independently of the main application.
This pattern may not be suitable:
When interprocess communication needs to be optimized. Communication between a parent application and
sidecar services includes some overhead, notably latency in the calls. This may not be an acceptable trade-off
for chatty interfaces.
For small applications where the resource cost of deploying a sidecar service for each instance is not worth
the advantage of isolation.
When the service needs to scale differently than, or independently from, the main application. If so, it may be
better to deploy the feature as a separate service.
Example
The sidecar pattern is applicable to many scenarios. Some common examples:
Infrastructure API. The infrastructure development team creates a service that's deployed alongside each
application, instead of a language-specific client library to access the infrastructure. The service is loaded as a
sidecar and provides a common layer for infrastructure services, including logging, environment data,
configuration store, discovery, health checks, and watchdog services. The sidecar also monitors the parent
application's host environment and process (or container) and logs the information to a centralized service.
Manage NGINX/HAProxy. Deploy NGINX with a sidecar service that monitors environment state, then updates
the NGINX configuration file and recycles the process when a change in state is needed.
Ambassador sidecar. Deploy an ambassador service as a sidecar. The application calls through the ambassador,
which handles request logging, routing, circuit breaking, and other connectivity related features.
Offload proxy. Place an NGINX proxy in front of a node.js service instance, to handle serving static file content
for the service.

Related guidance
Ambassador pattern
Static Content Hosting pattern
11/2/2020 • 5 minutes to read

Deploy static content to a cloud-based storage service that can deliver it directly to the client. This can reduce
the need for potentially expensive compute instances.

Context and problem


Web applications typically include some elements of static content. This static content might include HTML pages
and other resources such as images and documents that are available to the client, either as part of an HTML page
(such as inline images, style sheets, and client-side JavaScript files) or as separate downloads (such as PDF
documents).
Although web servers are optimized for dynamic rendering and output caching, they still have to handle requests
to download static content. This consumes processing cycles that could often be put to better use.

Solution
In most cloud hosting environments, you can put some of an application's resources and static pages in a storage
service. The storage service can serve requests for these resources, reducing load on the compute resources that
handle other web requests. The cost for cloud-hosted storage is typically much less than for compute instances.
When hosting some parts of an application in a storage service, the main considerations are related to
deployment of the application and to securing resources that aren't intended to be available to anonymous users.

Issues and considerations


Consider the following points when deciding how to implement this pattern:
The hosted storage service must expose an HTTP endpoint that users can access to download the static
resources. Some storage services also support HTTPS, so it's possible to host resources in storage services
that require SSL.
For maximum performance and availability, consider using a content delivery network (CDN) to cache the
contents of the storage container in multiple datacenters around the world. However, you'll likely have to
pay for using the CDN.
Storage accounts are often geo-replicated by default to provide resiliency against events that might affect a
datacenter. This means that the IP address might change, but the URL will remain the same.
When some content is located in a storage account and other content is in a hosted compute instance, it
becomes more challenging to deploy and update the application. You might have to perform separate
deployments, and version the application and content to manage it more easily—especially when the static
content includes script files or UI components. However, if only static resources have to be updated, they
can simply be uploaded to the storage account without needing to redeploy the application package.
Storage services might not support the use of custom domain names. In this case it's necessary to specify
the full URL of the resources in links because they'll be in a different domain from the dynamically-
generated content containing the links.
The storage containers must be configured for public read access, but it's vital to ensure that they aren't
configured for public write access to prevent users being able to upload content.
Consider using a valet key or token to control access to resources that shouldn't be available anonymously.
See the Valet Key pattern for more information.

When to use this pattern


This pattern is useful for:
Minimizing the hosting cost for websites and applications that contain some static resources.
Minimizing the hosting cost for websites that consist of only static content and resources. Depending on
the capabilities of the hosting provider's storage system, it might be possible to entirely host a fully static
website in a storage account.
Exposing static resources and content for applications running in other hosting environments or on-
premises servers.
Locating content in more than one geographical area using a content delivery network that caches the
contents of the storage account in multiple datacenters around the world.
Monitoring costs and bandwidth usage. Using a separate storage account for some or all of the static
content allows the costs to be more easily separated from hosting and runtime costs.
This pattern might not be useful in the following situations:
The application needs to perform some processing on the static content before delivering it to the client.
For example, it might be necessary to add a timestamp to a document.
The volume of static content is very small. The overhead of retrieving this content from separate storage
can outweigh the cost benefit of separating it out from the compute resource.

Example
Azure Storage supports serving static content directly from a storage container. Files are served through
anonymous access requests. By default, files have a URL in a subdomain of core.windows.net, such as
https://contoso.z4.web.core.windows.net/image.png. You can configure a custom domain name, and use Azure
CDN to access the files over HTTPS. For more information, see Static website hosting in Azure Storage.

Static website hosting makes the files available for anonymous access. If you need to control who can access the
files, you can store files in Azure blob storage and then generate shared access signatures to limit access.
The links in the pages delivered to the client must specify the full URL of the resource. If the resource is protected
with a valet key, such as a shared access signature, this signature must be included in the URL.
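As an illustration of limiting access with a shared access signature, the following sketch generates a time-limited, read-only SAS URL for a single blob by using the Azure.Storage.Blobs SDK. The container name and one-hour validity period are assumptions, and GenerateSasUri requires the client to be authorized with a shared key credential (for example, a connection string that includes the account key).

using System;
using Azure.Storage.Blobs;
using Azure.Storage.Sas;

public static class StaticContentLinks
{
    public static Uri GetReadOnlyUrl(string connectionString, string blobName)
    {
        // Hypothetical container name; matches the "static-content" value used above.
        var blobClient = new BlobServiceClient(connectionString)
            .GetBlobContainerClient("static-content")
            .GetBlobClient(blobName);

        // Read-only access that expires after one hour.
        return blobClient.GenerateSasUri(BlobSasPermissions.Read,
            DateTimeOffset.UtcNow.AddHours(1));
    }
}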
A sample application that demonstrates using external storage for static resources is available on GitHub. This
sample uses configuration files to specify the storage account and container that holds the static content.

<Setting name="StaticContent.StorageConnectionString"
value="UseDevelopmentStorage=true" />
<Setting name="StaticContent.Container" value="static-content" />

The Settings class in the file Settings.cs of the StaticContentHosting.Web project contains methods to extract
these values and build a string value containing the cloud storage account container URL.

public class Settings
{
    public static string StaticContentStorageConnectionString
    {
        get
        {
            return RoleEnvironment.GetConfigurationSettingValue(
                "StaticContent.StorageConnectionString");
        }
    }

    public static string StaticContentContainer
    {
        get
        {
            return RoleEnvironment.GetConfigurationSettingValue("StaticContent.Container");
        }
    }

    public static string StaticContentBaseUrl
    {
        get
        {
            var blobServiceClient = new BlobServiceClient(StaticContentStorageConnectionString);

            return string.Format("{0}/{1}", blobServiceClient.Uri.ToString().TrimEnd('/'),
                StaticContentContainer.TrimStart('/'));
        }
    }
}

The StaticContentUrlHtmlHelper class in the file StaticContentUrlHtmlHelper.cs exposes a method named
StaticContentUrl that generates a URL containing the path to the cloud storage account if the URL passed to it
starts with the ASP.NET root path character (~).

public static class StaticContentUrlHtmlHelper
{
    public static string StaticContentUrl(this HtmlHelper helper, string contentPath)
    {
        if (contentPath.StartsWith("~"))
        {
            contentPath = contentPath.Substring(1);
        }

        contentPath = string.Format("{0}/{1}", Settings.StaticContentBaseUrl.TrimEnd('/'),
            contentPath.TrimStart('/'));

        var url = new UrlHelper(helper.ViewContext.RequestContext);

        return url.Content(contentPath);
    }
}

The file Index.cshtml in the Views\Home folder contains an image element that uses the StaticContentUrl
method to create the URL for its src attribute.

<img src="@Html.StaticContentUrl("~/media/orderedList1.png")" alt="Test Image" />

Related patterns and guidance


Static Content Hosting sample. A sample application that demonstrates this pattern.
Valet Key pattern. If the target resources aren't supposed to be available to anonymous users, use this pattern
to restrict direct access.
Serverless web application on Azure. A reference architecture that uses static website hosting with Azure
Functions to implement a serverless web app.
Strangler pattern
11/2/2020 • 2 minutes to read

Incrementally migrate a legacy system by gradually replacing specific pieces of functionality with new applications
and services. As features from the legacy system are replaced, the new system eventually replaces all of the old
system's features, strangling the old system and allowing you to decommission it.

Context and problem


As systems age, the development tools, hosting technology, and even system architectures they were built on can
become increasingly obsolete. As new features and functionality are added, the complexity of these applications
can increase dramatically, making them harder to maintain or add new features to.
Completely replacing a complex system can be a huge undertaking. Often, you will need a gradual migration to a
new system, while keeping the old system to handle features that haven't been migrated yet. However, running
two separate versions of an application means that clients have to know where particular features are located.
Every time a feature or service is migrated, clients need to be updated to point to the new location.

Solution
Incrementally replace specific pieces of functionality with new applications and services. Create a façade that
intercepts requests going to the backend legacy system. The façade routes these requests either to the legacy
application or the new services. Existing features can be migrated to the new system gradually, and consumers
can continue using the same interface, unaware that any migration has taken place.

This pattern helps to minimize risk from the migration, and spread the development effort over time. With the
façade safely routing users to the correct application, you can add functionality to the new system at whatever
pace you like, while ensuring the legacy application continues to function. Over time, as features are migrated to
the new system, the legacy system is eventually "strangled" and is no longer necessary. Once this process is
complete, the legacy system can safely be retired.
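A minimal sketch of the routing decision a strangler façade makes; the route prefixes and host names are hypothetical, and a real façade would typically be an API gateway or reverse proxy rather than hand-written code.

using System;
using System.Collections.Generic;
using System.Linq;

public class StranglerRouter
{
    // Features that have already been migrated to the new system (hypothetical).
    private readonly List<string> _migratedPrefixes = new List<string>
    {
        "/orders",
        "/customers"
    };

    // Routes a request path to either the new system or the legacy application.
    public Uri Route(string requestPath)
    {
        bool migrated = _migratedPrefixes.Any(prefix =>
            requestPath.StartsWith(prefix, StringComparison.OrdinalIgnoreCase));

        var baseAddress = migrated
            ? new Uri("https://new-system.example.com")
            : new Uri("https://legacy.example.com");

        return new Uri(baseAddress, requestPath);
    }
}

As features migrate, their prefixes are added to the map, and the legacy address eventually receives no traffic.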

Issues and considerations


Consider how to handle services and data stores that are potentially used by both new and legacy systems.
Make sure both can access these resources side-by-side.
Structure new applications and services in a way that they can easily be intercepted and replaced in future
strangler migrations.
At some point, when the migration is complete, the strangler façade will either go away or evolve into an
adaptor for legacy clients.
Make sure the façade keeps up with the migration.
Make sure the façade doesn't become a single point of failure or a performance bottleneck.

When to use this pattern


Use this pattern when gradually migrating a back-end application to a new architecture.
This pattern may not be suitable:
When requests to the back-end system cannot be intercepted.
For smaller systems where the complexity of wholesale replacement is low.

Related guidance
Martin Fowler's blog post on StranglerApplication
Throttling pattern
11/2/2020 • 7 minutes to read

Control the consumption of resources used by an instance of an application, an individual tenant, or an entire
service. This can allow the system to continue to function and meet service level agreements, even when an
increase in demand places an extreme load on resources.

Context and problem


The load on a cloud application typically varies over time based on the number of active users or the types of
activities they are performing. For example, more users are likely to be active during business hours, or the
system might be required to perform computationally expensive analytics at the end of each month. There might
also be sudden and unanticipated bursts in activity. If the processing requirements of the system exceed the
capacity of the resources that are available, it'll suffer from poor performance and can even fail. If the system has
to meet an agreed level of service, such failure could be unacceptable.
There are many strategies available for handling varying load in the cloud, depending on the business goals for
the application. One strategy is to use autoscaling to match the provisioned resources to the user needs at any
given time. This has the potential to consistently meet user demand, while optimizing running costs. However,
while autoscaling can trigger the provisioning of additional resources, this provisioning isn't immediate. If
demand grows quickly, there can be a window of time where there's a resource deficit.

Solution
An alternative strategy to autoscaling is to allow applications to use resources only up to a limit, and then
throttle them when this limit is reached. The system should monitor how it's using resources so that, when
usage exceeds the threshold, it can throttle requests from one or more users. This will enable the system to
continue functioning and meet any service level agreements (SLAs) that are in place. For more information on
monitoring resource usage, see the Instrumentation and Telemetry Guidance.
The system could implement several throttling strategies, including:
Rejecting requests from an individual user who's already accessed system APIs more than n times per
second over a given period of time. This requires the system to meter the use of resources for each tenant
or user running an application. For more information, see the Service Metering Guidance.
Disabling or degrading the functionality of selected nonessential services so that essential services can
run unimpeded with sufficient resources. For example, if the application is streaming video output, it
could switch to a lower resolution.
Using load leveling to smooth the volume of activity (this approach is covered in more detail by the
Queue-based Load Leveling pattern). In a multi-tenant environment, this approach will reduce the
performance for every tenant. If the system must support a mix of tenants with different SLAs, the work
for high-value tenants might be performed immediately. Requests for other tenants can be held back, and
handled when the backlog has eased. The Priority Queue pattern could be used to help implement this
approach.
Deferring operations being performed on behalf of lower priority applications or tenants. These
operations can be suspended or limited, with an exception generated to inform the tenant that the system
is busy and that the operation should be retried later.
The figure shows an area graph for resource use (a combination of memory, CPU, bandwidth, and other factors)
against time for applications that are making use of three features. A feature is an area of functionality, such as a
component that performs a specific set of tasks, a piece of code that performs a complex calculation, or an
element that provides a service such as an in-memory cache. These features are labeled A, B, and C.

The area immediately below the line for a feature indicates the resources that are used by applications when
they invoke this feature. For example, the area below the line for Feature A shows the resources used by
applications that are making use of Feature A, and the area between the lines for Feature A and Feature B
indicates the resources used by applications invoking Feature B. Aggregating the areas for each feature
shows the total resource use of the system.

The previous figure illustrates the effects of deferring operations. Just prior to time T1, the total resources
allocated to all applications using these features reach a threshold (the limit of resource use). At this point, the
applications are in danger of exhausting the resources available. In this system, Feature B is less critical than
Feature A or Feature C, so it's temporarily disabled and the resources that it was using are released. Between
times T1 and T2, the applications using Feature A and Feature C continue running as normal. Eventually, the
resource use of these two features diminishes to the point when, at time T2, there is sufficient capacity to enable
Feature B again.
The autoscaling and throttling approaches can also be combined to help keep the applications responsive and
within SLAs. If the demand is expected to remain high, throttling provides a temporary solution while the system
scales out. At this point, the full functionality of the system can be restored.
The next figure shows an area graph of the overall resource use by all applications running in a system against
time, and illustrates how throttling can be combined with autoscaling.
At time T1, the threshold specifying the soft limit of resource use is reached. At this point, the system can start to
scale out. However, if the new resources don't become available quickly enough, then the existing resources
might be exhausted and the system could fail. To prevent this from occurring, the system is temporarily throttled,
as described earlier. When autoscaling has completed and the additional resources are available, throttling can
be relaxed.

Issues and considerations


You should consider the following points when deciding how to implement this pattern:
Throttling an application, and the strategy to use, is an architectural decision that impacts the entire
design of a system. Throttling should be considered early in the application design process because it isn't
easy to add once a system has been implemented.
Throttling must be performed quickly. The system must be capable of detecting an increase in activity and
react accordingly. The system must also be able to revert to its original state quickly after the load has
eased. This requires that the appropriate performance data is continually captured and monitored.
If a service needs to temporarily deny a user request, it should return a specific error code so the client
application understands that the reason for the refusal to perform an operation is due to throttling. The
client application can wait for a period before retrying the request.
Throttling can be used as a temporary measure while a system autoscales. In some cases it's better to
simply throttle, rather than to scale, if a burst in activity is sudden and isn't expected to be long lived
because scaling can add considerably to running costs.
If throttling is being used as a temporary measure while a system autoscales, and if resource demands
grow very quickly, the system might not be able to continue functioning—even when operating in a
throttled mode. If this isn't acceptable, consider maintaining larger capacity reserves and configuring
more aggressive autoscaling.

When to use this pattern


Use this pattern:
To ensure that a system continues to meet service level agreements.
To prevent a single tenant from monopolizing the resources provided by an application.
To handle bursts in activity.
To help cost-optimize a system by limiting the maximum resource levels needed to keep it functioning.

Example
The final figure illustrates how throttling can be implemented in a multi-tenant system. Users from each of the
tenant organizations access a cloud-hosted application where they fill out and submit surveys. The application
contains instrumentation that monitors the rate at which these users are submitting requests to the application.
In order to prevent the users from one tenant affecting the responsiveness and availability of the application for
all other users, a limit is applied to the number of requests per second the users from any one tenant can submit.
The application blocks requests that exceed this limit.
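A minimal sketch of the per-tenant limit described above, using an in-memory fixed window of one second. The limit of 100 requests per second is an assumption, and a real multi-instance deployment would need shared (distributed) state and would typically reject throttled requests with an HTTP 429 response and a Retry-After header.

using System;
using System.Collections.Concurrent;

public class TenantThrottle
{
    // Hypothetical limit: requests allowed per tenant per one-second window.
    private const int MaxRequestsPerSecond = 100;

    private readonly ConcurrentDictionary<string, (long Window, int Count)> _windows =
        new ConcurrentDictionary<string, (long Window, int Count)>();

    // Returns true if the request is allowed; false if the tenant should be throttled.
    public bool TryAcquire(string tenantId)
    {
        long currentWindow = DateTimeOffset.UtcNow.ToUnixTimeSeconds();

        var entry = _windows.AddOrUpdate(
            tenantId,
            _ => (currentWindow, 1),
            (_, existing) => existing.Window == currentWindow
                ? (existing.Window, existing.Count + 1)
                : (currentWindow, 1));

        return entry.Count <= MaxRequestsPerSecond;
    }
}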

Related patterns and guidance


The following patterns and guidance may also be relevant when implementing this pattern:
Instrumentation and Telemetry Guidance. Throttling depends on gathering information about how heavily a
service is being used. Describes how to generate and capture custom monitoring information.
Service Metering Guidance. Describes how to meter the use of services in order to gain an understanding of
how they are used. This information can be useful in determining how to throttle a service.
Autoscaling Guidance. Throttling can be used as an interim measure while a system autoscales, or to remove
the need for a system to autoscale. Contains information on autoscaling strategies.
Queue-based Load Leveling pattern. Queue-based load leveling is a commonly used mechanism for
implementing throttling. A queue can act as a buffer that helps to even out the rate at which requests sent by
an application are delivered to a service.
Priority Queue pattern. A system can use priority queuing as part of its throttling strategy to maintain
performance for critical or higher value applications, while reducing the performance of less important
applications.
Valet Key pattern
11/2/2020 • 12 minutes to read

Use a token that provides clients with restricted direct access to a specific resource, in order to offload data
transfer from the application. This is particularly useful in applications that use cloud-hosted storage systems or
queues, and can minimize cost and maximize scalability and performance.

Context and problem


Client programs and web browsers often need to read and write files or data streams to and from an
application’s storage. Typically, the application will handle the movement of the data — either by fetching it from
storage and streaming it to the client, or by reading the uploaded stream from the client and storing it in the data
store. However, this approach absorbs valuable resources such as compute, memory, and bandwidth.
Data stores have the ability to handle upload and download of data directly, without requiring that the application
perform any processing to move this data. But, this typically requires the client to have access to the security
credentials for the store. This can be a useful technique to minimize data transfer costs and the requirement to
scale out the application, and to maximize performance. It means, though, that the application is no longer able to
manage the security of the data. After the client has a connection to the data store for direct access, the
application can't act as the gatekeeper. It's no longer in control of the process and can't prevent subsequent
uploads or downloads from the data store.
This isn't a realistic approach in distributed systems that need to serve untrusted clients. Instead, applications
must be able to securely control access to data in a granular way, but still reduce the load on the server by setting
up this connection and then allowing the client to communicate directly with the data store to perform the
required read or write operations.

Solution
You need to resolve the problem of controlling access to a data store where the store can't manage authentication
and authorization of clients. One typical solution is to restrict access to the data store’s public connection and
provide the client with a key or token that the data store can validate.
This key or token is usually referred to as a valet key. It provides time-limited access to specific resources and
allows only predefined operations such as reading and writing to storage or queues, or uploading and
downloading in a web browser. Applications can create and issue valet keys to client devices and web browsers
quickly and easily, allowing clients to perform the required operations without requiring the application to
directly handle the data transfer. This removes the processing overhead, and the impact on performance and
scalability, from the application and the server.
The client uses this token to access a specific resource in the data store for only a specific period, and with specific
restrictions on access permissions, as shown in the figure. After the specified period, the key becomes invalid and
won't allow access to the resource.
It's also possible to configure a key that has other dependencies, such as the scope of the data. For example,
depending on the data store capabilities, the key can specify a complete table in a data store, or only specific rows
in a table. In cloud storage systems the key can specify a container, or just a specific item within a container.
The key can also be invalidated by the application. This is a useful approach if the client notifies the server that
the data transfer operation is complete. The server can then invalidate that key to prevent further access.
Using this pattern can simplify managing access to resources because there's no requirement to create and
authenticate a user, grant permissions, and then remove the user again. It also makes it easy to limit the location,
the permission, and the validity period—all by simply generating a key at runtime. The important factors are to
limit the validity period, and especially the location of the resource, as tightly as possible so that the recipient can
only use it for the intended purpose.

Issues and considerations


Consider the following points when deciding how to implement this pattern:
Manage the validity status and period of the key. If leaked or compromised, the key effectively unlocks the
target item and makes it available for malicious use during the validity period. A key can usually be revoked or
disabled, depending on how it was issued. Server-side policies can be changed, or the server key it was signed
with can be invalidated. Specify a short validity period to minimize the risk of allowing unauthorized operations
to take place against the data store. However, if the validity period is too short, the client might not be able to
complete the operation before the key expires. Allow authorized users to renew the key before the validity period
expires if multiple accesses to the protected resource are required.
Control the level of access the key will provide. Typically, the key should allow the user to only perform the
actions necessary to complete the operation, such as read-only access if the client shouldn't be able to upload
data to the data store. For file uploads, it's common to specify a key that provides write-only permission, as well
as the location and the validity period. It's critical to accurately specify the resource or the set of resources to
which the key applies.
Consider how to control users’ behavior. Implementing this pattern means some loss of control over the
resources users are granted access to. The level of control that can be exerted is limited by the capabilities of the
policies and permissions available for the service or the target data store. For example, it's usually not possible to
create a key that limits the size of the data to be written to storage, or the number of times the key can be used to
access a file. This can result in huge unexpected costs for data transfer, even when used by the intended client,
and might be caused by an error in the code that causes repeated upload or download. To limit the number of
times a file can be uploaded, where possible, force the client to notify the application when one operation has
completed. For example, some data stores raise events the application code can use to monitor operations and
control user behavior. However, it's hard to enforce quotas for individual users in a multi-tenant scenario where
the same key is used by all the users from one tenant.
Validate, and optionally sanitize, all uploaded data. A malicious user that gains access to the key could
upload data designed to compromise the system. Alternatively, authorized users might upload data that's invalid
and, when processed, could result in an error or system failure. To protect against this, ensure that all uploaded
data is validated and checked for malicious content before use.
Audit all operations. Many key-based mechanisms can log operations such as uploads, downloads, and
failures. These logs can usually be incorporated into an audit process, and also used for billing if the user is
charged based on file size or data volume. Use the logs to detect authentication failures that might be caused by
issues with the key provider, or accidental removal of a stored access policy.
Deliver the key securely. It can be embedded in a URL that the user activates in a web page, or it can be used
in a server redirection operation so that the download occurs automatically. Always use HTTPS to deliver the key
over a secure channel.
Protect sensitive data in transit. Sensitive data delivered through the application is usually protected by
SSL or TLS, and this protection should also be enforced for clients accessing the data store directly.
Other issues to be aware of when implementing this pattern are:
If the client doesn't, or can't, notify the server of completion of the operation, and the only limit is the
expiration period of the key, the application won't be able to perform auditing operations such as counting
the number of uploads or downloads, or preventing multiple uploads or downloads.
The flexibility of key policies that can be generated might be limited. For example, some mechanisms only
allow the use of a timed expiration period. Others aren't able to specify a sufficient granularity of
read/write permissions.
If the start time for the key or token validity period is specified, ensure that it's a little earlier than the
current server time to allow for client clocks that might be slightly out of synchronization. The default, if
not specified, is usually the current server time.
The URL containing the key will be recorded in server log files. While the key will typically have expired
before the log files are used for analysis, ensure that you limit access to them. If log data is transmitted to a
monitoring system or stored in another location, consider implementing a delay to prevent leakage of
keys until after their validity period has expired.
If the client code runs in a web browser, the browser might need to support cross-origin resource sharing
(CORS) to enable code that executes within the web browser to access data in a different domain from the
one that served the page. Some older browsers and some data stores don't support CORS, and code that
runs in these browsers might not be able to use a valet key to provide access to data in a different domain,
such as a cloud storage account.

When to use this pattern


This pattern is useful for the following situations:
To minimize resource loading and maximize performance and scalability. Using a valet key doesn't require
the resource to be locked, no remote server call is required, there's no limit on the number of valet keys
that can be issued, and it avoids a single point of failure resulting from performing the data transfer
through the application code. Creating a valet key is typically a simple cryptographic operation of signing
a string with a key.
To minimize operational cost. Enabling direct access to stores and queues is resource and cost efficient, can
result in fewer network round trips, and might allow for a reduction in the number of compute resources
required.
When clients regularly upload or download data, particularly where there's a large volume or when each
operation involves large files.
When the application has limited compute resources available, either due to hosting limitations or cost
considerations. In this scenario, the pattern is even more helpful if there are many concurrent data uploads
or downloads because it relieves the application from handling the data transfer.
When the data is stored in a remote data store or a different datacenter. If the application was required to
act as a gatekeeper, there might be a charge for the additional bandwidth of transferring the data between
datacenters, or across public or private networks between the client and the application, and then between
the application and the data store.
This pattern might not be useful in the following situations:
If the application must perform some task on the data before it's stored or before it's sent to the client. For
example, if the application needs to perform validation, log access success, or execute a transformation on
the data. However, some data stores and clients are able to negotiate and carry out simple transformations
such as compression and decompression (for example, a web browser can usually handle GZip formats).
If the design of an existing application makes it difficult to incorporate the pattern. Using this pattern
typically requires a different architectural approach for delivering and receiving data.
If it's necessary to maintain audit trails or control the number of times a data transfer operation is
executed, and the valet key mechanism in use doesn't support notifications that the server can use to
manage these operations.
If it's necessary to limit the size of the data, especially during upload operations. The only solution to this is
for the application to check the data size after the operation is complete, or check the size of uploads after
a specified period or on a scheduled basis.

Example
Azure supports shared access signatures on Azure Storage for granular access control to data in blobs, tables,
and queues, and for Service Bus queues and topics. A shared access signature token can be configured to provide
specific access rights such as read, write, update, and delete to a specific table; a key range within a table; a queue;
a blob; or a blob container. The validity can be for a specified time period or have no time limit.
Azure shared access signatures also support server-stored access policies that can be associated with a specific
resource such as a table or blob. This feature provides additional control and flexibility compared to application-
generated shared access signature tokens, and should be used whenever possible. Settings defined in a server-
stored policy can be changed and are reflected in the token without requiring a new token to be issued, but
settings defined in the token can't be changed without issuing a new token. This approach also makes it possible
to revoke a valid shared access signature token before it's expired.

For more information, see Introducing Table SAS (Shared Access Signature), Queue SAS and update to Blob
SAS and Using Shared Access Signatures on MSDN.

The following code shows how to create a shared access signature token that's valid for five minutes. The
GetSharedAccessReferenceForUpload method returns a shared access signature token that can be used to upload
a file to Azure Blob Storage.
public class ValuesController : ApiController
{
    private readonly BlobServiceClient blobServiceClient;
    private readonly string blobContainer;
    ...

    /// <summary>
    /// Return a limited access key that allows the caller to upload a file
    /// to this specific destination for a defined period of time.
    /// </summary>
    private StorageEntitySas GetSharedAccessReferenceForUpload(string blobName)
    {
        var blob = blobServiceClient.GetBlobContainerClient(this.blobContainer).GetBlobClient(blobName);

        var storageSharedKeyCredential = new StorageSharedKeyCredential(
            blobServiceClient.AccountName,
            ConfigurationManager.AppSettings["AzureStorageEmulatorAccountKey"]);

        var blobSasBuilder = new BlobSasBuilder
        {
            BlobContainerName = this.blobContainer,
            BlobName = blobName,
            Resource = "b",
            StartsOn = DateTimeOffset.UtcNow.AddMinutes(-5),
            ExpiresOn = DateTimeOffset.UtcNow.AddMinutes(5)
        };

        blobSasBuilder.SetPermissions(BlobSasPermissions.Write);

        var sas = blobSasBuilder.ToSasQueryParameters(storageSharedKeyCredential).ToString();

        return new StorageEntitySas
        {
            BlobUri = blob.Uri,
            Credentials = sas
        };
    }

    public struct StorageEntitySas
    {
        public string Credentials;
        public Uri BlobUri;
    }
}

The complete sample is available in the ValetKey solution available for download from GitHub. The
ValetKey.Web project in this solution contains a web application that includes the ValuesController class
shown above. A sample client application that uses this web application to retrieve a shared access signature
token and upload a file to blob storage is available in the ValetKey.Client project.
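As an illustration only (not the sample's actual code), the following Python sketch shows the client side of the
pattern: request a valet key from a hypothetical token endpoint exposed by the web application, then upload directly
to Blob Storage with it. The endpoint URL and the JSON field names are assumptions.

import requests
from azure.storage.blob import BlobClient

# Hypothetical endpoint exposed by the web application; assumed to return the
# blob URI and the SAS token as JSON, mirroring the StorageEntitySas struct above.
response = requests.get("https://example-app.azurewebsites.net/api/values")
response.raise_for_status()
sas_info = response.json()  # assumed shape: {"blobUri": "...", "credentials": "sv=..."}

# Append the SAS token to the blob URI; the client never sees the account key.
blob_client = BlobClient.from_blob_url(f"{sas_info['blobUri']}?{sas_info['credentials']}")
with open("report.csv", "rb") as data:
    blob_client.upload_blob(data)  # direct upload to storage, bypassing the application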

Next steps
The following patterns and guidance might also be relevant when implementing this pattern:
A sample that demonstrates this pattern is available on GitHub.
Gatekeeper pattern. This pattern can be used in conjunction with the Valet Key pattern to protect applications
and services by using a dedicated host instance that acts as a broker between clients and the application or
service. The gatekeeper validates and sanitizes requests, and passes requests and data between the client and
the application. Can provide an additional layer of security, and reduce the attack surface of the system.
Static Content Hosting pattern. Describes how to deploy static resources to a cloud-based storage service that
can deliver these resources directly to the client to reduce the requirement for expensive compute instances.
Where the resources aren't intended to be publicly available, the Valet Key pattern can be used to secure them.
Introducing Table SAS (Shared Access Signature), Queue SAS and update to Blob SAS
Using Shared Access Signatures
Shared Access Signature Authentication with Service Bus
Artificial intelligence (AI)

Artificial intelligence (AI) is the capability of a computer to imitate intelligent human behavior. Through AI, machines
can analyze images, comprehend speech, interact in natural ways, and make predictions using data.

AI concepts
Algorithm
An algorithm is a sequence of calculations and rules used to solve a problem or analyze a set of data. It is like a flow
chart, with step-by-step instructions for questions to ask, but written in math and programming code. An algorithm
may describe how to determine whether a pet is a cat, dog, fish, bird, or lizard. Another far more complicated
algorithm may describe how to identify a written or spoken language, analyze its words, translate them into a
different language, and then check the translation for accuracy.
Machine learning
Machine learning (ML) is an AI technique that uses mathematical algorithms to create predictive models. An
algorithm is used to parse data fields and to "learn" from that data by using patterns found within it to generate
models. Those models are then used to make informed predictions or decisions about new data.
The predictive models are validated against known data, measured by performance metrics selected for specific
business scenarios, and then adjusted as needed. This process of learning and validation is called training. Through
periodic retraining, ML models are improved over time.
Machine learning at scale
What are the machine learning products at Microsoft?
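As a minimal, illustrative sketch of this training, validation, and prediction cycle, the following example uses
scikit-learn (one of the open-source libraries referenced later in this article); the dataset and algorithm are
placeholders:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)                # training
print("validation accuracy:", accuracy_score(y_test, model.predict(X_test)))  # validation
print("prediction for new data:", model.predict(X_test[:1]))                  # informed prediction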
Deep learning
Deep learning is a type of ML that can determine for itself whether its predictions are accurate. It also uses
algorithms to analyze data, but it does so on a larger scale than ML.
Deep learning uses artificial neural networks, which consist of multiple layers of algorithms. Each layer looks at the
incoming data, performs its own specialized analysis, and produces an output that other layers can understand. This
output is then passed to the next layer, where a different algorithm does its own analysis, and so on.
With many layers in each neural network, and sometimes using multiple neural networks, a machine can learn
through its own data processing. This requires much more data and much more computing power than ML.
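A minimal sketch of such a layered network, using PyTorch (mentioned elsewhere in this guide); the layer sizes are
arbitrary placeholders:

import torch
from torch import nn

# Each Linear/ReLU pair is one layer; its output becomes the next layer's input.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # layer 1: raw input to intermediate features
    nn.Linear(256, 64), nn.ReLU(),    # layer 2: refines the previous layer's output
    nn.Linear(64, 10),                # output layer: one score per class
)

scores = model(torch.randn(32, 784))  # a batch of 32 synthetic inputs
print(scores.shape)                   # torch.Size([32, 10])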
Deep learning versus machine learning
Distributed training of deep learning models on Azure
Batch scoring of deep learning models on Azure
Training of Python scikit-learn and deep learning models on Azure
Real-time scoring of Python scikit-learn and deep learning models on Azure
Bots
A bot is an automated software program designed to perform a particular task. Think of it as a robot without a
body. Early bots were comparatively simple, handling repetitive and voluminous tasks with relatively
straightforward algorithmic logic. An example would be web crawlers used by search engines to automatically
explore and catalog web content.
Bots have become much more sophisticated, using AI and other technologies to mimic human activity and
decision-making, often while interacting directly with humans through text or even speech. Examples include bots
that can take a dinner reservation, chatbots (or conversational AI) that help with customer service interactions, and
social bots that post breaking news or scientific data to social media sites.
Microsoft offers the Azure Bot Service, a managed service purpose-built for enterprise-grade bot development.
About Azure Bot Service
Ten guidelines for responsible bots
Azure reference architecture: Enterprise-grade conversational bot
Example workload: Conversational chatbot for hotel reservations on Azure
Autonomous systems
Autonomous systems are part of an evolving new class that goes beyond basic automation. Instead of performing
a specific task repeatedly with little or no variation (like bots do), autonomous systems bring intelligence to
machines so they can adapt to changing environments to accomplish a desired goal.
Smart buildings use autonomous systems to automatically control operations like lighting, ventilation, air
conditioning, and security. A more sophisticated example would be a self-directed robot exploring a collapsed mine
shaft to thoroughly map its interior, determine which portions are structurally sound, analyze the air for
breathability, and detect signs of trapped miners in need of rescue, all without a human monitoring in real time on
the remote end.
Autonomous systems and solutions from Microsoft AI
General info on Microsoft AI
Learn more about Microsoft AI, and keep up-to-date with related news:
Microsoft AI School
Azure AI platform page
Microsoft AI platform page
Microsoft AI Blog
Microsoft AI on GitHub: Samples, reference architectures, and best practices
Azure Architecture Center

High-level architectural types


Prebuilt AI
Prebuilt AI is exactly what it sounds like: off-the-shelf AI models, services, and APIs that are ready to use. These help
you add intelligence to apps, websites, and flows without having to gather data and then build, train, and publish
your own models.
One example of prebuilt AI might be a pretrained model that can be incorporated as is or used to provide a
baseline for further custom training. Another example would be a cloud-based API service that can be called at will
to process natural language in a desired fashion.
Azure Cognitive Services
Cognitive Services provide developers the opportunity to use prebuilt APIs and integration toolkits to create
applications that can see, hear, speak, understand, and even begin to reason. The catalog of services within
Cognitive Services can be categorized into five main pillars: Vision, Speech, Language, Web Search, and
Decision/Recommendation.
Azure Cognitive Services documentation
Try Azure Cognitive Services for free
Choosing a Microsoft cognitive services technology
Choosing a natural language processing technology in Azure
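As a hedged sketch of calling one of these prebuilt services from Python (using the azure-ai-textanalytics package),
the endpoint and key below are placeholders you would take from your own Cognitive Services resource:

from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

# Placeholder endpoint and key for a Text Analytics resource.
client = TextAnalyticsClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

documents = ["The rooms were spotless and the staff were wonderful."]
result = client.analyze_sentiment(documents)[0]  # prebuilt model; no training required
print(result.sentiment, result.confidence_scores)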
Prebuilt AI models in AI Builder
AI Builder is a new capability in Microsoft Power Platform that provides a point-and-click interface for adding AI to
your apps, even if you have no coding or data science skills. (Some features in AI Builder have not yet released for
general availability and remain in preview status. For more information, refer to the Feature availability by region
page.)
You can build and train your own models, but AI Builder also provides select prebuilt AI models that are ready for
use right away. For example, you can add a component in Microsoft Power Apps based on a prebuilt model that
recognizes contact information from business cards.
Power Apps on Azure
AI Builder documentation
AI model types in AI Builder
Overview of prebuilt AI models in AI Builder
Custom AI
Although prebuilt AI is useful (and increasingly flexible), the best way to get what you need from AI is probably to
build a system yourself. This is obviously a very deep and complex subject, but let's look at some basic concepts
beyond what we've just covered.
Code languages
The core concept of AI is the use of algorithms to analyze data and generate models to describe (or score) it in ways
that are useful. Algorithms are written by developers and data scientists (and sometimes by other algorithms) using
programming code. Two of the most popular programming languages for AI development are currently Python
and R.
Python is a general-purpose, high-level programming language. It has a simple, easy-to-learn syntax that
emphasizes readability. There is no compiling step. Python has a large standard library, but it also supports the
ability to add modules and packages. This encourages modularity and lets you expand capabilities when needed.
There is a large and growing ecosystem of AI and ML libraries for Python, including many that are readily available
in Azure.
Python on Azure product home page
Azure for Python developers
Azure Machine Learning SDK for Python
Introduction to machine learning with Python and Azure Notebooks
scikit-learn. An open-source ML library for Python
PyTorch. An open-source Python library with a rich ecosystem that can be used for deep learning, computer
vision, natural language processing, and more
TensorFlow. An open-source symbolic math library also used for ML applications and neural networks
Tutorial: Apply machine learning models in Azure Functions with Python and TensorFlow
R is a language and environment for statistical computing and graphics. It can be used for everything from
mapping broad social and marketing trends online to developing financial and climate models.
Microsoft has fully embraced the R programming language and provides many different options for R developers
to run their code in Azure.
R developer's guide to Azure
Microsoft R Open. An enhanced distribution of R from Microsoft, fully compatible with R-3.5.3, with
additional capabilities for improved performance and reproducibility, in addition to support for Windows-
and Linux-based platforms
Using R in Azure Machine Learning Studio. A video illustrating how to incorporate your R code in ML studio.
Tutorial: Create a logistic regression model in R with Azure Machine Learning
Training
Training is core to machine learning. It is the iterative process of "teaching" an algorithm to create models, which
are used to analyze data and then make accurate predictions from it. In practice, this process has three general
phases: training, validation, and testing.
During the training phase, a quality set of known data is tagged so that individual fields are identifiable. The tagged
data is fed to an algorithm configured to make a particular prediction. When finished, the algorithm outputs a
model that describes the patterns it found as a set of parameters. During validation, fresh data is tagged and used
to test the model. The algorithm is adjusted as needed and possibly put through more training. Finally, the testing
phase uses real-world data without any tags or preselected targets. Assuming the model's results are accurate, it is
considered ready for use and can be deployed.
Train models with Azure Machine Learning
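A minimal sketch of submitting a training script with the Azure Machine Learning Python SDK (v1); the workspace
configuration, script, environment file, and compute target name are assumptions:

from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()  # reads config.json downloaded from the Azure portal
env = Environment.from_conda_specification("train-env", "environment.yml")

config = ScriptRunConfig(
    source_directory="./src",
    script="train.py",             # your training script (placeholder)
    compute_target="cpu-cluster",  # assumed existing compute cluster
    environment=env,
)

run = Experiment(ws, "train-demo").submit(config)
run.wait_for_completion(show_output=True)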
Hyperparameter tuning

Hyperparameters are data variables that govern the training process itself. They are configuration variables that
control how the algorithm operates. Hyperparameters are thus typically set before model training begins and are
not modified within the training process in the way that parameters are. Hyperparameter tuning involves running
trials within the training task, assessing how well they are getting the job done, and then adjusting as needed. This
process generates multiple models, each trained using different families of hyperparameters.
Tune hyperparameters for your model with Azure Machine Learning
Azure Machine Learning Studio (classic): Tune Model Hyperparameters
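The Azure Machine Learning links above cover hyperparameter tuning at scale; the concept itself can be sketched
with scikit-learn's grid search, where each hyperparameter combination is fixed before a training run and the
best-performing model is kept (illustrative values only):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# C and kernel are hyperparameters: set before training, not learned from the data.
search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=5,  # each trial is validated with cross-validation
)
search.fit(X, y)
print(search.best_params_, search.best_score_)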
Model selection

The process of training and hyperparameter tuning produces numerous candidate models. These can have many
different variances, including the effort needed to prepare the data, the flexibility of the model, the amount of
processing time, and of course the degree of accuracy of its results. Choosing the best trained model for your needs
and constraints is called model selection, but this is as much about preplanning before training as it is about
choosing the one that works best.
Automated machine learning (AutoML)

Automated machine learning, also known as AutoML, is the process of automating the time-consuming, iterative
tasks of machine learning model development. It can significantly reduce the time it takes to get production-ready
ML models. Automated ML can assist with model selection, hyperparameter tuning, model training, and other tasks,
without requiring extensive programming or domain knowledge.
What is automated machine learning?
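A hedged sketch of an automated ML run with the Azure Machine Learning Python SDK (v1); the dataset name, label
column, metric, and timeout are placeholders:

from azureml.core import Experiment, Workspace
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()

automl_config = AutoMLConfig(
    task="classification",
    training_data=ws.datasets["customer-churn"],  # assumed registered tabular dataset
    label_column_name="churned",                  # column to predict (placeholder)
    primary_metric="AUC_weighted",
    experiment_timeout_minutes=30,
)

run = Experiment(ws, "automl-demo").submit(automl_config)
best_run, fitted_model = run.get_output()  # best model found by the automated sweep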
Scoring
Scoring is also called prediction and is the process of generating values based on a trained machine learning
model, given some new input data. The values, or scores, that are created can represent predictions of future values,
but they might also represent a likely category or outcome. The scoring process can generate many different types
of values:
A list of recommended items and a similarity score
Numeric values, for time series models and regression models
A probability value, indicating the likelihood that a new input belongs to some existing category
The name of a category or cluster to which a new item is most similar
A predicted class or outcome, for classification models
Batch scoring is when data is collected during some fixed period of time and then processed in a batch. This
might include generating business reports or analyzing customer loyalty.
Real-time scoring is exactly that: scoring that is ongoing and performed as quickly as possible. The classic
example is credit card fraud detection, but real-time scoring can also be used in speech recognition, medical
diagnoses, market analyses, and many other applications.
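An illustrative batch-scoring sketch (file names and columns are placeholders): a previously trained model is
loaded and applied to a file of records accumulated over some period. A real-time scoring service would wrap the
same predict call in a web endpoint instead.

import joblib
import pandas as pd

model = joblib.load("model.pkl")             # model trained and saved earlier
batch = pd.read_csv("new_transactions.csv")  # records collected over a fixed period

# Score every row; here the score is the probability of the positive class.
batch["fraud_probability"] = model.predict_proba(batch)[:, 1]
batch.to_csv("scored_transactions.csv", index=False)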
General info on custom AI on Azure
Microsoft AI on GitHub: Samples, reference architectures, and best practices
Custom AI on Azure GitHub repo. A collection of scripts and tutorials to help developers effectively use Azure
for their AI workloads
Azure Machine Learning SDK for Python
Azure Machine Learning service example notebooks (Python). A GitHub repo of example notebooks
demonstrating the Azure Machine Learning Python SDK
Azure Machine Learning SDK for R

Azure AI platform offerings


Following is a breakdown of Azure technologies, platforms, and services you can use to develop AI solutions for
your needs.
Azure Machine Learning
This is an enterprise-grade machine learning service to build and deploy models faster. Azure Machine Learning
offers web interfaces and SDKs so you can quickly train and deploy your machine learning models and pipelines at
scale. Use these capabilities with open-source Python frameworks, such as PyTorch, TensorFlow, and scikit-learn.
What are the machine learning products at Microsoft?
Azure Machine Learning product home page
Azure Machine Learning Data Architecture Guide overview
Azure Machine Learning documentation overview
What is Azure Machine Learning? General orientation with links to many learning resources, SDKs,
documentation, and more
Azure Machine Learning versus Azure Machine Learning Studio (classic)
Machine learning reference architectures for Azure
Training of Python scikit-learn and deep learning models on Azure
Distributed training of deep learning models on Azure
Batch scoring of Python machine learning models on Azure
Batch scoring of deep learning models on Azure
Real-time scoring of Python scikit-learn and deep learning models on Azure
Machine learning operationalization (MLOps) for Python models using Azure Machine Learning
Batch scoring of R machine learning models on Azure
Real-time scoring of R machine learning models on Azure
Batch scoring of Spark machine learning models on Azure Databricks
Enterprise-grade conversational bot
Build a real-time recommendation API on Azure
Azure automated machine learning
Azure provides extensive support for automated ML. Developers can build models using a no-code UI or through a
code-first notebooks experience.
Azure automated machine learning product home page
Azure automated ML infographic (PDF)
Tutorial: Create a classification model with automated ML in Azure Machine Learning
Tutorial: Use automated machine learning to predict taxi fares
Configure automated ML experiments in Python
Use the CLI extension for Azure Machine Learning
Automate machine learning activities with the Azure Machine Learning CLI
Azure Cognitive Services
This is a comprehensive family of AI services and cognitive APIs to help you build intelligent apps. These domain-
specific, pretrained AI models can be customized with your data.
Cognitive Services product home page
Azure Cognitive Services documentation
Azure Cognitive Search
This is an AI-powered cloud search service for mobile and web app development. The service can search over
private heterogeneous content, with options for AI enrichment if your content is unstructured or unsearchable in raw
form.
Azure Cognitive Search product home page
Getting started with AI enrichment
Azure Cognitive Search documentation overview
Choosing a natural language processing technology in Azure
Quickstart: Create an Azure Cognitive Search cognitive skill set in the Azure portal
Azure Bot Service
This is a purpose-built bot development environment with out-of-the-box templates to get started quickly.
Azure Bot Service product home page
Azure Bot Service documentation overview
Azure reference architecture: Enterprise-grade conversational bot
Example workload: Conversational chatbot for hotel reservations on Azure
Microsoft Bot Framework
GitHub Bot Builder repo
Apache Spark on Azure
Apache Spark is a parallel processing framework that supports in-memory processing to boost the performance of
big data analytic applications. Spark provides primitives for in-memory cluster computing. A Spark job can load
and cache data into memory and query it repeatedly, which is much faster than disk-based applications, such as
Hadoop.
Apache Spark in Azure HDInsight is the Microsoft implementation of Apache Spark in the cloud. Spark clusters in
HDInsight are compatible with Azure Storage and Azure Data Lake Storage, so you can use HDInsight Spark clusters
to process your data stored in Azure.
The Microsoft Machine Learning library for Apache Spark is MMLSpark (Microsoft ML for Apache Spark). It is an
open-source library that adds many deep learning and data science tools, networking capabilities, and production-
grade performance to the Spark ecosystem. Learn more about MMLSpark features and capabilities.
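A minimal Spark ML sketch of the kind of workload these clusters run, using standard pyspark.ml (which MMLSpark
extends); it assumes the pre-created spark session available in HDInsight or Databricks notebooks:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Tiny in-memory training set; real workloads would read from Azure Storage or Data Lake.
training = spark.createDataFrame(
    [("azure spark tutorial", 1.0), ("unrelated spam text", 0.0)], ["text", "label"]
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    LogisticRegression(maxIter=10),
])

model = pipeline.fit(training)
model.transform(training).select("text", "prediction").show()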
Azure HDInsight overview. Basic information about features, cluster architecture, and use cases, with
pointers to quickstarts and tutorials.
Tutorial: Build an Apache Spark machine learning application in Azure HDInsight
Apache Spark best practices on HDInsight
Configure HDInsight Apache Spark Cluster settings
Machine learning on HDInsight
GitHub repo for MMLSpark: Microsoft Machine Learning library for Apache Spark
Create an Apache Spark machine learning pipeline on HDInsight
Azure Databricks Runtime for Machine Learning
Azure Databricks is an Apache Spark–based analytics platform with one-click setup, streamlined workflows, and an
interactive workspace for collaboration between data scientists, engineers, and business analysts.
Databricks Runtime for Machine Learning (Databricks Runtime ML) lets you start a Databricks cluster with all of the
libraries required for distributed training. It provides a ready-to-go environment for machine learning and data
science. Plus, it contains multiple popular libraries, including TensorFlow, PyTorch, Keras, and XGBoost. It also
supports distributed training using Horovod.
Azure Databricks product home page
Azure Databricks documentation
Machine learning capabilities in Azure Databricks
How-to guide: Databricks Runtime for Machine Learning
Batch scoring of Spark machine learning models on Azure Databricks
Deep learning overview for Azure Databricks

Customer stories
Different industries are applying AI in innovative and inspiring ways. Following are a number of customer case
studies and success stories:
ASOS: Online retailer solves challenges with Azure Machine Learning service
KPMG helps financial institutions save millions in compliance costs with Azure Cognitive Services
Volkswagen: Machine translation speaks Volkswagen – in 40 languages
Buncee: NYC school empowers readers of all ages and abilities with Azure AI
InterSystems: Data platform company boosts healthcare IT by generating critical information at
unprecedented speed
Zencity: Data-driven startup uses funding to help local governments support better quality of life for
residents
Bosch uses IoT innovation to drive traffic safety improvements by helping drivers avoid serious accidents
Automation Anywhere: Robotic process automation platform developer enriches its software with Azure
Cognitive Services
Wix deploys smart, scalable search across 150 million websites with Azure Cognitive Search
Asklepios Klinik Altona: Precision surgeries with Microsoft HoloLens 2 and 3D visualization
AXA Global P&C: Global insurance firm models complex natural disasters with cloud-based HPC
Browse more AI customer stories

Next steps
To learn about the artificial intelligence development products available from Microsoft, refer to
the Microsoft AI platform page.
For training in how to develop AI solutions, refer to Microsoft AI School.
Microsoft AI on GitHub: Samples, reference architectures, and best practices organizes the Microsoft open
source AI-based repositories, providing tutorials and learning materials.
Choosing a Microsoft cognitive services technology

Microsoft cognitive services are cloud-based APIs that you can use in artificial intelligence (AI) applications and
data flows. They provide you with pretrained models that are ready to use in your application, requiring no data
and no model training on your part. The cognitive services are developed by Microsoft's AI and Research team and
leverage the latest deep learning algorithms. They are consumed over HTTP REST interfaces. In addition, SDKs are
available for many common application development frameworks.
The cognitive services include:
Text analysis
Computer vision
Video analytics
Speech recognition and generation
Natural language understanding
Intelligent search
Key benefits:
Minimal development effort for state-of-the-art AI services.
Easy integration into apps via HTTP REST interfaces.
Built-in support for consuming cognitive services in Azure Data Lake Analytics.
Considerations:
Only available over the web. Internet connectivity is generally required. An exception is the Custom Vision
Service, whose trained model you can export for prediction on devices and at the IoT edge.
Although considerable customization is supported, the available services may not suit all predictive
analytics requirements.

What are your options when choosing amongst the cognitive services?
In Azure, there are dozens of Cognitive Services available. The current listing of these is available in a directory
categorized by the functional area they support:
Vision
Speech
Decision
Search
Language

Key Selection Criteria


To narrow the choices, start by answering these questions:
What type of data are you dealing with? Narrow your options based on the type of input data you are
working with. For example, if your input is text, select from the services that have an input type of text.
Do you have the data to train a model? If yes, consider the custom services that enable you to train their
underlying models with data that you provide, for improved accuracy and performance.
Capability matrix
The following tables summarize the key differences in capabilities.
Uses prebuilt models
CAPABILITY | INPUT TYPE | KEY BENEFIT
Text Analytics API | Text | Evaluate sentiment and topics to understand what users want.
Entity Linking API | Text | Power your app's data links with named entity recognition and disambiguation.
Language Understanding Intelligent Service (LUIS) | Text | Teach your apps to understand commands from your users.
QnA Maker Service | Text | Distill FAQ formatted information into conversational, easy-to-navigate answers.
Linguistic Analysis API | Text | Simplify complex language concepts and parse text.
Knowledge Exploration Service | Text | Enable interactive search experiences over structured data via natural language inputs.
Web Language Model API | Text | Use predictive language models trained on web-scale data.
Academic Knowledge API | Text | Tap into the wealth of academic content in the Microsoft Academic Graph populated by Bing.
Bing Autosuggest API | Text | Give your app intelligent autosuggest options for searches.
Bing Spell Check API | Text | Detect and correct spelling mistakes in your app.
Translator Text API | Text | Machine translation.
Recommendations API | Text | Predict and recommend items your customers want.
Bing Entity Search API | Text (web search query) | Identify and augment entity information from the web.
Bing Image Search API | Text (web search query) | Search for images.
Bing News Search API | Text (web search query) | Search for news.
Bing Video Search API | Text (web search query) | Search for videos.
Bing Web Search API | Text (web search query) | Get enhanced search details from billions of web documents.
Bing Speech API | Text or Speech | Convert speech to text and back again.
Speaker Recognition API | Speech | Use speech to identify and authenticate individual speakers.
Translator Speech API | Speech | Perform real-time speech translation.
Computer Vision API | Images (or frames from video) | Distill actionable information from images, automatically create description of photos, derive tags, recognize celebrities, extract text, and create accurate thumbnails.
Content Moderator | Text, Images or Video | Automated image, text, and video moderation.
Emotion API | Images (photos with human subjects) | Identify the range of emotions of human subjects.
Face API | Images (photos with human subjects) | Detect, identify, analyze, organize, and tag faces in photos.
Video Indexer | Video | Video insights such as sentiment, transcript speech, translate speech, recognize faces and emotions, and extract keywords.

Trained with custom data you provide


CAPABILITY | INPUT TYPE | KEY BENEFIT
Custom Vision Service | Images (or frames from video) | Customize your own computer vision models.
Custom Speech Service | Speech | Overcome speech recognition barriers like speaking style, background noise, and vocabulary.
Custom Decision Service | Web content (for example, RSS feed) | Use machine learning to automatically select the appropriate content for your home page.
Bing Custom Search API | Text (web search query) | Commercial-grade search tool.
Compare the machine learning products and
technologies from Microsoft

Learn about the machine learning products and technologies from Microsoft. Compare options to help you
choose how to most effectively build, deploy, and manage your machine learning solutions.

Cloud-based machine learning products


The following options are available for machine learning in the Azure cloud.

CLOUD OPTIONS | WHAT IT IS | WHAT YOU CAN DO WITH IT
Azure Machine Learning | Managed platform for machine learning | Use a pretrained model. Or, train, deploy, and manage models on Azure using Python and CLI
Azure Cognitive Services | Pre-built AI capabilities implemented through REST APIs and SDKs | Build intelligent applications quickly using standard programming languages. Doesn't require machine learning and data science expertise
Azure SQL Managed Instance Machine Learning Services | In-database machine learning for SQL | Train and deploy models inside Azure SQL Managed Instance
Machine learning in Azure Synapse Analytics | Analytics service with machine learning | Train and deploy models inside Azure Synapse Analytics
Machine learning and AI with ONNX in Azure SQL Edge | Machine learning in SQL on IoT | Train and deploy models inside Azure SQL Edge
Azure Databricks | Apache Spark-based analytics platform | Build and deploy models and data workflows using integrations with open-source machine learning libraries and the MLFlow platform.

On-premises machine learning products


The following options are available for machine learning on-premises. On-premises servers can also run in a
virtual machine in the cloud.

ON-PREMISES OPTIONS | WHAT IT IS | WHAT YOU CAN DO WITH IT
SQL Server Machine Learning Services | In-database machine learning for SQL | Train and deploy models inside SQL Server
Machine Learning Services on SQL Server Big Data Clusters | Machine learning in Big Data Clusters | Train and deploy models on SQL Server Big Data Clusters

Development platforms and tools


The following development platforms and tools are available for machine learning.

PLATFORMS/TOOLS | WHAT IT IS | WHAT YOU CAN DO WITH IT
Azure Data Science Virtual Machine | Virtual machine with pre-installed data science tools | Develop machine learning solutions in a pre-configured environment
ML.NET | Open-source, cross-platform machine learning SDK | Develop machine learning solutions for .NET applications
Windows ML | Windows 10 machine learning platform | Evaluate trained models on a Windows 10 device
MMLSpark | Open-source, distributed, machine learning and microservices framework for Apache Spark | Create and deploy scalable machine learning applications for Scala and Python.
Machine Learning extension for Azure Data Studio | Open-source and cross-platform machine learning extension for Azure Data Studio | Manage packages, import machine learning models, make predictions, and create notebooks to run experiments for your SQL databases

Azure Machine Learning


Azure Machine Learning is a fully managed cloud service used to train, deploy, and manage machine learning
models at scale. It fully supports open-source technologies, so you can use tens of thousands of open-source
Python packages such as TensorFlow, PyTorch, and scikit-learn. Rich tools are also available, such as Compute
instances, Jupyter notebooks, or the Azure Machine Learning for Visual Studio Code extension, a free extension
that allows you to manage your resources, model training workflows and deployments in Visual Studio Code.
Azure Machine Learning includes features that automate model generation and tuning with ease, efficiency, and
accuracy.
Use Python SDK, Jupyter notebooks, R, and the CLI for machine learning at cloud scale. For a low-code or no-code
option, use Azure Machine Learning's interactive designer in the studio to easily and quickly build, test, and
deploy models using pre-built machine learning algorithms.
Try the free or paid version of Azure Machine Learning.

Type | Cloud-based machine learning solution
Supported languages | Python, R
Machine learning phases | Model training; Deployment; MLOps/Management
Key benefits | Code first (SDK) and studio & drag-and-drop designer web interface authoring options. Central management of scripts and run history, making it easy to compare model versions. Easy deployment and management of models to the cloud or edge devices.
Considerations | Requires some familiarity with the model management model.

Azure Cognitive Services


Azure Cognitive Services is a set of pre-built APIs that enable you to build apps that use natural methods of
communication. The term pre-built suggests that you do not need to bring datasets or data science expertise to
train models to use in your applications. That's all done for you and packaged as APIs and SDKs that allow your
apps to see, hear, speak, understand, and interpret user needs with just a few lines of code. You can easily add
intelligent features to your apps, such as:
Vision - Object detection, face recognition, OCR, etc. See Computer Vision, Face, Form Recognizer.
Speech - Speech-to-text, text-to-speech, speaker recognition, etc. See Speech Service.
Language - Translation, Sentiment analysis, key phrase extraction, language understanding, etc. See Translator,
Text Analytics, Language Understanding, QnA Maker
Decision - Anomaly detection, content moderation, reinforcement learning. See Anomaly Detector, Content
Moderator, Personalizer.
Use Cognitive Services to develop apps across devices and platforms. The APIs keep improving, and are easy to
set up.

Type | APIs for building intelligent applications
Supported languages | Various options depending on the service. Standard ones are C#, Java, JavaScript, and Python.
Machine learning phases | Deployment
Key benefits | Build intelligent applications using pre-trained models available through REST API and SDK. Variety of models for natural communication methods with vision, speech, language, and decision. No machine learning or data science expertise required.

SQL machine learning


SQL machine learning adds statistical analysis, data visualization, and predictive analytics in Python and R for
relational data, both on-premises and in the cloud. Current platforms and tools include:
SQL Server Machine Learning Services
Machine Learning Services on SQL Server Big Data Clusters
Azure SQL Managed Instance Machine Learning Services
Machine learning in Azure Synapse Analytics
Machine learning and AI with ONNX in Azure SQL Edge
Machine Learning extension for Azure Data Studio
Use SQL machine learning when you need built-in AI and predictive analytics on relational data in SQL.

Type | On-premises predictive analytics for relational data
Supported languages | Python, R, SQL
Machine learning phases | Data preparation; Model training; Deployment
Key benefits | Encapsulate predictive logic in a database function, making it easy to include in data-tier logic.
Considerations | Assumes a SQL database as the data tier for your application.

Azure Data Science Virtual Machine


The Azure Data Science Virtual Machine is a customized virtual machine environment on the Microsoft Azure
cloud. It is available in versions for both Windows and Ubuntu Linux. The environment is built specifically for
doing data science and developing ML solutions. It has many popular data science tools, ML frameworks, and other
tools pre-installed and pre-configured to jump-start building intelligent applications for advanced analytics.
Use the Data Science VM when you need to run or host your jobs on a single node. Or if you need to remotely
scale up your processing on a single machine.

Type | Customized virtual machine environment for data science
Key benefits | Reduced time to install, manage, and troubleshoot data science tools and frameworks. The latest versions of all commonly used tools and frameworks are included. Virtual machine options include highly scalable images with GPU capabilities for intensive data modeling.
Considerations | The virtual machine cannot be accessed when offline. Running a virtual machine incurs Azure charges, so you must be careful to have it running only when required.

Azure Databricks
Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services
platform. Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive
workspace that enables collaboration between data scientists, data engineers, and business analysts. Use Python,
R, Scala, and SQL code in web-based notebooks to query, visualize, and model data.
Use Databricks when you want to collaborate on building machine learning solutions on Apache Spark.

Type | Apache Spark-based analytics platform
Supported languages | Python, R, Scala, SQL
Machine learning phases | Data preparation; Data preprocessing; Model training; Model tuning; Model inference; Management; Deployment

ML.NET
ML.NET is an open-source and cross-platform machine learning framework. With ML.NET, you can build custom
machine learning solutions and integrate them into your .NET applications. ML.NET offers varying levels of
interoperability with popular frameworks like TensorFlow and ONNX for training and scoring machine learning
and deep learning models. For resource-intensive tasks like training image classification models, you can take
advantage of Azure to train your models in the cloud.
Use ML.NET when you want to integrate machine learning solutions into your .NET applications. Choose between
the API for a code-first experience and Model Builder or the CLI for a low-code experience.

Type | Open-source cross-platform framework for developing custom machine learning applications with .NET
Languages supported | C#, F#
Machine learning phases | Data preparation; Training; Deployment
Key benefits | Data science & ML experience not required; Use familiar tools (Visual Studio, VS Code) and languages; Deploy where .NET runs; Extensible; Scalable; Local-first experience

Windows ML
Windows ML inference engine allows you to use trained machine learning models in your applications, evaluating
trained models locally on Windows 10 devices.
Use Windows ML when you want to use trained machine learning models within your Windows applications.

Type | Inference engine for trained models in Windows devices
Languages supported | C#/C++, JavaScript

MMLSpark
Microsoft ML for Apache Spark (MMLSpark) is an open-source library that expands the distributed computing
framework Apache Spark. MMLSpark adds many deep learning and data science tools to the Spark ecosystem,
including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK),
LightGBM, LIME (Model Interpretability), and OpenCV. You can use these tools to create powerful predictive
models on any Spark cluster, such as Azure Databricks or Cosmic Spark.
MMLSpark also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users
can embed any web service into their SparkML models. Additionally, MMLSpark provides easy-to-use tools for
orchestrating Azure Cognitive Services at scale. For production-grade deployment, the Spark Serving project
enables high throughput, submillisecond latency web services, backed by your Spark cluster.

Type | Open-source, distributed machine learning and microservices framework for Apache Spark
Languages supported | Scala 2.11, Java, Python 3.5+, R (beta)
Machine learning phases | Data preparation; Model training; Deployment
Key benefits | Scalability; Streaming + Serving compatible; Fault-tolerance
Considerations | Requires Apache Spark

Next steps
To learn about all the Artificial Intelligence (AI) development products available from Microsoft, see Microsoft
AI platform
For training in developing AI and Machine Learning solutions with Microsoft, see Microsoft Learn
Machine learning at scale

Machine learning (ML) is a technique used to train predictive models based on mathematical algorithms. Machine
learning analyzes the relationships between data fields to predict unknown values.
Creating and deploying a machine learning model is an iterative process:
Data scientists explore the source data to determine relationships between features and predicted labels.
The data scientists train and validate models based on appropriate algorithms to find the optimal model for
prediction.
The optimal model is deployed into production, as a web service or some other encapsulated function.
As new data is collected, the model is periodically retrained to improve its effectiveness.
Machine learning at scale addresses two different scalability concerns. The first is training a model against large
data sets that require the scale-out capabilities of a cluster to train. The second centers on operationalizing the
learned model so it can scale to meet the demands of the applications that consume it. Typically this is
accomplished by deploying the predictive capabilities as a web service that can then be scaled out.
Machine learning at scale has the benefit that it can produce powerful, predictive capabilities because better
models typically result from more data. Once a model is trained, it can be deployed as a stateless, highly
performant scale-out web service.

Model preparation and training


During the model preparation and training phase, data scientists explore the data interactively using languages like
Python and R to:
Extract samples from high volume data stores.
Find and treat outliers, duplicates, and missing values to clean the data.
Determine correlations and relationships in the data through statistical analysis and visualization.
Generate new calculated features that improve the predictiveness of statistical relationships.
Train ML models based on predictive algorithms.
Validate trained models using data that was withheld during training.
To support this interactive analysis and modeling phase, the data platform must enable data scientists to explore
data using a variety of tools. Additionally, the training of a complex machine learning model can require a lot of
intensive processing of high volumes of data, so sufficient resources for scaling out the model training is essential.

Model deployment and consumption


When a model is ready to be deployed, it can be encapsulated as a web service and deployed in the cloud, to an
edge device, or within an enterprise ML execution environment. This deployment process is referred to as
operationalization.
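A hedged sketch of operationalizing a registered model as a web service with the Azure Machine Learning Python
SDK (v1); the model name, scoring script, environment file, and service name are placeholders:

from azureml.core import Environment, Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, name="fraud-model")  # assumed previously registered model

inference_config = InferenceConfig(
    entry_script="score.py",  # defines init() and run(data) for scoring requests
    environment=Environment.from_conda_specification("score-env", "environment.yml"),
)
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(ws, "fraud-scoring", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)  # REST endpoint consumed by applications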

Challenges
Machine learning at scale produces a few challenges:
You typically need a lot of data to train a model, especially for deep learning models.
You need to prepare these big data sets before you can even begin training your model.
The model training phase must access the big data stores. It's common to perform the model training using the
same big data cluster, such as Spark, that is used for data preparation.
For scenarios such as deep learning, not only will you need a cluster that can provide you scale-out on CPUs,
but your cluster will need to consist of GPU-enabled nodes.

Machine learning at scale in Azure


Before deciding which ML services to use in training and operationalization, consider whether you need to train a
model at all, or if a prebuilt model can meet your requirements. In many cases, using a prebuilt model is just a
matter of calling a web service or using an ML library to load an existing model. Some options include:
Use the web services provided by Azure Cognitive Services.
Use the pretrained neural network models provided by the Cognitive Toolkit.
Embed the serialized models provided by Core ML for an iOS app.
If a prebuilt model does not fit your data or your scenario, options in Azure include Azure Machine Learning,
HDInsight with Spark MLlib and MMLSpark, Azure Databricks, Cognitive Toolkit, and SQL Machine Learning
Services. If you decide to use a custom model, you must design a pipeline that includes model training and
operationalization.
For a list of technology choices for ML in Azure, see:
Choosing a cognitive services technology
Choosing a machine learning technology
Choosing a natural language processing technology

Next steps
The following reference architectures show machine learning scenarios in Azure:
Batch scoring on Azure for deep learning models
Real-time scoring of Python Scikit-Learn and Deep Learning Models on Azure
Choosing a natural language processing technology
in Azure

Natural language processing (NLP) is used for tasks such as sentiment analysis, topic detection, language
detection, key phrase extraction, and document categorization.

NLP can be used to classify documents, such as labeling documents as sensitive or spam. The output of NLP can be
used for subsequent processing or search. Another use for NLP is to summarize text by identifying the entities
present in the document. These entities can also be used to tag documents with keywords, which enables search
and retrieval based on content. Entities might be combined into topics, with summaries that describe the
important topics present in each document. The detected topics may be used to categorize the documents for
navigation, or to enumerate related documents given a selected topic. Another use for NLP is to score text for
sentiment, to assess the positive or negative tone of a document. These approaches use many techniques from
natural language processing, such as:
Tokenizer. Splitting the text into words or phrases.
Stemming and lemmatization. Normalizing words so that different forms map to the canonical word with
the same meaning. For example, "running" and "ran" map to "run."
Entity extraction. Identifying subjects in the text.
Part of speech detection. Identifying text as a verb, noun, participle, verb phrase, and so on.
Sentence boundary detection. Detecting complete sentences within paragraphs of text.
When using NLP to extract information and insight from free-form text, the starting point is typically the raw
documents stored in object storage such as Azure Storage or Azure Data Lake Store.

Challenges
Processing a collection of free-form text documents is typically both computationally intensive and time consuming.
Without a standardized document format, it can be difficult to achieve consistently accurate results using free-
form text processing to extract specific facts from a document. For example, think of a text representation of an
invoice—it can be difficult to build a process that correctly extracts the invoice number and invoice date for
invoices across any number of vendors.
What are your options when choosing an NLP service?
In Azure, the following services provide natural language processing (NLP) capabilities:
Azure HDInsight with Spark and Spark MLlib
Azure Databricks
Microsoft Cognitive Services

Key selection criteria


To narrow the choices, start by answering these questions:
Do you want to use prebuilt models? If yes, consider using the APIs offered by Microsoft Cognitive Services.
Do you need to train custom models against a large corpus of text data? If yes, consider using Azure
HDInsight with Spark MLlib and Spark NLP.
Do you need low-level NLP capabilities like tokenization, stemming, lemmatization, and term frequency/inverse document frequency (TF/IDF)? If yes, consider using Azure HDInsight with Spark MLlib and Spark NLP (a minimal sketch follows this list).
Do you need simple, high-level NLP capabilities like entity and intent identification, topic detection, spell
check, or sentiment analysis? If yes, consider using the APIs offered by Microsoft Cognitive Services.
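The sketch below shows the Spark MLlib route for the low-level capabilities mentioned above, chaining tokenization, stop word removal, and TF/IDF into a single pipeline. It assumes a SparkSession on an HDInsight or Databricks cluster; Spark NLP features such as stemming and lemmatization are not shown.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF

spark = SparkSession.builder.appName("nlp-tfidf").getOrCreate()

docs = spark.createDataFrame(
    [(0, "Natural language processing extracts insight from free-form text"),
     (1, "Spark MLlib scales tokenization and TF/IDF across a cluster")],
    ["id", "text"],
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),                        # split text into words
    StopWordsRemover(inputCol="words", outputCol="filtered"),             # drop stop words
    HashingTF(inputCol="filtered", outputCol="tf", numFeatures=1 << 14),  # term frequency
    IDF(inputCol="tf", outputCol="tfidf"),                                # inverse document frequency
])

model = pipeline.fit(docs)
model.transform(docs).select("id", "tfidf").show(truncate=False)
```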

Capability matrix
The following tables summarize the key differences in capabilities.
General capabilities
| Capability | Azure HDInsight | Microsoft Cognitive Services |
| --- | --- | --- |
| Provides pretrained models as a service | No | Yes |
| REST API | Yes | Yes |
| Programmability | Python, Scala, Java | C#, Java, Node.js, Python, PHP, Ruby |
| Support processing of big data sets and large documents | Yes | No |

Low-level natural language processing capabilities


| Capability | Azure HDInsight | Microsoft Cognitive Services |
| --- | --- | --- |
| Tokenizer | Yes (Spark NLP) | Yes (Linguistic Analysis API) |
| Stemmer | Yes (Spark NLP) | No |
| Lemmatizer | Yes (Spark NLP) | No |
| Part of speech tagging | Yes (Spark NLP) | Yes (Linguistic Analysis API) |
| Term frequency/inverse-document frequency (TF/IDF) | Yes (Spark MLlib) | No |
| String similarity (edit distance calculation) | Yes (Spark MLlib) | No |
| N-gram calculation | Yes (Spark MLlib) | No |
| Stop word removal | Yes (Spark MLlib) | No |

High-level natural language processing capabilities


| Capability | Azure HDInsight | Microsoft Cognitive Services |
| --- | --- | --- |
| Entity/intent identification and extraction | No | Yes (Language Understanding Intelligent Service (LUIS) API) |
| Topic detection | Yes (Spark NLP) | Yes (Text Analytics API) |
| Spell checking | Yes (Spark NLP) | Yes (Bing Spell Check API) |
| Sentiment analysis | Yes (Spark NLP) | Yes (Text Analytics API) |
| Language detection | No | Yes (Text Analytics API) |
| Supports multiple languages besides English | No | Yes (varies by API) |

See also
Natural language processing
R developer's guide to Azure
11/13/2020 • 8 minutes to read

Many data scientists dealing with ever-increasing volumes of data are looking for ways to
harness the power of cloud computing for their analyses. This article provides an
overview of the various ways that data scientists can use their existing skills with the R
programming language in Azure.
Microsoft has fully embraced the R programming language as a first-class tool for data
scientists. By providing many different options for R developers to run their code in
Azure, the company is enabling data scientists to extend their data science workloads into the cloud when tackling
large-scale projects.
Let's examine the various options and the most compelling scenarios for each one.

Azure services with R language support


This article covers the following Azure services that support the R language:

| Service | Description |
| --- | --- |
| Azure Machine Learning Server | Enterprise software for data science, providing R and Python interpreters |
| Data Science Virtual Machine | A customized VM to use as a data science workstation or as a custom compute target |
| ML Services on HDInsight | Cluster-based system for running R analyses on large datasets across many nodes |
| Azure Databricks | Collaborative Spark environment that supports R and other languages |
| Azure Machine Learning | Cloud service that you use to train, deploy, automate, and manage machine learning models |
| Azure Machine Learning Studio (classic) | Run custom R scripts in Azure's machine learning experiments |
| Azure Batch | Offers a variety of options for economically running R code across many nodes in a cluster |
| Azure Notebooks | A no-cost, cloud-based version of Jupyter notebooks |
| Azure SQL Database | Run R scripts inside of the SQL Server database engine |

Data Science Virtual Machine


The Data Science Virtual Machine (DSVM) is a customized VM image on Microsoft’s Azure cloud platform built
specifically for doing data science. It has many popular data science tools, including:
Microsoft R Open
Microsoft Machine Learning Server
RStudio Desktop
RStudio Server
The DSVM can be provisioned with either Windows or Linux as the operating system. You can use the DSVM in two
different ways: as an interactive workstation or as a compute platform for a custom cluster.
As a workstation
If you want to get started with R in the cloud quickly and easily, this is your best bet. The environment will be
familiar to anyone who has worked with R on a local workstation. However, instead of using local resources, the R
environment runs on a VM in the cloud. If your data is already stored in Azure, this has the added benefit of
allowing your R scripts to run "closer to the data." Instead of transferring the data across the Internet, the data can
be accessed over Azure's internal network, which provides much faster access times.
The DSVM can be particularly useful to small teams of R developers. Instead of investing in powerful workstations
for each developer and requiring team members to synchronize on which versions of the various software
packages they will use, each developer can spin up an instance of the DSVM whenever needed.
As a compute platform
In addition to being used as a workstation, the DSVM is also used as an elastically scalable compute platform for R
projects. Using the AzureDSVM R package, you can programmatically control the creation and deletion of DSVM
instances. You can form the instances into a cluster and deploy a distributed analysis to be performed in the cloud.
This entire process can be controlled by R code running on your local workstation.
To learn more about the DSVM, see Introduction to Azure Data Science Virtual Machine for Linux and Windows.

ML Services on HDInsight
Microsoft ML Services provide data scientists, statisticians, and R programmers with on-demand access to scalable,
distributed methods of analytics on HDInsight. This solution provides the latest capabilities for R-based analytics on
datasets of virtually any size, loaded to either Azure Blob or Data Lake storage.
This is an enterprise-grade solution that allows you to scale your R code across a cluster. By using functions in
Microsoft's RevoScaleR package, your R scripts on HDInsight can run data processing functions in parallel across
many nodes in a cluster. This allows R to crunch data on a much larger scale than is possible with single-threaded R
running on a workstation.
This ability to scale makes ML Services on HDInsight a great option for R developers with massive data sets. It
provides a flexible and scalable platform for running your R scripts in the cloud.
For a walk-through on creating an ML Services cluster, see Get started with ML Services on Azure HDInsight.

Azure Databricks
Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services
platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click
setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data
engineers, and business analysts.
The collaboration in Databricks is enabled by the platform's notebook system. Users can create, share, and edit
notebooks with other users of the system. These notebooks allow users to write code that executes against Spark
clusters managed in the Databricks environment. These notebooks fully support R and give users access to Spark
through both the SparkR and sparklyr packages.
Since Databricks is built on Spark and has a strong focus on collaboration, the platform is often used by teams of
data scientists that work together on complex analyses of large data sets. Because the notebooks in Databricks
support other languages in addition to R, it is especially useful for teams where analysts use different languages
for their primary work.
The article What is Azure Databricks? can provide more details about the platform and help you get started.

Azure Machine Learning


Azure Machine Learning can be used for any kind of machine learning, from classical machine learning to deep learning, and from supervised to unsupervised learning. Whether you prefer to write Python or R code or use zero-code/low-code options such as the designer, you can build, train, and track highly accurate machine learning and deep learning models in an Azure Machine Learning workspace.
Start training on your local machine and then scale out to the cloud. Train your first model in R with Azure Machine
Learning today.

Azure Machine Learning Studio (classic)


Azure Machine Learning Studio (classic) is a collaborative, drag-and-drop tool you can use to build, test, and deploy
predictive analytics solutions in the cloud. It enables emerging data scientists to create and deploy machine
learning models without the need to write much code.
Azure Machine Learning Studio (classic) supports both R and Python.
Customers currently using or evaluating Azure Machine Learning Studio (classic) are encouraged to try the
designer in Azure Machine Learning, which provides drag-n-drop ML modules plus scalability, version control, and
enterprise security.

Azure Batch
For large-scale R jobs, you can use Azure Batch. This service provides cloud-scale job scheduling and compute
management so you can scale your R workload across tens, hundreds, or thousands of virtual machines. Since it is
a generalized computing platform, there are a few options for running R jobs on Azure Batch.
One option is to use Microsoft's doAzureParallel package. This R package is a parallel backend for the foreach
package. It allows each iteration of the foreach loop to run in parallel on a node within the Azure Batch cluster. For
an introduction to the package, see the blog post doAzureParallel: Take advantage of Azure’s flexible compute
directly from your R session.
Another option for running an R script in Azure Batch is to bundle your code with "RScript.exe" as a Batch App in
the Azure portal. For a detailed walkthrough, see R Workloads on Azure Batch.
A third option is to use the Azure Distributed Data Engineering Toolkit (AZTK), which allows you to provision on-
demand Spark clusters using Docker containers in Azure Batch. This provides an economical way to run Spark jobs
in Azure. By using SparklyR with AZTK, your R scripts can be scaled out in the cloud easily and economically.

Azure Notebooks
Azure Notebooks is a low-cost, low-friction method for R developers who prefer working with notebooks to bring
their code to Azure. It is a free service for anyone to develop and run code in their browser using Jupyter, which is
an open-source project that enables combining markdown prose, executable code, and graphics onto a single canvas.
The free service tier of Azure Notebooks is a viable option for small-scale projects, as it limits each notebook's process to 4 GB of memory and data sets to 1 GB. If you need compute and data power beyond these limitations,
however, you can run notebooks in a Data Science Virtual Machine instance. For more information, see Manage
and configure Azure Notebooks projects - Compute tier.

Azure SQL Database


Azure SQL Database is Microsoft's intelligent, fully managed relational cloud database service. It allows you to use
the full power of SQL Server without the hassle of setting up infrastructure. This includes Machine Learning
Services in SQL Server, which is one of the more recent additions to SQL.
This feature offers an embedded, predictive analytics and data science engine that can execute R code within a SQL
Server database as stored procedures, as T-SQL scripts containing R statements, or as R code containing T-SQL.
Instead of extracting data from the database and loading it into the R environment, you load your R code directly
into the database and let it run right alongside the data.
While Machine Learning Services has been part of on-premises SQL Server since 2016, it is relatively new to Azure
SQL Database. It is currently in limited preview but will continue to evolve.
Next steps
Running your R code on Azure with mrsdeploy
Machine Learning Server in the Cloud
Additional Resources for Machine Learning Server and Microsoft R
R on Azure: an overview of packages, tools, and case studies for using R with Azure

The R logo is © 2016 The R Foundation and is used under the terms of the Creative Commons Attribution-ShareAlike 4.0 International license.
AI at the Edge with Azure Stack Hub
11/13/2020 • 2 minutes to read

Solution Idea

With the Azure AI tools and cloud platform, the next generation of AI-enabled hybrid applications can run where
your data lives. With Azure Stack Hub, bring a trained AI model to the edge and integrate it with your applications
for low-latency intelligence, with no tool or process changes for local applications.

Architecture

Download an SVG of this architecture.

Data Flow
1. Data scientists train a model using Azure Machine Learning workbench and an HDInsight cluster. The model is
containerized and put into an Azure Container Registry.
2. The model is deployed to a Kubernetes cluster on Azure Stack Hub.
3. End users provide data that's scored against the model.
4. Insights and anomalies from scoring are placed into a queue.
5. A function sends compliant data and anomalies to Azure Storage.
6. Globally relevant and compliant insights are available in the global app.
7. Data from edge scoring is used to improve the model.

Components
Azure Machine Learning: Build, deploy, and manage predictive analytics solutions
HDInsight: Provision cloud Hadoop, Spark, R Server, HBase, and Storm clusters
Container Registry: Store and manage container images across all types of Azure deployments
Azure Kubernetes Service (AKS): Simplify the deployment, management, and operations of Kubernetes
Storage: Durable, highly available, and massively scalable cloud storage
Azure Stack Hub: Build and run innovative hybrid applications across cloud boundaries

Next steps
Azure Machine Learning documentation
HDInsight documentation
Container Registry documentation
Azure Kubernetes Service (AKS) documentation
Storage documentation
Azure Stack Hub documentation
AI enrichment with Azure Cognitive Search
11/2/2020 • 4 minutes to read

Large, unstructured datasets like the JFK Files, which contain over 34,000 pages of documents about the CIA
investigation of the 1963 JFK assassination, include typewritten and handwritten notes, photos and diagrams, and
other unstructured data that standard search solutions can't parse.
AI enrichment in Azure Cognitive Search can extract and enhance searchable, indexable text from images, blobs,
and other unstructured data sources like the JFK Files by using pre-trained machine learning skillsets from the
Cognitive Services Computer Vision and Text Analytics APIs. You can also create and attach custom skills to add
special processing for domain-specific data like CIA Cryptonyms. Azure Cognitive Search can then index and search
the content.
Image processing skills like Optical Character Recognition (OCR), Read, and image analysis include object
and face detection, tag and caption generation, and celebrity and landmark identification. These skills create
text representations of image content, which are searchable using the query capabilities of Azure Cognitive
Search.
Natural language processing skills like entity recognition, language detection, key phrase extraction, and text
recognition map unstructured text to searchable and filterable fields in an index.
This example solution uses Azure Cognitive Search AI enrichment to extract meaning from the original complex,
unstructured JFK Files dataset. You can work through the project, watch the process in action in an online video, or
explore the JFK Files with an online demo.

Potential use cases


Increase the value and utility of unstructured text and image content in search and data science apps.
Use custom skills to integrate open-source, third-party, or first-party code into indexing pipelines.
Make scanned BMP or JPG documents full-text searchable.
Produce better outcomes than standard PDF text extraction for PDFs with combined image and text.
Create new information from inherently meaningful raw content or context that's hidden in larger unstructured
or semi-structured documents.

Architecture
This diagram illustrates the process of passing unstructured data through the Cognitive Search skills pipeline to
produce structured, indexable data.
1. Blob storage provides unstructured document and image data to Cognitive Search.
2. Cognitive Search applies pre-built cognitive skillsets to the data, including OCR, text and handwriting
recognition, image analysis, entity recognition, and full-text search.
3. The Cognitive Search extensibility mechanism uses an Azure Function to apply the CIA Cryptonyms custom skill
to the data.
4. The pre-built and custom skillsets deliver structured knowledge that Azure Cognitive Search can index.

Components
Azure Cognitive Search works with other Azure components to provide this solution.
Azure Blob Storage
Azure Blob Storage is REST-based object storage for data that you can access from anywhere in the world via
HTTPS. You can use Blob storage to expose data publicly to the world, or to store application data privately. Blob
storage is ideal for large amounts of unstructured data like text or graphics.
Azure Cognitive Search
Cognitive Search indexes the content and powers the user experience. You use Cognitive Search capabilities to
apply pre-built cognitive skills to the content, and use the extensibility mechanism to add custom skills.
The Computer Vision API uses text recognition APIs to extract and recognize text information from images.
Read uses the latest recognition models, and is optimized for large, text-heavy documents and noisy images.
OCR isn't optimized for large documents, but supports more languages. The current example uses OCR to
produce data in the hOCR format.
The Text Analytics API extracts text information from unstructured documents by using capabilities like
Named Entity Recognition (NER), key phrase extraction, and full-text search.
Custom skills extend Cognitive Search to apply specific enrichment transformations to content. The current
example creates a custom skill to apply CIA Cryptonyms, which decode uppercased code names in CIA
documents. For example, the CIA assigned the cryptonym GPFLOOR to Lee Harvey Oswald, so the custom CIA
Cryptonym skill links any JFK files containing that cryptonym with Oswald.
Azure Functions
Azure Functions is a serverless compute service that lets you run small pieces of event-triggered code without
having to explicitly provision or manage infrastructure. This example uses an Azure Function method to apply the
CIA Cryptonyms list to the JFK Files as a custom skill.
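A custom skill is an HTTP endpoint that follows a simple JSON contract: Cognitive Search posts a values array of records, and the function returns a values array keyed by recordId. The following Azure Functions (Python) sketch illustrates that contract only; the tiny in-memory cryptonym table is an illustrative stand-in, not the project's actual skill code.

```python
import json
import azure.functions as func

# Tiny illustrative stand-in for the full CIA Cryptonyms lookup.
CRYPTONYMS = {"GPFLOOR": "Lee Harvey Oswald"}

def main(req: func.HttpRequest) -> func.HttpResponse:
    body = req.get_json()
    results = []
    for record in body.get("values", []):
        text = record.get("data", {}).get("text", "") or ""
        matches = [
            {"cryptonym": code, "meaning": name}
            for code, name in CRYPTONYMS.items()
            if code in text.upper()
        ]
        results.append({
            "recordId": record["recordId"],   # must echo the incoming recordId
            "data": {"cryptonyms": matches},
            "errors": [],
            "warnings": [],
        })
    return func.HttpResponse(
        json.dumps({"values": results}),
        mimetype="application/json",
    )
```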
Azure App Service
The example solution also builds a standalone web app in Azure App Service for testing, demonstrating, searching
the index, and exploring connections in the enriched and indexed documents.

Issues and considerations


The code project and demo showcase a particular Cognitive Search use case. This example isn't intended to be a
framework or scalable architecture for all scenarios, but to provide a general guideline and example.
OCR results vary greatly depending on scan and image quality. Read uses the latest recognition models, but has
less language support than OCR.
Some scanned and native PDF formats may not parse correctly in Cognitive Search.
The JFK Files sample project and demo create a public website and publicly readable storage container for
extracted images, so don't use this solution with non-public data.

Next steps
The JFK Files lab
Azure Cognitive Search
Get started with AI enrichment
Computer Vision
Text Analytics
Recognize printed and handwritten text
How to use Named Entity Recognition in Text Analytics
Azure Blob storage
Azure Functions
AI for Earth
11/2/2020 • 7 minutes to read

The earth currently faces urgent and accelerating environmental challenges. Climate change, biodiversity and
habitat loss, pollution, overpopulation, and food and fresh water shortages need quick and comprehensive
mitigation, but can seem so overwhelming that philosopher Timothy Morton calls them hyperobjects, entities so
vast and distributed that they're difficult to fully define or comprehend.
Technology may have contributed to many of these issues, but we need technology to be able to conceptualize or
solve them. Artificial intelligence (AI) is one tool that can help us both understand and potentially mitigate global
environmental issues.
AI approaches to environmental challenges require substantial amounts of data, computing power, specialized
tools, and expertise. The Microsoft AI for Earth initiative provides people and organizations with AI and cloud tools
like open data sets, cloud compute grants, open-source APIs, and education to help them address global
environmental challenges.
This article showcases the public Microsoft AI for Earth APIs, and how they can work with Azure services and
resources to help provide conservation solutions.

Use cases
AI for Earth projects are helping the environment today.
Basemap
SilviaTerra used the Land Cover Mapping API to help develop a high-resolution inventory for every forested acre in
the continental US. Basemap plots tree coverage, tree species and size, tree value, carbon sequestration, and habitat
suitability, delivering insight to forest stakeholders from governments and conservation groups to large and small
landowners.
Through an AI for Earth grant, Microsoft AI for Earth worked with SilviaTerra to combine their forestry expertise
with Azure Open Datasets, Azure Storage, and Azure Machine Learning, and scale their coverage to hundreds of
millions of acres. With better inventories, it's easier for conservationists to target their efforts, and frequent updates
make short-term conservation projects and commitments feasible.
Wildbook
Wild Me used the Species Classification API, computer vision, and deep learning algorithms to create the open-
source Wildbook platform. Fighting species extinction from poaching, habitat loss, and climate change requires
huge amounts of data on population counts, location, birth rates, and migration patterns. Gathering and annotating
this data manually is time- and cost-prohibitive, making citizen engagement critical to data collection. On
Wildbooks like the Wildbook for Whale Sharks, the public can report their observations, and adopt and follow
individual animals. Scientists can use the aggregated data to help inform conservation decisions.
Microsoft hosts Wildbook on Azure and makes Wild Me's open-source algorithms available as APIs. AI for Earth
worked with Wild Me to scale Wildbook to the Azure cloud, allowing them to handle an ever-increasing number of
supported species.
For more information about Wildbook, see the following links:
Video: Fighting extinction with Microsoft AI and citizen science
Wildbook open-source code projects
Architecture

1. Species observations come from biologists and citizen scientists.


2. Observational data, datasets, AI data, and third-party data feed into data management servers in the cloud.
3. From the data servers, the latest data and AI models update the image analysis server, which applies constantly
retrained cognitive skills.
4. All communication is bi-directional, feeding back to data collection and user interface enhancements.

Components
The AI for Earth APIs work with Azure components to provide solutions.
Land Cover Mapping API
Environmental scientists use satellite and aerial imagery to understand land use patterns. Strategic conservation
planning depends on understanding land use, in particular the impacts of climate change and human population
expansion on natural resources. However, distilling imagery into actionable data in the form of land cover maps has
required extensive manual annotation. Land Cover Mapping uses computer vision and machine learning to classify
imagery into natural or human-made terrain types, providing high-resolution land cover information for precision
conservation planning.
The public Land Cover Mapping API can take a supplied TIFF or JPEG 1 meter-resolution aerial or satellite image, or
any United States latitude and longitude data, and return an image file showing land cover classifications. The land
cover classification image shows areas classified into water, forest, field, or built coverage. You can analyze and
make decisions based on the land cover predicted in the image. The API also allows you to train your model by
correcting the predictions and adding and training new classifications.
Land Cover training demo
Land Cover Mapping video
Land Cover Mapping source code
Species Classification API
The Species Classification API lets developers use deep learning models for recognizing plants and animals.
Wildlife conservation depends on accurate, up-to-date wildlife population estimates, but population surveys often
depend on humans to annotate millions of images. The public Species Classification API helps automate citizen-
scientist observations by identifying plants and animals in images from over 5000 species. The API has a single
endpoint that takes an image as input and returns a predicted species and the confidence of the prediction.
Species Classification API demo.
Species Classification source code
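The request/response pattern looks broadly like the sketch below. Every detail here is a placeholder for illustration only: the endpoint URL, authentication header, and response field names come from the AI for Earth API documentation and the key you receive by email, not from this article.

```python
import requests

# Placeholders for illustration only; substitute the real endpoint, header, and key
# from the AI for Earth API documentation.
API_URL = "https://<species-classification-endpoint>/predict"
API_KEY = "<your-api-key>"

with open("observation.jpg", "rb") as image_file:
    response = requests.post(
        API_URL,
        headers={"Ocp-Apim-Subscription-Key": API_KEY},
        data=image_file.read(),
    )
response.raise_for_status()

prediction = response.json()
# Illustrative shape: a predicted species name and the confidence of the prediction.
print(prediction.get("species"), prediction.get("confidence"))
```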
Azure Open Datasets
Azure Open Datasets are free, curated public datasets that help you train machine learning models and enrich
predictive solutions. The datasets cover public-domain weather, census, holidays, public safety, and location data.
You can share or request other public datasets through Azure Open Datasets.
Azure Open Datasets preprocesses data to save you time. At regular intervals, Open Datasets pulls data from the
sources, such as by an FTP connection to the National Oceanic and Atmospheric Administration (NOAA). After
parsing the data into a structured format, Open Datasets enriches it with features like ZIP Code or location of the
nearest weather station.
With an Azure account, you can access Open Datasets through the Azure portal or through APIs. Colocating with
Azure compute and other services makes access and manipulation easier. Open Datasets are also available through
the Azure Machine Learning UI and SDK. Open Datasets also provides Azure Notebooks and Azure Databricks
notebooks you can use to connect data to Azure Machine Learning and Azure Databricks. You can also access Open
Datasets through a Python SDK.
You don't need an Azure account to get Open Datasets. You can access them from any Python environment, with or
without Spark.
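For example, the following sketch pulls a week of NOAA ISD weather data into a pandas DataFrame through the Open Datasets Python SDK. It assumes the azureml-opendatasets package is installed and uses the NOAA weather dataset only as one example.

```python
from datetime import datetime
from azureml.opendatasets import NoaaIsdWeather

# Pull one week of NOAA ISD weather data from Azure Open Datasets into pandas.
# Assumes the azureml-opendatasets package is installed; no Azure account is required.
weather = NoaaIsdWeather(
    start_date=datetime(2020, 1, 1),
    end_date=datetime(2020, 1, 7),
)
df = weather.to_pandas_dataframe()

print(df.shape)
print(df.head())
```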
Overview of Azure Open Datasets
Full catalog of Azure Open Datasets
Azure Notebooks
Azure Notebooks is a free service for anyone to develop and run code in their browser using Jupyter. Jupyter is an
open-source project that enables combining markdown prose, executable code, and graphics onto a single canvas
called a notebook. Notebooks are an excellent way for data scientists and AI engineers to share and collaborate in
the cloud.
Other components
These Azure components can contribute to an AI for Earth solution:
Azure Blob Storage stores raw or preprocessed data.
Cognitive Service APIs like Computer Vision extract and process structured data and text from unstructured
sources.
High-performing Linux virtual machines (VMs) scale out to process and analyze data as needed. Data Science
Virtual Machines come preconfigured with data science modeling and machine learning tools.
Azure Backup backs up VMs and protects critical data.
Container Registry stores containers for apps and data.
Application Insights monitors servers and services for performance and outages.
Azure DevOps monitors and manages operations during development.

Issues and considerations


Azure Open Datasets can consume hundreds of terabytes of resources, so it's best to perform large-scale
processing in the same Azure datacenter where you store the data. If you're using Open Datasets for
environmental science applications, consider applying for an AI for Earth grant to support your compute
requirements.
Azure Notebooks is currently in public preview. The preview version doesn't have a service level agreement, and
isn't recommended for production workloads. The preview version doesn't support some features, and may
have constrained capabilities.
Some US addresses might not work in the demo Land Cover Mapping Notebook. If the address you supply
doesn't work, try another address.
The Species Classification API version in the demo notebook might misidentify some species. The current
version of the API is 2.0.

Deployment
The hands-on demo Azure Notebooks showcase the AI for Earth Land Cover Mapping and Species
Classification APIs.
Prerequisites
Get keys for the Land Cover Mapping and Species Classification APIs by emailing [email protected].
Have a Microsoft or Azure DevOps organizational account.
To run the demo Notebooks:
1. Open the Land Cover Mapping or Species Classification project.
2. Select Sign in at upper right, and after signing in with your Microsoft or Azure account, select Clone .
3. On the pop-up screen, select I trust the contents of this project , and then select Clone again.
4. After you clone the project, select Run on Free Compute .
5. On the next screen, select the .ipynb file.
6. In the Notebook, paste the API Key you received into the appropriate placeholder.
7. Select each cell in turn, and then select Run to run the cell. You can ignore any pip warnings that appear when
installing packages.
8. Follow the prompts to enter different addresses for the Land Cover Mapping demo, or different species for the
Species Classification demo, and see the APIs in action.

Next steps
If you have an idea for an environmental solution, consider applying for a Microsoft AI for Earth grant. The
four focus areas for AI for Earth grants are agriculture, biodiversity, climate change, and water issues. Grants
are for Azure compute credits or for data labeling services, an important prerequisite for most AI projects.
Grantees can also get technical advice and support, online Azure training materials, and networking and
educational opportunities. For more information, see AI for Earth's grant process and details.
See the other initiatives in Microsoft AI for Good, providing technology, resources, and expertise to help
solve humanitarian issues and create a more sustainable and accessible world.
Auditing, risk, and compliance management
11/2/2020 • 2 minutes to read

In the ever-changing world of regulations, organizations face the challenge of staying on top of audits and
compliance. Mistakes in contracts and record-keeping can have serious financial ramifications. At the enterprise
level, teams of lawyers might not be enough to catch everything. Knowledge mining can provide helpful assistance
for organizations looking to stay above board.
For most organizations, the legal department faces the challenge of reviewing thousands of pages of
documentation. Developers could use knowledge mining to help attorneys quickly identify entities of importance
from discovery documents and flag important ideas across documents.

Data Flow
There are three steps: ingest, enrich, and explore. First, unstructured and structured data is ingested. Next, the data is enriched with AI to extract information. Finally, the newly structured data is explored through search, existing business applications, or analytics solutions.
1. The user can ingest different types of content, such as affidavits, meeting minutes, operating agreements, non-disclosure agreements, privacy policies, terms of use, memorandums of understanding, licensing agreements, letters of intent, powers of attorney, deeds, discovery documentation, company bylaws, bank statements, legal agreements, balance sheets, income statements, cash flow statements, company disclosures, SEC documents, annual reports, and transcripts from shareholder meetings.
2. This content is enriched by using key phrase extraction, language detection, language translation, entity extraction (organizations and people), custom models that identify certain regulatory obligations, and custom models that identify specific legal terms and clauses.
3. Finally, the user can index the data in a searchable internal application, or use the data in a searchable web application for financial risks.

Components
Key technologies used to implement tools for technical content review and research:
Azure Cognitive Search
Microsoft Text Analytics API
Microsoft Translator Text API
Web API custom skill interface

Next Steps
Using the knowledge mining solution accelerator to build an initial knowledge mining prototype with Azure
Cognitive Search.
Building custom skills with Microsoft's Custom Web API
Machine teaching for autonomous systems
11/2/2020 • 7 minutes to read

Artificial intelligence (AI) and machine learning (ML) offer unique opportunities and challenges for automating
complex industrial systems. Machine teaching is a new paradigm for building ML systems that moves the focus
away from algorithms and onto successful model generation and deployment.
Machine teaching infuses subject matter expertise into automated AI system training with deep reinforcement
learning (DRL) and simulations. Abstracting away AI complexity to focus on subject matter expertise and real-world
conditions creates models that turn automated control systems into autonomous systems.
Autonomous systems:
Combine human domain knowledge with AI and ML through machine teaching.
Automate generation and management of DRL algorithms and models.
Integrate simulations for model optimization and scalability during training.
Deploy and scale for real-world use.
The Microsoft Autonomous Systems platform is an innovative framework for building, training, and deploying
models by using machine teaching and simulations. Use the Autonomous Systems platform to help automate
systems when:
Existing control systems are fragile when deployed.
ML logic doesn't adequately cover all scenarios.
Describing the desired system behavior requires subject matter experts who understand the problem domain.
Generating sufficient real-world data to cover all scenarios is difficult or impossible.
Traditional control systems are difficult to deploy and scale to the real world.
Machine teaching bridges AI science and software with traditional engineering and domain expertise. Example
applications include motion control, machine calibration, smart buildings, industrial robotics, and process control.

Architecture
The Microsoft Autonomous Systems platform manages the full end-to-end machine teaching lifecycle.
Development and deployment have three phases: Build, Train, and Deploy.
1. The Build phase consists of writing the machine teaching program and connecting to a domain-specific training
simulator. Simulators generate sufficient training data for experiments and machine practice.
2. In the Train phase, the training engine automates DRL model generation and training by combining high-level
domain models with appropriate DRL algorithms and neural networks.
3. The Deploy phase deploys the trained brain to the target application in the cloud, on-premises, or embedded on
site. Specific SDKs and deployment APIs deploy trained AI systems to various target applications, perform
machine tuning, and control the physical systems.
Subject matter experts with no AI background can break down their expertise into steps and tasks, criteria, and
desired outcomes. Engineers building autonomous systems create accurate, detailed models of systems and
environments, and make them intelligent using methods like deep learning, imitation learning, and reinforcement
learning.
Simulations help train the models across different kinds of environmental conditions and scenarios, much faster
and safer than is feasible in the real world. Experts can supervise the agents as they work to solve problems in
simulated environments, and provide feedback and guidance that lets the agents dynamically adapt within the
simulation.
After training is complete, engineers deploy these trained agents to the real world, where they use their knowledge
to power autonomous systems.

Machine teaching service


Project Bonsai is the machine teaching service for the Microsoft Autonomous Systems platform. Bonsai simplifies
machine teaching with deep reinforcement learning (DRL) to train and deploy smarter autonomous systems.
You can use Bonsai to:
Teach adaptive brains with intuitive goals and learning objectives, real-time success assessments, and automatic
versioning control.
Integrate training simulations that implement real-world problems and provide realistic feedback.
Export trained brains and deploy them on-premises, in the cloud, or to IoT Edge or embedded devices.
In Bonsai, managed Azure graphics processing unit (GPU) clusters run AI training on complex neural networks at
scale, with built-in support for retraining and analyzing AI system versions. The deployment and runtime
frameworks package and deploy the resulting AI system models at scale.
The Bonsai platform runs on Azure and charges resource costs to your Azure subscription:
Azure Container Registry (basic tier) for storing exported brains and uploaded simulators.
Azure Container Instances for running simulations.
Azure Storage for storing uploaded simulators as ZIP files.
Inkling
Inkling is a declarative, statically-typed programming language for training AI in Bonsai. Inkling abstracts away the
dynamic AI algorithms that require expertise in machine learning, enabling more developers to program AI. An
Inkling file defines concepts necessary to teach the AI, and curriculum, or methods for teaching the concepts.

For more information about Inkling, see the Inkling programming language reference.
Training engine
The training engine in Bonsai compiles machine teaching programs to automatically generate and train AI systems.
The engine:
Automates model generation, management, and tuning.
Chooses neural network architecture such as number of layers and topology, selects the best DRL algorithm,
and tunes hyper-parameters of the model.
Connects to the simulator and orchestrates the training.
Broadly similar to how a software compiler hides the bare metal machine code from the high-level programmer,
the training engine hides much of the complexity and details of the ML models and DRL algorithms. As the state of
the art in AI evolves and new algorithms and network topologies are invented, the training engine can recompile
the same machine teaching programs to exploit these technological advances.
Cartpole sample
Bonsai includes a couple of machine teaching samples, Cartpole and Moab.
The Cartpole sample has a pole attached by an unactuated joint to a cart, which moves along a frictionless track.
The available sensor information includes the cart position and velocity, and the pole angle and angular velocity.
Applying a force to the cart controls the system. The supported agent actions are to push the cart to the left or the
right.
The pole starts upright, and the goal is to keep it upright while keeping the cart on the track. Every time interval
that the pole remains upright generates a reward. A training episode ends when the pole is over 15 degrees from
vertical, or the cart moves more than a predefined number of units from the center.
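The reward and termination rule just described can be restated as a short, framework-neutral Python sketch. This is not Bonsai or Inkling code; the 2.4-unit track limit is an assumed value for illustration, since the sample defines its own bounds.

```python
# Plain-Python restatement of the Cartpole episode rules described above.
MAX_POLE_ANGLE_DEGREES = 15.0
MAX_CART_DISTANCE = 2.4  # assumed track limit for illustration

def step_outcome(pole_angle_degrees: float, cart_position: float):
    """Return (reward, done) for one time interval of a Cartpole episode."""
    done = (
        abs(pole_angle_degrees) > MAX_POLE_ANGLE_DEGREES
        or abs(cart_position) > MAX_CART_DISTANCE
    )
    reward = 0.0 if done else 1.0  # every interval the pole stays upright earns a reward
    return reward, done

print(step_outcome(pole_angle_degrees=3.0, cart_position=0.5))   # (1.0, False)
print(step_outcome(pole_angle_degrees=20.0, cart_position=0.5))  # (0.0, True)
```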
The sample uses Inkling language to write the machine teaching program, and the provided Cartpole simulator to
speed and improve the training.

The following Bonsai screenshot shows Cartpole training progress, with Goal satisfaction on the y-axis and
Training iterations on the x-axis. The dashboard also shows the percentage of goal satisfaction and the total
elapsed training time.
For more information about the cartpole example or to try it yourself, see:
Quickstart: Balance a pole with AI (Cartpole)
Learn how you can teach an AI agent to balance a pole

Simulators
Simulations model a system in a virtual representation of its intended physical environment. Simulations are an
alternative approach to creating learning policies by hand or collecting large amounts of real-world training data.
Simulations allow training in hazardous environments, or in conditions difficult to reproduce in the real world.
Simulations are the ideal training source for DRL because they're:
Flexible for creating custom environments.
Safe and cost-effective for data generation.
Parallelizable across training machines, allowing for faster training times.
Simulations are available across a broad range of industries and systems, including mechanical and electrical
engineering, autonomous vehicles, security and networking, transportation and logistics, and robotics. Simulation
tools include:
Simulink, a graphical programming tool developed by MathWorks for modeling, simulating, and analyzing
dynamic systems.
Gazebo, a tool to allow accurate simulation of populations of robots in complex indoor and outdoor
environments.
Microsoft AirSim, an open-source robotics simulation platform.
The Bonsai platform includes Simulink and AnyLogic simulators, and you can add other simulators.
AirSim
Microsoft AirSim (Aerial Informatics and Robotics Simulation) is an open-source robotics simulation platform
designed to train autonomous systems. AirSim provides a realistic simulation tool for designers and developers to
generate the large amounts of data they need for model training and debugging.
AirSim can capture data for models from ground vehicles, wheeled robotics, aerial drones, and even static IoT
devices, without costly field operations.

AirSim works as a plug-in to Epic Games' Unreal Engine editor, providing control over building environments and
simulating difficult-to-reproduce, real-world events to capture meaningful data. AirSim leverages current game
engine rendering, physics, and perception computation to create an accurate, real-world simulation.
This realism, based on efficiently generated ground-truth data, enables the study and execution of complex
missions that are time-consuming or risky in the real world. For example, AirSim provides realistic environments,
vehicle dynamics, and multi-modal sensing for researchers building autonomous vehicles. Collisions in a simulator
cost virtually nothing, yet provide actionable information to improve the design of the system.
You can use an Azure Resource Manager (ARM) template to automatically create a development environment, and
code and debug a Python application connected to AirSim in Visual Studio Code. For more information, see AirSim
Development Environment on Azure.

Next steps
Machine teaching
Autonomous systems with Microsoft AI
Autonomy for industrial control systems
Innovation space: Autonomous systems (Video)
The Microsoft AI Blog
Bonsai documentation
Aerial Informatics and Robotics Platform (AirSim)
AI at the Edge with Azure Stack Hub - disconnected
11/13/2020 • 2 minutes to read

Solution Idea

With the Azure AI tools and cloud platform, the next generation of AI-enabled hybrid applications can run where
your data lives. With Azure Stack Hub, bring a trained AI model to the edge and integrate it with your applications
for low-latency intelligence, with no tool or process changes for local applications. With Azure Stack Hub, you can
ensure that your cloud solutions work even when disconnected from the internet.

Architecture

Download an SVG of this architecture.

Data Flow
1. Data scientists train a model using Azure Machine Learning Studio (classic) and an HDInsight cluster. The model
is containerized and put into an Azure Container Registry.
2. The model is deployed (via steps not represented in the diagram) to a Kubernetes cluster on Azure Stack Hub.
3. End users provide data that is scored against the model.
4. Insights and anomalies from scoring are placed into storage for later upload.
5. Globally-relevant and compliant insights are available in the global app.
6. Data from edge scoring is used to improve the model.

Components
HDInsight: Provision cloud Hadoop, Spark, R Server, HBase, and Storm clusters
Azure Machine Learning Studio (classic): Easily build, deploy, and manage predictive analytics solutions
Virtual Machines: Provision Windows and Linux virtual machines in seconds
Azure Kubernetes Service (AKS): Simplify the deployment, management, and operations of Kubernetes
Storage: Durable, highly available, and massively scalable cloud storage
Azure Stack Hub: Build and run innovative hybrid applications across cloud boundaries

Next steps
HDInsight documentation
Azure Machine Learning Studio (classic) documentation
Virtual Machines documentation
Azure Kubernetes Service (AKS) documentation
Storage documentation
Azure Stack Hub documentation
Baseball decision analysis with ML.NET and Blazor
11/2/2020 • 5 minutes to read

This scenario describes a web application, called the Baseball Machine Learning Workbench, which provides an
interface for non-technical users to use artificial intelligence (AI) and machine learning (ML) to perform decision
analysis techniques in order to rapidly gain insights and make informed predictions.
This solution uses historical baseball data to generate National Baseball Hall of Fame insights. Machine intelligence
powers the what-if analysis, decision thresholding, and improvements over traditional rule-based systems. User-
friendly interface controls set adjustable parameters and surface the results in real time, with clear visual cues to
highlight positive or negative outcomes.
The architecture provides rapid results by using in-memory models and rapid two-way communication between
the user and server. Delta rendering pushes down only modified content to the web browser, updating the display
without having to reload the entire page. This solution can scale to tens of in-memory models serving hundreds of
concurrent sessions in real time.
The following article explains the architecture of the Baseball Machine Learning Workbench, where to get the
source code for it, and how to deploy it. You can also view a live demo of this solution.

Relevant use cases


Consider this scenario for the following use cases:
Decision analysis system replacement with AI and ML
AI-assisted decision support systems or decision management systems
Fantasy baseball
Financial forecasting
Business goal and objective modeling (budgeting and project planning, for example)

Architecture
The processing sequence in this solution flows as follows:
1. The user accesses the workbench application with any browser. They choose which analysis method to
employ and then which player to analyze.
2. SignalR brokers two-way communication with the server in real time.
3. Azure App Service hosts the application, including AI logic with the machine learning models.
4. One of three different decision analysis mechanisms is utilized, depending on which mode the user has
selected.
5. Historical data is analyzed using the designated set of rules or ML models in ML.NET, operating in-memory
for very quick inference.
6. Blazor Server surfaces the results to the end user’s browser, updating only the portions of the interface that
have changed, and transmits back to the user via SignalR.
7. Azure Application Insights is optionally used to monitor performance and instrumentation resources as
needed.
Components
The following assets and technologies were used to craft the Baseball Machine Learning Workbench:
Azure App Service enables you to build and host web applications in the programming language of your
choice without managing infrastructure.
.NET Core 3.1 is an open-source, cross-platform, general-purpose development framework that runs on
Windows, Linux, and macOS platforms.
ML.NET is a cross-platform, open-source framework for creating machine learning and artificial intelligence
models using .NET. It is used in the inference sections of this application.
Blazor Server lets you build interactive web UIs using C# instead of JavaScript, as in this solution. In this
architecture, Blazor renders the UI on the server and pushes HTML and other changes out to the browser.
SignalR provides asynchronous communication between the browser and the server (including Blazor). It
handles event updates, UI updates, and any processing done on the server (such as model inference).
Visual Studio 2019 is the software programming environment used for this project. This architecture uses
cross-platform components, so either the Windows or Mac version can be used.
Azure Application Insights (a feature of Azure Monitor) can be used for performance monitoring and
analytics and to drive autoscaling.

Considerations
Scalability
This solution uses the prediction engine functionality in ML.NET to scale the model response times. Object pooling
allows the ML.NET models to be accessed by multiple requests in a thread-safe manner. Learn more about ML.NET
object pooling in Deploy a model in an ASP.NET Core Web API.
Azure App Service is used for hosting the workbench in the cloud. With App Service you can automatically scale the
number of instances that run your app, letting you keep up with customer demand. For more information on
autoscale, refer to Autoscaling best practices in the Azure Architecture Center.
In Blazor Server, the state of many components might be maintained concurrently by the server. Because of this,
memory exhaustion is a concern that must be addressed. For guidance on how to author a Blazor Server app to
help ensure the best use of server memory, consult Threat mitigation guidance for ASP.NET Core Blazor Server.
Applying these best practices allows a server-side Blazor application to scale to thousands of concurrent users—
even on relatively small server hosts.
General guidance on designing scalable solutions is provided in the Azure Architecture Center’s Performance
efficiency checklist.
Resiliency and support
Use .NET Core 3.1.x because it is a Long Term Support (LTS) release. Although Blazor Server is also available in .NET
Core 3.0, that is not an LTS release and thus continuing compatibility with future component updates is not assured.
Learn more about the .NET Core Support Policy.

Deploy this solution


All of the source code for this scenario is available in the Baseball Machine Learning Workbench repository. This
solution is open source and provided with an MIT License.
Prerequisites
For online deployments, you must have an existing Azure account. If you need one, create a free Azure account
before you begin.
For deployment as an Azure application instance, you need the Visual Studio 2019 IDE and you must have Git
installed locally.
The historical baseball data used for the analysis and machine learning models comes from Sean Lahman’s
Baseball Database, which is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. The
Major League Baseball data itself is public domain.
Deployment to Azure
1. Make sure you have your Azure subscription information handy.
2. Start by cloning the workbench GitHub repository:
git clone https://github.com/bartczernicki/MachineLearning-BaseballPrediction-BlazorApp.git

3. Follow the instructions provided in the GETSTARTED.md file.


Alternative: Docker container
This application is also available as a complete, ready-to-run Docker container downloadable from Docker Hub.
The container can be run locally (offline) in your own environment. It can also be deployed online in an Azure
Container Instance. Instructions for getting started with either use case are provided in the main GitHub repo’s Get
Started documentation:
Run the Docker Container locally in your own environment
Publish Docker Container to the Azure Cloud using Azure Container Instances

Related resources
View a live demo of this solution
Baseball HOF prediction using R mlr and DALEX packages is a GitHub repo using R and cutting edge “black
box” model techniques to explain ML.NET models related to this workload
Blazor documentation
ML.NET documentation
ASP.NET Core Blazor hosting models
MLOps (DevOps for Machine Learning) helps data science teams deliver innovation faster, increasing the
pace of ML model development
Learn about the National Baseball Hall of Fame voting process and rules
XAI Stories: Case Studies for Explainable Artificial Intelligence (Warsaw University of Technology and
University of Warsaw, 2020)
Batch scoring of deep learning models on Azure
11/2/2020 • 8 minutes to read

This reference architecture shows how to apply neural style transfer to a video, using Azure Machine Learning.
Style transfer is a deep learning technique that composes an existing image in the style of another image. This
architecture can be generalized for any scenario that uses batch scoring with deep learning. Deploy this
solution.

Scenario : A media organization has a video whose style they want to change to look like a specific painting. The
organization wants to be able to apply this style to all frames of the video in a timely manner and in an automated
fashion. For more background about neural style transfer algorithms, see Image Style Transfer Using
Convolutional Neural Networks (PDF).

The online article shows the style image, the input/content video, and the output video (click-to-view links).

This reference architecture is designed for workloads that are triggered by the presence of new media in Azure
storage.
Processing involves the following steps:
1. Upload a video file to storage.
2. The video file triggers a Logic App to send a request to the Azure Machine Learning pipeline published
endpoint.
3. The pipeline processes the video, applies style transfer with MPI, and postprocesses the video.
4. The output is saved back to blob storage once the pipeline is completed.

Architecture
This architecture consists of the following components.
Compute
Azure Machine Learning uses pipelines to create reproducible and easy-to-manage sequences of computation.
It also offers a managed compute target (on which a pipeline computation can run) called Azure Machine Learning
Compute for training, deploying, and scoring machine learning models.
Storage
Blob storage is used to store all images (input images, style images, and output images). Azure Machine
Learning integrates with Blob storage so that users do not have to manually move data across compute platforms
and Blob storage. Blob storage is also very cost-effective for the performance that this workload requires.
Trigger / scheduling
Azure Logic Apps is used to trigger the workflow. When the Logic App detects that a blob has been added to the
container, it triggers the Azure Machine Learning Pipeline. Logic Apps is a good fit for this reference architecture
because it's an easy way to detect changes to blob storage and provides an easy process for changing the trigger.
Preprocessing and postprocessing our data
This reference architecture uses video footage of an orangutan in a tree. You can download the footage from here.
1. Use FFmpeg to extract the audio file from the video footage, so that the audio file can be stitched back into the
output video later.
2. Use FFmpeg to break the video into individual frames. The frames will be processed independently, in parallel.
3. At this point, we can apply neural style transfer to each individual frame in parallel.
4. Once each frame has been processed, we need to use FFmpeg to restitch the frames back together.
5. Finally, we reattach the audio file to the restitched footage.
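The reference implementation drives FFmpeg from its own scripts. As an illustration only, the following sketch shows how these pre- and post-processing steps might be scripted with Python's subprocess module; the file and folder names, frame rate, and FFmpeg flags are assumptions, not the exact ones used by the sample.

import os
import subprocess

VIDEO = "orangutan.mp4"            # hypothetical input file name
os.makedirs("frames", exist_ok=True)
os.makedirs("styled", exist_ok=True)

# 1. Extract the audio track so it can be reattached later.
subprocess.run(["ffmpeg", "-i", VIDEO, "-vn", "-c:a", "copy", "audio.aac"], check=True)

# 2. Break the video into individual frames for parallel processing.
subprocess.run(["ffmpeg", "-i", VIDEO, "frames/%05d.jpg"], check=True)

# 3. Neural style transfer runs on each frame in parallel (not shown here).

# 4. Restitch the processed frames back into a video (30 fps is an assumption).
subprocess.run(["ffmpeg", "-framerate", "30", "-i", "styled/%05d.jpg",
                "-c:v", "libx264", "-pix_fmt", "yuv420p", "styled_noaudio.mp4"], check=True)

# 5. Reattach the original audio track to the restitched footage.
subprocess.run(["ffmpeg", "-i", "styled_noaudio.mp4", "-i", "audio.aac",
                "-c", "copy", "output.mp4"], check=True)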

Performance considerations
GPU versus CPU
For deep learning workloads, GPUs will generally out-perform CPUs by a considerable amount, to the extent that a
sizeable cluster of CPUs is usually needed to get comparable performance. While it's an option to use only CPUs in
this architecture, GPUs will provide a much better cost/performance profile. We recommend using the latest
NCv3 series of GPU-optimized VMs.
GPUs are not enabled by default in all regions. Make sure to select a region with GPUs enabled. In addition,
subscriptions have a default quota of zero cores for GPU-optimized VMs. You can raise this quota by opening a
support request. Make sure that your subscription has enough quota to run your workload.
Parallelizing across VMs versus cores
When running a style transfer process as a batch job, the jobs that run primarily on GPUs will have to be
parallelized across VMs. Two approaches are possible: You can create a larger cluster using VMs that have a single
GPU, or create a smaller cluster using VMs with many GPUs.
For this workload, these two options will have comparable performance. Using fewer VMs with more GPUs per
VM can help to reduce data movement. However, the data volume per job for this workload is not very big, so you
won't observe much throttling by blob storage.
MPI step
When creating the Azure Machine Learning pipeline, one of the steps used to perform parallel computation is the
MPI step. The MPI step helps split the data evenly across the available nodes. The MPI step will not execute
until all the requested nodes are ready. Should one node fail or get preempted (if it is a low-priority virtual
machine), the MPI step will have to be rerun.
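As a rough illustration, an MPI step might be defined with the Azure Machine Learning SDK (v1) along the following lines. This is a sketch only: the cluster, folder, and script names are hypothetical, and the actual pipeline in the GitHub repo differs in its details.

from azureml.core import Workspace
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import MpiStep

ws = Workspace.from_config()                       # assumes a local config.json
cluster = ws.compute_targets["gpu-cluster"]        # hypothetical GPU cluster name

style_transfer_step = MpiStep(
    name="style-transfer",
    source_directory="scripts",                    # hypothetical folder containing the script
    script_name="style_transfer.py",               # hypothetical script name
    arguments=["--input", "frames", "--output", "styled"],
    compute_target=cluster,
    node_count=4,                                  # the step waits until all 4 nodes are ready
    process_count_per_node=1,
)

pipeline = Pipeline(workspace=ws, steps=[style_transfer_step])
run = pipeline.submit(experiment_name="style-transfer")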

Security considerations
Restricting access to Azure blob storage
In this reference architecture, Azure blob storage is the main storage component that needs to be protected. The
baseline deployment shown in the GitHub repo uses storage account keys to access the blob storage. For further
control and protection, consider using a shared access signature (SAS) instead. This grants limited access to
objects in storage, without needing to hard code the account keys or save them in plaintext. This approach is
especially useful because account keys are visible in plaintext inside of Logic App's designer interface. Using an
SAS also helps to ensure that the storage account has proper governance, and that access is granted only to the
people intended to have it.
For scenarios with more sensitive data, make sure that all of your storage keys are protected, because these keys
grant full access to all input and output data from the workload.
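For illustration, a time-limited SAS for the input container could be generated with the azure-storage-blob Python package as sketched below. The account, container, and permission choices are assumptions, and in practice the account key itself should come from a secret store such as Key Vault rather than source code.

from datetime import datetime, timedelta
from azure.storage.blob import generate_container_sas, ContainerSasPermissions

account_name = "mystorageacct"                     # hypothetical storage account
container_name = "input-videos"                    # hypothetical container
account_key = "<storage-account-key>"              # retrieve from Key Vault, not hard-coded

sas_token = generate_container_sas(
    account_name=account_name,
    container_name=container_name,
    account_key=account_key,
    permission=ContainerSasPermissions(read=True, write=True, list=True),
    expiry=datetime.utcnow() + timedelta(hours=2),  # short-lived token
)

# The token is appended to the container URL when granting access.
container_url = f"https://{account_name}.blob.core.windows.net/{container_name}?{sas_token}"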
Data encryption and data movement
This reference architecture uses style transfer as an example of a batch scoring process. For more data-sensitive
scenarios, the data in storage should be encrypted at rest. Each time data is moved from one location to the next,
use SSL to secure the data transfer. For more information, see Azure Storage security guide.
Securing your computation in a virtual network
When deploying your Machine Learning compute cluster, you can configure your cluster to be provisioned inside
a subnet of a virtual network. This allows the compute nodes in the cluster to communicate securely with other
virtual machines.
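A minimal sketch of provisioning the compute cluster into an existing subnet with the Azure Machine Learning SDK (v1) is shown below; the VM size and all of the network names are assumptions.

from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

vnet_config = AmlCompute.provisioning_configuration(
    vm_size="Standard_NC6s_v3",                # assumed GPU SKU
    max_nodes=4,
    vnet_resourcegroup_name="my-network-rg",   # hypothetical network resource group
    vnet_name="my-vnet",                       # hypothetical virtual network
    subnet_name="aml-subnet",                  # hypothetical subnet
)

cluster = ComputeTarget.create(ws, "gpu-cluster-vnet", vnet_config)
cluster.wait_for_completion(show_output=True)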
Protecting against malicious activity
In scenarios where there are multiple users, make sure that sensitive data is protected against malicious activity. If
other users are given access to this deployment to customize the input data, take note of the following
precautions and considerations:
Use RBAC to limit users' access to only the resources they need.
Provision two separate storage accounts. Store input and output data in the first account. External users can be
given access to this account. Store executable scripts and output log files in the other account. External users
should not have access to this account. This will ensure that external users cannot modify any executable files
(to inject malicious code), and don't have access to log files, which could hold sensitive information.
Malicious users can DDoS the job queue or inject malformed poison messages in the job queue, causing the
system to lock up or causing dequeuing errors.

Monitoring and logging


Monitoring Batch jobs
While running your job, it's important to monitor the progress and make sure that things are working as expected.
However, it can be a challenge to monitor across a cluster of active nodes.
To check the overall state of the cluster, go to the Machine Learning blade of the Azure portal to inspect the state
of the nodes in the cluster. If a node is inactive or a job has failed, the error logs are saved to blob storage, and are
also accessible in the Azure portal.
Monitoring can be further enriched by connecting logs to Application Insights or by running separate processes to
poll for the state of the cluster and its jobs.
Logging with Azure Machine Learning
Azure Machine Learning will automatically log all stdout/stderr to the associated blob storage account. Unless
otherwise specified, your Azure Machine Learning workspace will automatically provision a storage account and
dump your logs into it. You can also use a storage navigation tool such as Storage Explorer which will provide a
much easier experience for navigating log files.

Cost considerations
Compared to the storage and scheduling components, the compute resources used in this reference architecture
by far dominate in terms of costs. One of the main challenges is effectively parallelizing the work across a cluster
of GPU-enabled machines.
The Azure Machine Learning Compute cluster size can automatically scale up and down depending on the jobs in
the queue. You can enable auto-scale programmatically by setting the minimum and maximum nodes.
For work that doesn't require immediate processing, configure auto-scale so the default state (minimum) is a
cluster of zero nodes. With this configuration, the cluster starts with zero nodes and only scales up when it detects
jobs in the queue. If the batch scoring process only happens a few times a day or less, this setting enables
significant cost savings.
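For example, a scale-to-zero cluster can be requested at provisioning time with the SDK (v1); the VM size, node counts, and idle timeout below are assumptions to adapt to your workload.

from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

config = AmlCompute.provisioning_configuration(
    vm_size="Standard_NC6s_v3",            # assumed GPU SKU
    min_nodes=0,                           # cluster scales down to zero between jobs
    max_nodes=8,                           # upper bound when the queue fills up
    idle_seconds_before_scaledown=1200,    # assumed idle timeout before scale-down
)

cluster = ComputeTarget.create(ws, "scoring-cluster", config)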
Auto-scaling may not be appropriate for batch jobs that happen too close to each other. The time that it takes for a
cluster to spin up and spin down also incurs a cost, so if a batch workload begins only a few minutes after the
previous job ends, it might be more cost effective to keep the cluster running between jobs.
Azure Machine Learning Compute also supports low-priority virtual machines. This allows you to run your
computation on discounted virtual machines, with the caveat that they may be preempted at any time. Low-
priority virtual machines are ideal for non-critical batch scoring workloads.

Deploy the solution


To deploy this reference architecture, follow the steps described in the GitHub repo.

NOTE
You can also deploy a batch scoring architecture for deep learning models using the Azure Kubernetes Service. Follow the
steps described in this GitHub repo.
Batch scoring of Python machine learning models on
Azure
11/2/2020 • 5 minutes to read • Edit Online

This reference architecture shows how to build a scalable solution for batch scoring many models on a schedule in
parallel using Azure Machine Learning. The solution can be used as a template and can generalize to different
problems.
A reference implementation for this architecture is available on GitHub.

Scenario : This solution monitors the operation of a large number of devices in an IoT setting where each device
sends sensor readings continuously. Each device is assumed to be associated with pretrained anomaly detection
models that are used to predict whether a series of measurements, aggregated over a predefined time interval,
corresponds to an anomaly. In real-world scenarios, this could be a stream of sensor readings
that need to be filtered and aggregated before being used in training or real-time scoring. For simplicity, this
solution uses the same data file when executing scoring jobs.
This reference architecture is designed for workloads that are triggered on a schedule. Processing involves the
following steps:
1. Send sensor readings for ingestion to Azure Event Hubs.
2. Perform stream processing and store the raw data.
3. Send the data to a Machine Learning cluster that is ready to start taking work. Each node in the cluster runs a
scoring job for a specific sensor.
4. Execute the scoring pipeline, which runs the scoring jobs in parallel using machine learning Python scripts. The
pipeline is created, published, and scheduled to run on a predefined interval of time (see the sketch after this list).
5. Generate predictions and store them in Blob storage for later consumption.
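A rough sketch of publishing and scheduling such a pipeline with the Azure Machine Learning SDK (v1) follows. The script, cluster, and schedule names are hypothetical, and the hourly recurrence is only an assumption.

from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, Schedule, ScheduleRecurrence
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

scoring_step = PythonScriptStep(
    name="score-sensors",
    script_name="score.py",                             # hypothetical scoring script
    source_directory="scripts",                         # hypothetical folder
    compute_target=ws.compute_targets["cpu-cluster"],   # hypothetical cluster name
)

published = Pipeline(workspace=ws, steps=[scoring_step]).publish(
    name="batch-scoring-pipeline",
    description="Runs the scoring jobs in parallel",
)

recurrence = ScheduleRecurrence(frequency="Hour", interval=1)   # assumed interval
Schedule.create(
    ws,
    name="hourly-scoring",
    pipeline_id=published.id,
    experiment_name="batch-scoring",
    recurrence=recurrence,
)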

Architecture
This architecture consists of the following components:
Azure Event Hubs. This message ingestion service can ingest millions of event messages per second. In this
architecture, sensors send a stream of data to the event hub.
Azure Stream Analytics. An event-processing engine. A Stream Analytics job reads the data streams from the event
hub and performs stream processing.
Azure SQL Database. Data from the sensor readings is loaded into SQL Database. SQL is a familiar way to store the
processed, streamed data (which is tabular and structured), but other data stores can be used.
Azure Machine Learning. Azure Machine Learning is a cloud service for training, scoring, deploying, and managing
machine learning models at scale. In the context of batch scoring, Azure Machine Learning creates a cluster of
virtual machines on demand with an automatic scaling option, where each node in the cluster runs a scoring job
for a specific sensor. The scoring jobs are executed in parallel as Python-script steps that are queued and managed
by the service. These steps are part of a Machine Learning pipeline that is created, published, and scheduled to run
on a predefined interval of time.
Azure Blob Storage. Blob containers are used to store the pretrained models, the data, and the output predictions.
The models are uploaded to Blob storage in the 01_create_resources.ipynb notebook. These one-class SVM
models are trained on data that represents values of different sensors for different devices. This solution assumes
that the data values are aggregated over a fixed interval of time.
Azure Container Registry. The scoring Python script runs in Docker containers that are created on each node of the
cluster, where it reads the relevant sensor data, generates predictions and stores them in Blob storage.

Performance considerations
For standard Python models, it's generally accepted that CPUs are sufficient to handle the workload. This
architecture uses CPUs. However, for deep learning workloads, GPUs generally outperform CPUs by a considerable
amount — a sizeable cluster of CPUs is usually needed to get comparable performance.
Parallelizing across VMs versus cores
When running scoring processes of many models in batch mode, the jobs need to be parallelized across VMs. Two
approaches are possible:
Create a larger cluster using low-cost VMs.
Create a smaller cluster using high performing VMs with more cores available on each.
In general, scoring of standard Python models is not as demanding as scoring of deep learning models, and a
small cluster should be able to handle a large number of queued models efficiently. You can increase the number
of cluster nodes as the dataset sizes increase.
For convenience in this scenario, one scoring task is submitted within a single Azure Machine Learning pipeline
step. However, it can be more efficient to score multiple data chunks within the same pipeline step. In those cases,
write custom code to read in multiple datasets and execute the scoring script for those during a single-step
execution.

Management considerations
Monitor jobs . It's important to monitor the progress of running jobs, but it can be a challenge to monitor
across a cluster of active nodes. To inspect the state of the nodes in the cluster, use the Azure portal to manage
the machine learning workspace. If a node is inactive or a job has failed, the error logs are saved to blob
storage, and are also accessible in the Pipelines section. For richer monitoring, connect logs to Application
Insights, or run separate processes to poll for the state of the cluster and its jobs.
Logging . Machine Learning logs all stdout/stderr to the associated Azure Storage account. To easily view the
log files, use a storage navigation tool such as Azure Storage Explorer.
Cost considerations
The most expensive components used in this reference architecture are the compute resources. The compute
cluster size scales up and down depending on the jobs in the queue. Enable automatic scaling programmatically
through the Python SDK by modifying the compute's provisioning configuration. Or use the Azure CLI to set the
automatic scaling parameters of the cluster.
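As an illustration of the SDK (v1) route, the automatic scaling limits of an existing cluster can be adjusted in place; the cluster name and values below are assumptions.

from azureml.core import Workspace
from azureml.core.compute import ComputeTarget

ws = Workspace.from_config()
cluster = ComputeTarget(workspace=ws, name="scoring-cluster")   # hypothetical existing cluster

# Adjust the automatic scaling parameters without recreating the cluster.
cluster.update(min_nodes=0, max_nodes=8, idle_seconds_before_scaledown=1200)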
For work that doesn't require immediate processing, configure the automatic scaling formula so the default state
(minimum) is a cluster of zero nodes. With this configuration, the cluster starts with zero nodes and only scales up
when it detects jobs in the queue. If the batch scoring process happens only a few times a day or less, this setting
enables significant cost savings.
Automatic scaling may not be appropriate for batch jobs that happen too close to each other. The time that it takes
for a cluster to spin up and spin down also incurs a cost, so if a batch workload begins only a few minutes after the
previous job ends, it might be more cost effective to keep the cluster running between jobs. That depends on
whether scoring processes are scheduled to run at a high frequency (every hour, for example), or less frequently
(once a month, for example).

Deployment
To deploy this reference architecture, follow the steps described in the GitHub repo.
Batch scoring of Spark machine learning models on
Azure Databricks
11/2/2020 • 5 minutes to read • Edit Online

This reference architecture shows how to build a scalable solution for batch scoring an Apache Spark classification
model on a schedule using Azure Databricks, an Apache Spark-based analytics platform optimized for Azure. The
solution can be used as a template that can be generalized to other scenarios.
A reference implementation for this architecture is available on GitHub.

Scenario : A business in an asset-heavy industry wants to minimize the costs and downtime associated with
unexpected mechanical failures. Using IoT data collected from their machines, they can create a predictive
maintenance model. This model enables the business to maintain components proactively and repair them before
they fail. By maximizing mechanical component use, they can control costs and reduce downtime.
A predictive maintenance model collects data from the machines and retains historical examples of component
failures. The model can then be used to monitor the current state of the components and predict if a given
component will fail in the near future. For common use cases and modeling approaches, see Azure AI guide for
predictive maintenance solutions.
This reference architecture is designed for workloads that are triggered by the presence of new data from the
component machines. Processing involves the following steps:
1. Ingest the data from the external data store onto an Azure Databricks data store.
2. Train a machine learning model by transforming the data into a training data set, then building a Spark
MLlib model. MLlib consists of most common machine learning algorithms and utilities optimized to take
advantage of Spark data scalability capabilities.
3. Apply the trained model to predict (classify) component failures by transforming the data into a scoring
data set. Score the data with the Spark MLLib model.
4. Store results on the Databricks data store for post-processing consumption.
Notebooks are provided on GitHub to perform each of these tasks.
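Steps 2 and 3 above train and then apply a Spark MLlib model. As a highly simplified sketch (the notebooks in the repo use their own feature engineering and algorithm choices), the pattern inside a Databricks notebook looks roughly like this; the column names are hypothetical, and training_df and scoring_df are assumed to be Spark DataFrames produced by the feature engineering notebook.

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# training_df and scoring_df are assumed to exist (created by the feature engineering notebook).

# Assemble the sensor features into a single vector column (names are hypothetical).
assembler = VectorAssembler(
    inputCols=["volt", "rotate", "pressure", "vibration"],
    outputCol="features",
)
classifier = RandomForestClassifier(labelCol="failure", featuresCol="features")

# Train on the training data set.
model = Pipeline(stages=[assembler, classifier]).fit(training_df)

# Batch scoring: apply the fitted pipeline to the scoring data set and persist the results.
predictions = model.transform(scoring_df)
predictions.write.mode("overwrite").saveAsTable("scored_results")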

Architecture
The architecture defines a data flow that is entirely contained within Azure Databricks based on a set of
sequentially executed notebooks. It consists of the following components:
Data files . The reference implementation uses a simulated data set contained in five static data files.
Ingestion . The data ingestion notebook downloads the input data files into a collection of Databricks data sets. In
a real-world scenario, data from IoT devices would stream onto Databricks-accessible storage such as Azure SQL
Server or Azure Blob storage. Databricks supports multiple data sources.
Training pipeline . This notebook executes the feature engineering notebook to create an analysis data set from
the ingested data. It then executes a model building notebook that trains the machine learning model using the
Apache Spark MLlib scalable machine learning library.
Scoring pipeline . This notebook executes the feature engineering notebook to create a scoring data set from the
ingested data and executes the scoring notebook. The scoring notebook uses the trained Spark MLlib model to
generate predictions for the observations in the scoring data set. The predictions are stored in the results store, a
new data set on the Databricks data store.
Scheduler . A scheduled Databricks job handles batch scoring with the Spark model. The job executes the scoring
pipeline notebook, passing variable arguments through notebook parameters to specify the details for
constructing the scoring data set and where to store the results data set.
The scenario is constructed as a pipeline flow. Each notebook is optimized to perform in a batch setting for each of
the operations: ingestion, feature engineering, model building, and model scoring. To accomplish this, the feature
engineering notebook is designed to generate a general data set for any of the training, calibration, testing, or
scoring operations. In this scenario, we use a temporal split strategy for these operations, so the notebook
parameters are used to set date-range filtering.
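Inside the scoring pipeline notebook, those parameters are typically read with Databricks widgets. A minimal sketch follows, in which the widget names, default dates, and table name are assumptions rather than the exact names used in the repo (dbutils and spark are provided by the Databricks runtime).

# Declare the notebook parameters that the scheduled job passes in (names and defaults assumed).
dbutils.widgets.text("start_date", "2015-01-01")
dbutils.widgets.text("to_date", "2015-03-31")
dbutils.widgets.text("results_table", "scored_results")

start_date = dbutils.widgets.get("start_date")
to_date = dbutils.widgets.get("to_date")
results_table = dbutils.widgets.get("results_table")

# Apply the temporal (date-range) filter that selects the scoring window.
scoring_df = (
    spark.table("featurized_data")                     # hypothetical feature table
         .filter(f"dt >= '{start_date}' AND dt <= '{to_date}'")
)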
Because the scenario creates a batch pipeline, we provide a set of optional examination notebooks to explore the
output of the pipeline notebooks. You can find these in the GitHub repository:
1a_raw-data_exploring
2a_feature_exploration
2b_model_testing
3b_model_scoring_evaluation

Recommendations
Databricks is set up so you can load and deploy your trained models to make predictions with new data. We used
Databricks for this scenario because it provides these additional advantages:
Single sign-on support using Azure Active Directory credentials.
Job scheduler to execute jobs for production pipelines.
Fully interactive notebook with collaboration, dashboards, REST APIs.
Unlimited clusters that can scale to any size.
Advanced security, role-based access controls, and audit logs.
To interact with the Azure Databricks service, use the Databricks Workspace interface in a web browser or the
command-line interface (CLI). Access the Databricks CLI from any platform that supports Python 2.7.9 to 3.6.
The reference implementation uses notebooks to execute tasks in sequence. Each notebook stores intermediate
data artifacts (training, test, scoring, or results data sets) to the same data store as the input data. The goal is to
make it easy for you to use it as needed in your particular use case. In practice, you would connect your data
source to your Azure Databricks instance for the notebooks to read and write directly back into your storage.
You can monitor job execution through the Databricks user interface, the data store, or the Databricks CLI as
necessary. Monitor the cluster using the event log and other metrics that Databricks provides.

Performance considerations
An Azure Databricks cluster enables autoscaling by default so that during runtime, Databricks dynamically
reallocates workers to account for the characteristics of your job. Certain parts of your pipeline may be more
computationally demanding than others. Databricks adds additional workers during these phases of your job (and
removes them when they're no longer needed). Autoscaling makes it easier to achieve high cluster utilization,
because you don't need to provision the cluster to match a workload.
Additionally, more complex scheduled pipelines can be developed by using Azure Data Factory with Azure
Databricks.

Storage considerations
In this reference implementation, the data is stored directly within Databricks storage for simplicity. In a production
setting, however, the data can be stored on cloud data storage such as Azure Blob Storage. Databricks also
supports Azure Data Lake Store, Azure Synapse Analytics, Azure Cosmos DB, Apache Kafka, and Hadoop.

Cost considerations
Azure Databricks is a premium Spark offering with an associated cost. In addition, there are standard and premium
Databricks pricing tiers.
For this scenario, the standard pricing tier is sufficient. However, if your specific application requires automatically
scaling clusters to handle larger workloads or interactive Databricks dashboards, the premium level could increase
costs further.
The solution notebooks can run on any Spark-based platform with minimal edits to remove the Databricks-specific
packages. See the following similar solutions for various Azure platforms:
SQL Server R services
PySpark on an Azure Data Science Virtual Machine

Deploy the solution


To deploy this reference architecture, follow the steps described in the GitHub repository to build a scalable
solution for scoring Spark models in batch on Azure Databricks.

Related architectures
We have also built a reference architecture that uses Spark for building real-time recommendation systems with
offline, pre-computed scores. These recommendation systems are common scenarios where scores are batch-
processed.
Batch scoring of R machine learning models on Azure
11/13/2020 • 5 minutes to read • Edit Online

This reference architecture shows how to perform batch scoring with R models using Azure Batch. The scenario is
based on retail store sales forecasting, but this architecture can be generalized for any scenario requiring the
generation of predictions on a larger scale using R models. A reference implementation for this architecture is
available on GitHub.

Scenario : A supermarket chain needs to forecast sales of products over the upcoming quarter. The forecast allows
the company to manage its supply chain better and ensure it can meet demand for products at each of its stores.
The company updates its forecasts every week as new sales data from the previous week becomes available and
the product marketing strategy for next quarter is set. Quantile forecasts are generated to estimate the uncertainty
of the individual sales forecasts.
Processing involves the following steps:
1. An Azure Logic App triggers the forecast generation process once per week.
2. The logic app starts an Azure Container Instance running the scheduler Docker container, which triggers the
scoring jobs on the Batch cluster.
3. Scoring jobs run in parallel across the nodes of the Batch cluster. Each node:
a. Pulls the worker Docker image and starts a container.
b. Reads input data and pre-trained R models from Azure Blob storage.
c. Scores the data to produce the forecasts.
d. Writes the forecast results to blob storage.
The figure below shows the forecasted sales for four products (SKUs) in one store. The black line is the sales
history, the dashed line is the median (q50) forecast, the pink band represents the twenty-fifth and seventy-fifth
percentiles, and the blue band represents the fifth and ninety-fifth percentiles.
Architecture
This architecture consists of the following components.
Azure Batch is used to run forecast generation jobs in parallel on a cluster of virtual machines. Predictions are
made using pre-trained machine learning models implemented in R. Azure Batch can automatically scale the
number of VMs based on the number of jobs submitted to the cluster. On each node, an R script runs within a
Docker container to score data and generate forecasts.
Azure Blob Storage is used to store the input data, the pre-trained machine learning models, and the forecast
results. It delivers very cost-effective storage for the performance that this workload requires.
Azure Container Instances provide serverless compute on demand. In this case, a container instance is deployed on
a schedule to trigger the Batch jobs that generate the forecasts. The Batch jobs are triggered from an R script using
the doAzureParallel package. The container instance automatically shuts down once the jobs have finished.
Azure Logic Apps trigger the entire workflow by deploying the container instances on a schedule. An Azure
Container Instances connector in Logic Apps allows an instance to be deployed upon a range of trigger events.

Performance considerations
Containerized deployment
With this architecture, all R scripts run within Docker containers. This ensures that the scripts run in a consistent
environment, with the same R version and packages versions, every time. Separate Docker images are used for the
scheduler and worker containers, because each has a different set of R package dependencies.
Azure Container Instances provides a serverless environment to run the scheduler container. The scheduler
container runs an R script that triggers the individual scoring jobs running on an Azure Batch cluster.
Each node of the Batch cluster runs the worker container, which executes the scoring script.
Parallelizing the workload
When batch scoring data with R models, consider how to parallelize the workload. The input data must be
partitioned somehow so that the scoring operation can be distributed across the cluster nodes. Try different
approaches to discover the best choice for distributing your workload. On a case-by-case basis, consider the
following:
How much data can be loaded and processed in the memory of a single node.
The overhead of starting each batch job.
The overhead of loading the R models.
In the scenario used for this example, the model objects are large, and it takes only a few seconds to generate a
forecast for individual products. For this reason, you can group the products and execute a single Batch job per
node. A loop within each job generates forecasts for the products sequentially. This method turns out to be the
most efficient way to parallelize this particular workload. It avoids the overhead of starting many smaller Batch
jobs and repeatedly loading the R models.
An alternative approach is to trigger one Batch job per product. Azure Batch automatically forms a queue of jobs
and submits them to be executed on the cluster as nodes become available. Use automatic scaling to adjust the
number of nodes in the cluster depending on the number of jobs. This approach makes more sense if it takes a
relatively long time to complete each scoring operation, justifying the overhead of starting the jobs and reloading
the model objects. This approach is also simpler to implement and gives you the flexibility to use automatic
scaling, an important consideration if the size of the total workload is not known in advance.

Monitoring and logging considerations


Monitoring Azure Batch jobs
Monitor and terminate Batch jobs from the Jobs pane of the Batch account in the Azure portal. Monitor the batch
cluster, including the state of individual nodes, from the Pools pane.
Logging with doAzureParallel
The doAzureParallel package automatically collects logs of all stdout/stderr for every job submitted on Azure
Batch. These can be found in the storage account created at setup. To view them, use a storage navigation tool such
as Azure Storage Explorer or Azure portal.
To quickly debug Batch jobs during development, print logs in your local R session using the getJobFiles function
of doAzureParallel.

Cost considerations
The compute resources used in this reference architecture are the most costly components. For this scenario, a
cluster of fixed size is created whenever the job is triggered and then shut down after the job has completed. Cost
is incurred only while the cluster nodes are starting, running, or shutting down. This approach is suitable for a
scenario where the compute resources required to generate the forecasts remain relatively constant from job to
job.
In scenarios where the amount of compute required to complete the job is not known in advance, it may be more
suitable to use automatic scaling. With this approach, the size of the cluster is scaled up or down depending on the
size of the job. Azure Batch supports a range of auto-scale formulae which you can set when defining the cluster
using the doAzureParallel API.
For some scenarios, the time between jobs may be too short to shut down and start up the cluster. In these cases,
keep the cluster running between jobs if appropriate.
Azure Batch and doAzureParallel support the use of low-priority VMs. These VMs come with a significant discount
but risk being appropriated by other higher priority workloads. The use of these VMs is therefore not
recommended for critical production workloads. However, they are very useful for experimental or development
workloads.

Deployment
To deploy this reference architecture, follow the steps described in the GitHub repo.
Business Process Management
11/2/2020 • 2 minutes to read • Edit Online

This architecture demonstrates how to use Knowledge Mining for business process management.
When organizations task employees with the review and research of technical data, it can be tedious to read page
after page of dense text. Knowledge mining helps employees quickly review these materials. In industries where
bidding competition is fierce, or when the diagnosis of a problem must be quick or in near real-time, companies
can use knowledge mining to avoid costly mistakes.

Data Flow
There are three steps: ingest, enrich, and explore. First, the unstructured and structured data is ingested. The data is
then enriched with AI to extract information and find latent structure. Finally, the newly structured data is explored
via search, existing business applications, or analytics solutions.
1. The user can ingest different types of content, such as SOWs, requests for proposal, invoice archives, sales team
correspondence, W2 forms, receipts, healthcare claim forms, bank statements, legal agreements, balance sheets,
income statements, cash flow statements, company disclosures, SEC documents, and annual reports.
2. This content is enriched by using optical character recognition, forms recognition, layout understanding, table
extraction, and key-value pair extraction.
3. Finally, the user can automatically populate data from invoices into ERP systems or databases, compile
enriched documents in the knowledge store and project them into tabular or object stores, and then surface trends
in an analytics dashboard, such as frequent issues, popular products, and much more.

Components
Key technologies used to implement tools for technical content review and research
Azure Cognitive Search
Microsoft Form Recognizer
Web API custom skill interface

Next steps
Using the knowledge mining solution accelerator to build an initial knowledge mining prototype with Azure
Cognitive Search.
Building custom skills with Microsoft's Custom Web API
Build a real-time recommendation API on Azure
11/2/2020 • 6 minutes to read • Edit Online

This reference architecture shows how to train a recommendation model using Azure Databricks and deploy it as
an API by using Azure Cosmos DB, Azure Machine Learning, and Azure Kubernetes Service (AKS). This
architecture can be generalized for most recommendation engine scenarios, including recommendations for
products, movies, and news.
A reference implementation for this architecture is available on GitHub

Scenario : A media organization wants to provide movie or video recommendations to its users. By providing
personalized recommendations, the organization meets several business goals, including increased click-through
rates, increased engagement on site, and higher user satisfaction.
This reference architecture is for training and deploying a real-time recommender service API that can provide
the top 10 movie recommendations for a given user.
The data flow for this recommendation model is as follows:
1. Track user behaviors. For example, a backend service might log when a user rates a movie or clicks a
product or news article.
2. Load the data into Azure Databricks from an available data source.
3. Prepare the data and split it into training and testing sets to train the model. (This guide describes options
for splitting data.)
4. Fit the Spark Collaborative Filtering model to the data.
5. Evaluate the quality of the model using rating and ranking metrics. (This guide provides details about the
metrics you can evaluate your recommender on.)
6. Precompute the top 10 recommendations per user and store them as a cache in Azure Cosmos DB (see the sketch after this list).
7. Deploy an API service to AKS using the Azure Machine Learning APIs to containerize and deploy the API.
8. When the backend service gets a request from a user, call the recommendations API hosted in AKS to get
the top 10 recommendations and display them to the user.
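For step 6, the precomputed lists can be cached in Azure Cosmos DB as one point-readable document per user. The sketch below uses the azure-cosmos Python package; the account endpoint, database, container, and document schema are assumptions, and the reference implementation defines its own schema.

from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient(
    url="https://<account>.documents.azure.com:443/",   # hypothetical account endpoint
    credential="<account-key>",                         # retrieve from Key Vault in practice
)
db = client.create_database_if_not_exists("recommendations")
container = db.create_container_if_not_exists(
    id="top10", partition_key=PartitionKey(path="/userId")
)

# Cache the precomputed top-10 movie IDs for one user (values are illustrative).
container.upsert_item({
    "id": "user-123",
    "userId": "user-123",
    "movies": [318, 858, 50, 527, 1221, 2019, 904, 1193, 2959, 1248],
})

# At request time, the API serves recommendations with a low-latency point read.
top10 = container.read_item(item="user-123", partition_key="user-123")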

Architecture
This architecture consists of the following components:
Azure Databricks. Databricks is a development environment used to prepare input data and train the
recommender model on a Spark cluster. Azure Databricks also provides an interactive workspace to run and
collaborate on notebooks for any data processing or machine learning tasks.
Azure Kubernetes Service (AKS). AKS is used to deploy and operationalize a machine learning model service API
on a Kubernetes cluster. AKS hosts the containerized model, providing scalability that meets your throughput
requirements, identity and access management, and logging and health monitoring.
Azure Cosmos DB. Cosmos DB is a globally distributed database service used to store the top 10 recommended
movies for each user. Azure Cosmos DB is well-suited for this scenario, because it provides low latency (10 ms at
99th percentile) to read the top recommended items for a given user.
Azure Machine Learning. This service is used to track and manage machine learning models, and then package
and deploy these models to a scalable AKS environment.
Microsoft Recommenders. This open-source repository contains utility code and samples to help users get started
in building, evaluating, and operationalizing a recommender system.

Performance considerations
Performance is a primary consideration for real-time recommendations, because recommendations usually fall in
the critical path of the request a user makes on your site.
The combination of AKS and Azure Cosmos DB enables this architecture to provide a good starting point to
provide recommendations for a medium-sized workload with minimal overhead. Under a load test with 200
concurrent users, this architecture provides recommendations at a median latency of about 60 ms and performs
at a throughput of 180 requests per second. The load test was run against the default deployment configuration
(a 3x D3 v2 AKS cluster with 12 vCPUs, 42 GB of memory, and 11,000 Request Units (RUs) per second
provisioned for Azure Cosmos DB).
Azure Cosmos DB is recommended for its turnkey global distribution and usefulness in meeting any database
requirements your app has. For slightly faster latency, consider using Azure Cache for Redis instead of Azure
Cosmos DB to serve lookups. Azure Cache for Redis can improve performance of systems that rely highly on data
in back-end stores.

Scalability considerations
If you don't plan to use Spark, or you have a smaller workload where you don't need distribution, consider using
Data Science Virtual Machine (DSVM) instead of Azure Databricks. DSVM is an Azure virtual machine with deep
learning frameworks and tools for machine learning and data science. As with Azure Databricks, any model you
create in a DSVM can be operationalized as a service on AKS via Azure Machine Learning.
During training, provision a larger fixed-size Spark cluster in Azure Databricks or configure autoscaling. When
autoscaling is enabled, Databricks monitors the load on your cluster and scales up and down as required.
Provision or scale out a larger cluster if you have a large data size and you want to reduce the amount of time it
takes for data preparation or modeling tasks.
Scale the AKS cluster to meet your performance and throughput requirements. Take care to scale up the number
of pods to fully utilize the cluster, and to scale the nodes of the cluster to meet the demand of your service. For
more information on how to scale your cluster to meet the performance and throughput requirements of your
recommender service, see Scaling Azure Container Service Clusters.
To manage Azure Cosmos DB performance, estimate the number of reads required per second, and provision the
number of RUs per second (throughput) needed. Use best practices for partitioning and horizontal scaling.

Cost considerations
The main drivers of cost in this scenario are:
The Azure Databricks cluster size required for training.
The AKS cluster size required to meet your performance requirements.
Azure Cosmos DB RUs provisioned to meet your performance requirements.
Manage the Azure Databricks costs by retraining less frequently and turning off the Spark cluster when not in use.
The AKS and Azure Cosmos DB costs are tied to the throughput and performance required by your site and will
scale up and down depending on the volume of traffic to your site.

Deploy the solution


To deploy this architecture, follow the Azure Databricks instructions in the setup document. Briefly, the
instructions require you to:
1. Create an Azure Databricks workspace.
2. Create a new cluster with the following configuration in Azure Databricks:
Cluster mode: Standard
Databricks Runtime Version: 4.3 (includes Apache Spark 2.3.1, Scala 2.11)
Python Version: 3
Driver Type: Standard_DS3_v2
Worker Type: Standard_DS3_v2 (min and max as required)
Auto Termination: (as required)
Spark Config: (as required)
Environment Variables: (as required)
3. Create a personal access token within the Azure Databricks workspace. See the Azure Databricks
authentication documentation for details.
4. Clone the Microsoft Recommenders repository into an environment where you can execute scripts (for
example, your local computer).
5. Follow the Quick install setup instructions to install the relevant libraries on Azure Databricks.
6. Follow the Quick install setup instructions to prepare Azure Databricks for operationalization.
7. Import the ALS Movie Operationalization notebook into your workspace. After logging into your Azure
Databricks Workspace, do the following:
a. Click Home on the left side of the workspace.
b. Right-click on white space in your home directory. Select Import.
c. Select URL, and paste the following into the text field:
https://github.com/Microsoft/Recommenders/blob/master/examples/05_operationalize/als_movie_o16n.ipynb
d. Click Import.
8. Open the notebook within Azure Databricks and attach the configured cluster.
9. Run the notebook to create the Azure resources required to create a recommendation API that provides
the top-10 movie recommendations for a given user.

Related architectures
We have also built a reference architecture that uses Spark and Azure Databricks to execute scheduled batch-
scoring processes. See that reference architecture to understand a recommended approach for generating new
recommendations routinely.
Conversational chatbot for hotel reservations on
Azure
11/2/2020 • 6 minutes to read • Edit Online

This example scenario is applicable to businesses that need to integrate a conversational chatbot into applications.
In this scenario, a C# chatbot is used for a hotel chain that allows customers to check availability and book
accommodation through a web or mobile application.
Potential uses include providing a way for customers to view hotel availability and book rooms, review a restaurant
take-out menu and place a food order, or search for and order prints of photographs. Traditionally, businesses
would need to hire and train customer service agents to respond to these customer requests, and customers
would have to wait until a representative is available to provide assistance.
By using Azure services such as the Bot Service and Language Understanding or Speech API services, companies
can assist customers and process orders or reservations with automated, scalable bots.

Relevant use cases


Other relevant use cases include:
Viewing a restaurant take-out menu and ordering food
Checking hotel availability and reserving a room
Searching available photos and ordering prints

Architecture

This scenario covers a conversational bot that functions as a concierge for a hotel. The data flows through the
scenario as follows:
1. The customer accesses the chatbot with a mobile or web app.
2. Using Azure Active Directory (Azure AD) B2C (business-to-consumer), the user is authenticated.
3. Interacting with the Bot Service, the user requests information about hotel availability.
4. Cognitive Services processes the natural language request to understand the customer communication.
5. After the user is happy with the results, the bot adds or updates the customer's reservation in a SQL Database.
6. Application Insights gathers runtime telemetry throughout the process to help the DevOps team with bot
performance and usage.
Components
Azure Active Directory (Azure AD) is Microsoft's multitenant cloud-based directory and identity management
service. Azure AD supports a B2C connector allowing you to identify individuals using external IDs such as
Google, Facebook, or a Microsoft Account.
App Service enables you to build and host web applications in the programming language of your choice
without managing infrastructure.
Bot Service provides tools to build, test, deploy, and manage intelligent bots.
Cognitive Services lets you use intelligent algorithms to see, hear, speak, understand, and interpret your user
needs through natural methods of communication.
SQL Database is a fully managed relational cloud database service that provides SQL Server engine
compatibility.
Application Insights is an extensible Application Performance Management (APM) service that lets you monitor
the performance of applications, such as your chatbot.
Alternatives
Microsoft Speech API can be used to change how customers interface with your bot.
QnA Maker can be used to quickly add knowledge to your bot from semi-structured content like an FAQ.
Translator Text is a service that you might consider to easily add multi-lingual support to your bot.

Considerations
Availability
This scenario uses Azure SQL Database for storing customer reservations. SQL Database includes zone redundant
databases, failover groups, and geo-replication. For more information, see Azure SQL Database availability
capabilities.
Scalability
This scenario uses Azure App Service. With App Service, you can automatically scale the number of instances that
run your bot. This functionality lets you keep up with customer demand for your web application and chatbot. For
more information on autoscale, see Autoscaling best practices in the Azure Architecture Center.
For other scalability topics, see the performance efficiency checklist in the Azure Architecture Center.
Security
This scenario uses Azure Active Directory (Azure AD) B2C, a business-to-consumer identity management service,
to authenticate users. With Azure AD B2C, your chatbot doesn't store any sensitive customer account information
or credentials. For more information, see the Azure AD B2C overview.
Information stored in Azure SQL Database is encrypted at rest with transparent data encryption (TDE). SQL
Database also offers Always Encrypted which encrypts data during querying and processing. For more information
on SQL Database security, see Azure SQL Database security and compliance.
For general guidance on designing secure solutions, see the Azure Security Documentation.
Resiliency
This scenario uses Azure SQL Database for storing customer reservations. SQL Database includes zone redundant
databases, failover groups, geo-replication, and automatic backups. These features allow your application to
continue running if there is a maintenance event or outage. For more information, see Azure SQL Database
availability capabilities.
To monitor the health of your application, this scenario uses Application Insights. With Application Insights, you can
generate alerts and respond to performance issues that would impact the customer experience and availability of
the chatbot. For more information, see What is Application Insights?
For other resiliency topics, see Designing reliable Azure applications.
Deploy the scenario
This scenario is divided into three components for you to explore areas that you are most focused on:
Infrastructure components. Use an Azure Resource Manager template to deploy the core infrastructure
components of an App Service, Web App, Application Insights, Storage account, and SQL Server and database.
Web App Chatbot. Use the Azure CLI to deploy a bot with the Bot Service and Language Understanding and
Intelligent Services (LUIS) app.
Sample C# chatbot application. Use Visual Studio to review the sample hotel reservation C# application code
and deploy to a bot in Azure.
Prerequisites
You must have an existing Azure account. If you don't have an Azure subscription, create a free account before you
begin.
Walk-through
To deploy the infrastructure components with a Resource Manager template, perform the following steps.
1. Click the link below to deploy the solution.

2. Wait for the template deployment to open in the Azure portal, then complete the following steps:
Choose to Create new resource group, then provide a name such as
myCommerceChatBotInfrastructure in the text box.
Select a region from the Location drop-down box.
Provide a username and secure password for the SQL Server administrator account.
Review the terms and conditions, then check I agree to the terms and conditions stated above .
Select the Purchase button.
It takes a few minutes for the deployment to complete.
Deploy Web App chatbot
To create the chatbot, use the Azure CLI. The following example installs the CLI extension for Bot Service, creates a
resource group, then deploys a bot that uses Application Insights. When prompted, authenticate your Microsoft
account and allow the bot to register itself with the Bot Service and Language Understanding and Intelligent
Services (LUIS) app.

# Install the Azure CLI extension for the Bot Service


az extension add --name botservice --yes

# Create a resource group


az group create --name myCommerceChatbot --location eastus

# Create a Web App Chatbot that uses Application Insights


az bot create \
--resource-group myCommerceChatbot \
--name commerceChatbot \
--location eastus \
--kind webapp \
--sku S1 \
--insights eastus

Deploy chatbot C# application code


A sample C# application is available on GitHub:
Commerce Bot C# sample
The sample application includes the Azure Active Directory authentication components and integration with the
Language Understanding and Intelligent Services (LUIS) component of Cognitive Services. The application requires
Visual Studio to build and deploy the scenario. Additional information on configuring Azure AD B2C and the LUIS
app can be found in the GitHub repo documentation.

Pricing
To explore the cost of running this scenario, all of the services are pre-configured in the cost calculator. To see how
the pricing would change for your particular use case, change the appropriate variables to match your expected
traffic.
We have provided three sample cost profiles based on the number of messages you expect your chatbot to
process:
Small: this pricing example correlates to processing < 10,000 messages per month.
Medium: this pricing example correlates to processing < 500,000 messages per month.
Large: this pricing example correlates to processing < 10 million messages per month.

Related resources
For a set of guided tutorials for the Azure Bot Service, see the tutorial section of the documentation.
Commerce Chatbot
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Together, the Azure Bot Service and Language Understanding service enable developers to create conversational
interfaces for various scenarios like banking, travel, and entertainment. For example, a hotel's concierge can use a
bot to enhance traditional e-mail and phone call interactions by validating a customer via Azure Active Directory
and using Cognitive Services to better contextually process customer requests using text and voice. The Speech
recognition service can be added to support voice commands.

Architecture

(Architecture diagram: the numbered steps correspond to the data flow listed below, starting with authentication through Azure Active Directory B2C.)
Download an SVG of this architecture.

Data Flow
1. Customer uses your mobile app
2. Using Azure AD B2C, the user authenticates
3. Using the custom Application Bot, user requests information
4. Cognitive Services helps process the natural language request
5. Response is reviewed by customer who can refine the question using natural conversation
6. Once the user is happy with the results, the Application Bot updates the customer's reservation
7. Application insights gathers runtime telemetry to help development with Bot performance and usage
Disaster recovery and high availability for chatbots in
Azure
11/2/2020 • 2 minutes to read • Edit Online

To set up disaster recovery for an enterprise-grade conversational bot (chatbot), first review the service level
agreement (SLA) that covers the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) for the chatbot.
Implement the disaster recovery patterns in this article to build highly available and disaster resistant chatbot
solutions to meet the SLA.
For a description of the core components of a typical enterprise-grade chatbot solution in Azure, see Enterprise-
grade conversational bot.

Architecture

This diagram shows deployment of a chatbot solution in active-passive failover mode in two different Azure
regions for disaster recovery.
Download a Visio file of this architecture.

Components
Disaster recovery solutions vary depending on your SLA and the Azure services you use.
Non-regional services
Azure Active Directory (Azure AD), Azure Traffic Manager, Azure Front Door, and Azure Bot Service registration are
non-regional services that are always available in Azure geographies, regardless of specific region availability or
outage.
Regional services with automatic failover
Although you provision Azure Key Vault and Language Understanding Intelligent Service (LUIS) in a specific Azure
region, these services provide automatic failover to a different Azure region. For more information, see:
Azure Key Vault availability and redundancy
To set up high availability for Azure Cosmos DB, see High availability with Azure Cosmos DB.
LUIS regions and endpoints
Regional services without automatic failover
These services may need your attention to ensure high availability and disaster recovery.
Keep all deployment and source code artifacts in a source code repository, and use Azure paired regions to deploy
them in parallel. You can automate all the following deployment tasks and save them as part of your deployment
artifacts. When you deploy these services in the two paired regions, configure your bot API's environment variables
to match the specific services in each Azure region.
Keep the primary and secondary Azure search indexes in sync. For a sample app to back up and restore Azure
search indexes, see QnAMakerBackupRestore on GitHub.
Back up Application Insights by using continuous export. Although you can't currently import the exported
telemetry to another Application Insights resource, you can export into a storage account for further analysis.
To set up high availability and disaster recovery for Azure Storage accounts, see Disaster recovery and storage
account failover.
Deploy the bot API and QnA Maker into an Azure App Service Plan in both regions.
Once you set up the primary and secondary stacks, use Azure Traffic Manager or Azure Front Door to configure
the endpoints and set up a routing method for both QnA Maker and the bot API.
Create a Secure Sockets Layer (SSL) certificate for your traffic manager endpoint, and bind the SSL certificate in
your App Services.
Finally, use the Traffic Manager or Azure Front Door endpoint of QnA Maker in your bot, and use the traffic
manager endpoint of the bot API as the bot endpoint in Azure Bot Service registration.

Next steps
Enterprise-grade conversational bot
Overview of the resiliency pillar
Azure Key Vault availability and redundancy
Business continuity and disaster recovery (BCDR): Azure Paired Regions
Authoring and publishing regions and the associated keys
High availability with Azure Cosmos DB
Disaster recovery and account failover
Traffic Manager
Azure Front Door
Enterprise-grade conversational bot
11/2/2020 • 15 minutes to read • Edit Online

This reference architecture describes how to build an enterprise-grade conversational bot (chatbot) using the
Azure Bot Framework. Each bot is different, but there are some common patterns, workflows, and technologies
to be aware of. Especially for a bot to serve enterprise workloads, there are many design considerations beyond
just the core functionality. This article covers the most essential design aspects, and introduces the tools needed
to build a robust, secure, and actively learning bot.

Download a Visio file of this architecture.


The best practice utility samples used in this architecture are fully open-sourced and available on GitHub.

Architecture
The architecture shown here uses the following Azure services. Your own bot may not use all of these services, or
may incorporate additional services.
Bot logic and user experience
Bot Framework Service (BFS). This service connects your bot to a communication app such as Cortana,
Facebook Messenger, or Slack. It facilitates communication between your bot and the user.
Azure App Service. The bot application logic is hosted in Azure App Service.
Bot cognition and intelligence
Language Understanding (LUIS). Part of Azure Cognitive Services, LUIS enables your bot to understand
natural language by identifying user intents and entities.
Azure Search. Search is a managed service that provides a quick searchable document index.
QnA Maker. QnA Maker is a cloud-based API service that creates a conversational, question-and-answer
layer over your data. Typically, it's loaded with semi-structured content such as FAQs. Use it to create a
knowledge base for answering natural-language questions.
Web app. If your bot needs AI solutions not provided by an existing service, you can implement your own
custom AI and host it as a web app. This provides a web endpoint for your bot to call.
Data ingestion
The bot will rely on raw data that must be ingested and prepared. Consider any of the following options to
orchestrate this process:
Azure Data Factory. Data Factory orchestrates and automates data movement and data transformation.
Logic Apps. Logic Apps is a serverless platform for building workflows that integrate applications, data, and
services. Logic Apps provides data connectors for many applications, including Office 365.
Azure Functions. You can use Azure Functions to write custom serverless code that is invoked by a trigger
— for example, whenever a document is added to blob storage or Cosmos DB.
Logging and monitoring
Application Insights. Use Application Insights to log the bot's application metrics for monitoring, diagnostic,
and analytical purposes.
Azure Blob Storage. Blob storage is optimized for storing massive amounts of unstructured data.
Cosmos DB. Cosmos DB is well-suited for storing semi-structured log data such as conversations.
Power BI. Use Power BI to create monitoring dashboards for your bot.
Security and governance
Azure Active Directory (Azure AD). Users will authenticate through an identity provider such as Azure AD.
The Bot Service handles the authentication flow and OAuth token management. See Add authentication to
your bot via Azure Bot Service.
Azure Key Vault. Store credentials and other secrets using Key Vault.
Quality assurance and enhancements
Azure DevOps. Provides many services for app management, including source control, building, testing,
deployment, and project tracking.
VS Code. A lightweight code editor for app development. You can use any other IDE with similar features.

Design considerations
At a high level, a conversational bot can be divided into the bot functionality (the "brain") and a set of
surrounding requirements (the "body"). The brain comprises the domain-aware components, such as the bot
logic and ML capabilities. Other components are domain agnostic and address non-functional requirements such
as CI/CD, quality assurance, and security.

Before getting into the specifics of this architecture, let's start with the data flow through each subcomponent of
the design. The data flow includes user-initiated and system-initiated data flows.
User message flow
Authentication. Users start by authenticating themselves using whatever mechanism is provided by their
channel of communication with the bot. The bot framework supports many communication channels, including
Cortana, Microsoft Teams, Facebook Messenger, Kik, and Slack. For a list of channels, see Connect a bot to
channels. When you create a bot with Azure Bot Service, the Web Chat channel is automatically configured. This
channel allows users to interact with your bot directly in a web page. You can also connect the bot to a custom
app by using the Direct Line channel. The user's identity is used to provide role-based access control, as well as to
serve personalized content.
User message. Once authenticated, the user sends a message to the bot. The bot reads the message and routes
it to a natural language understanding service such as LUIS. This step gets the intents (what the user wants to
do) and entities (what things the user is interested in). The bot then builds a query that it passes to a service that
serves information, such as Azure Search for document retrieval, QnA Maker for FAQs, or a custom knowledge
base. The bot uses these results to construct a response. To give the best result for a given query, the bot might
make several back-and-forth calls to these remote services.
Response. At this point, the bot has determined the best response and sends it to the user. If the confidence
score of the best-matched answer is low, the response might be a disambiguation question or an
acknowledgment that the bot could not reply adequately.
Logging. When a user request is received or a response is sent, all conversation actions should be logged to a
logging store, along with performance metrics and general errors from external services. These logs will be
useful later when diagnosing issues and improving the system.
Feedback. Another good practice is to collect user feedback and satisfaction scores. As a follow-up to the bot's
final response, the bot should ask the user to rate their satisfaction with the reply. Feedback can help you to solve
the cold start problem of natural language understanding, and continually improve the accuracy of responses.
System Data Flow
ETL. The bot relies on information and knowledge extracted from the raw data by an ETL process in the backend.
This data might be structured (SQL database), semi-structured (CRM system, FAQs), or unstructured (Word
documents, PDFs, web logs). An ETL subsystem extracts the data on a fixed schedule. The content is transformed
and enriched, then loaded into an intermediary data store, such as Cosmos DB or Azure Blob Storage.
Data in the intermediary store is then indexed into Azure Search for document retrieval, loaded into QnA Maker
to create question and answer pairs, or loaded into a custom web app for unstructured text processing. The data
is also used to train a LUIS model for intent and entity extraction.
Quality assurance. The conversation logs are used to diagnose and fix bugs, provide insight into how the bot is
being used, and track overall performance. Feedback data is useful for retraining the AI models to improve bot
performance.

Building a bot
Before you even write a single line of code, it's important to write a functional specification so the development
team has a clear idea of what the bot is expected to do. The specification should include a reasonably
comprehensive list of user inputs and expected bot responses in various knowledge domains. This living
document will be an invaluable guide for developing and testing your bot.
Ingest data
Next, identify the data sources that will enable the bot to interact intelligently with users. As mentioned earlier,
these data sources could contain structured, semi-structured, or unstructured data sets. When you're getting
started, a good approach is to make a one-off copy of the data to a central store, such as Cosmos DB or Azure
Storage. As you progress, you should create an automated data ingestion pipeline to keep this data current.
Options for an automated ingestion pipeline include Data Factory, Functions, and Logic Apps. Depending on the
data stores and the schemas, you might use a combination of these approaches.
As you get started, it's reasonable to use the Azure portal to manually create Azure resources. Later on, you
should put more thought into automating the deployment of these resources.
Core bot logic and UX
Once you have a specification and some data, it's time to start turning your bot into reality. Let's focus on the
core bot logic. This is the code that handles the conversation with the user, including the routing logic,
disambiguation logic, and logging. Start by familiarizing yourself with the Bot Framework, including:
Basic concepts and terminology used in the framework, especially conversations, turns, and activities.
The Bot Connector service, which handles the networking between the bot and your channels.
How conversation state is maintained, either in memory or better yet in a store such as Azure Blob Storage or
Azure Cosmos DB.
Middleware, and how it can be used to hook up your bot with external services, such as Cognitive Services.
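To make these concepts concrete, here is a minimal sketch of the core bot logic using the Bot Framework SDK for
Python (the SDK is also available for C# and JavaScript). It echoes each message back and greets new members; the
class name and the reply text are arbitrary, and a real bot would route messages to the services described later.

# Minimal bot logic sketch using the Bot Framework SDK for Python (botbuilder-core).
from typing import List

from botbuilder.core import ActivityHandler, MessageFactory, TurnContext
from botbuilder.schema import ChannelAccount


class EchoBot(ActivityHandler):
    async def on_message_activity(self, turn_context: TurnContext):
        # A real bot would route the text to LUIS, QnA Maker, or Azure Search here
        # (see "Add smarts to your bot" below).
        await turn_context.send_activity(
            MessageFactory.text(f"You said: {turn_context.activity.text}")
        )

    async def on_members_added_activity(
        self, members_added: List[ChannelAccount], turn_context: TurnContext
    ):
        # Greet each user (but not the bot itself) when they join the conversation.
        for member in members_added:
            if member.id != turn_context.activity.recipient.id:
                await turn_context.send_activity("Hello! I'm a demo bot.")
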
For a rich user experience, there are many options.
You can use cards to include buttons, images, carousels, and menus.
A bot can support speech.
You can even embed your bot in an app or website and use the capabilities of the app hosting it.
To get started, you can build your bot online using the Azure Bot Service, selecting from the available C# and
Node.js templates. As your bot gets more sophisticated, however, you will need to create your bot locally then
deploy it to the web. Choose an IDE, such as Visual Studio or Visual Studio Code, and a programming language.
SDKs are available for the following languages:
C#
JavaScript
Java (preview)
Python (preview)
As a starting point, you can download the source code for the bot you created using the Azure Bot Service. You
can also find sample code, from simple echo bots to more sophisticated bots that integrate with various AI
services.
Add smarts to your bot
For a simple bot with a well-defined list of commands, you might be able to use a rules-based approach to parse
the user input via regex. This has the advantage of being deterministic and understandable. However, when your
bot needs to understand the intents and entities of a more natural-language message, there are AI services that
can help.
LUIS is specifically designed to understand user intents and entities. You train it with a moderately sized
collection of relevant user input and desired responses, and it returns the intents and entities for a user's
given message.
Azure Search can work alongside LUIS. Using Search, you create searchable indexes over all relevant data.
The bot queries these indexes for the entities extracted by LUIS. Azure Search also supports synonyms,
which can widen the net of correct word mappings.
QnA Maker is another service that is designed to return answers for given questions. It's typically trained
over semi-structured data such as FAQs.
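The sketch below shows one way to combine these services: call the LUIS prediction endpoint to get the top intent,
then answer the question from a QnA Maker knowledge base when the intent matches. The endpoint URLs, IDs, keys,
and the "FAQ" intent name are placeholders; the request and response shapes follow the services' documented REST
APIs, but your own app's intents and knowledge base will differ.

# Sketch: route a user utterance to LUIS for intent detection, then answer it
# with QnA Maker. All endpoints, IDs, keys, and intent names are placeholders.
import requests

LUIS_ENDPOINT = "https://<your-luis-resource>.cognitiveservices.azure.com"
LUIS_APP_ID = "<luis-app-id>"
LUIS_KEY = "<luis-prediction-key>"
QNA_HOST = "https://<your-qna-resource>.azurewebsites.net"
QNA_KB_ID = "<knowledge-base-id>"
QNA_ENDPOINT_KEY = "<qna-endpoint-key>"


def get_top_intent(utterance: str) -> str:
    url = f"{LUIS_ENDPOINT}/luis/prediction/v3.0/apps/{LUIS_APP_ID}/slots/production/predict"
    params = {"subscription-key": LUIS_KEY, "query": utterance}
    prediction = requests.get(url, params=params).json()["prediction"]
    return prediction["topIntent"]


def get_qna_answer(utterance: str) -> str:
    url = f"{QNA_HOST}/qnamaker/knowledgebases/{QNA_KB_ID}/generateAnswer"
    headers = {"Authorization": f"EndpointKey {QNA_ENDPOINT_KEY}"}
    body = {"question": utterance, "top": 1}
    answers = requests.post(url, headers=headers, json=body).json()["answers"]
    return answers[0]["answer"] if answers else "Sorry, I don't know the answer to that."


utterance = "How do I reset my password?"
intent = get_top_intent(utterance)
reply = get_qna_answer(utterance) if intent == "FAQ" else f"Intent '{intent}' is handled elsewhere."
print(reply)
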
Your bot can use other AI services to further enrich the user experience. The Cognitive Services suite of pre-built
AI services (which includes LUIS and QnA Maker) has services for vision, speech, language, search, and location.
You can quickly add functionality such as language translation, spell checking, sentiment analysis, OCR, location
awareness, and content moderation. These services can be wired up as middleware modules in your bot to
interact more naturally and intelligently with the user.
Another option is to integrate your own custom AI service. This approach is more complex, but gives you
complete flexibility in terms of the machine learning algorithm, training, and model. For example, you could
implement your own topic modeling and use an algorithm such as LDA (latent Dirichlet allocation) to find similar or
relevant documents. A good approach is to expose your custom AI solution as a web service endpoint, and call the
endpoint from the core bot logic. The web service could be hosted in App Service or in a cluster of VMs. Azure
Machine Learning provides a number of services and libraries to assist you in training and deploying your models.

Quality assurance and enhancement


Logging. Log user conversations with the bot, including the underlying performance metrics and any errors.
These logs will prove invaluable for debugging issues, understanding user interactions, and improving the
system. Different data stores might be appropriate for different types of logs. For example, consider Application
Insights for web logs, Cosmos DB for conversations, and Azure Storage for large payloads. See Write directly to
Azure Storage.
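As one hedged example of the first option, the opencensus-ext-azure package can forward Python log records to
Application Insights. The connection string and the logged field names are placeholders, and the Botbuilder Utils
described below provide a more complete, middleware-based approach.

# Sketch: send conversation log records to Application Insights from Python
# using the opencensus-ext-azure exporter. The connection string is a placeholder.
import logging

from opencensus.ext.azure.log_exporter import AzureLogHandler

logger = logging.getLogger("bot.conversation")
logger.setLevel(logging.INFO)
logger.addHandler(AzureLogHandler(
    connection_string="InstrumentationKey=<app-insights-instrumentation-key>"
))


def log_turn(conversation_id: str, user_text: str, bot_reply: str, latency_ms: float) -> None:
    # custom_dimensions becomes queryable properties on the trace in Application Insights.
    logger.info(
        "bot turn completed",
        extra={"custom_dimensions": {
            "conversationId": conversation_id,
            "userText": user_text,
            "botReply": bot_reply,
            "latencyMs": latency_ms,
        }},
    )


log_turn("conv-123", "How do I reset my password?", "Here is how...", 184.0)
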
Feedback. It's also important to understand how satisfied users are with their bot interactions. If you have a
record of user feedback, you can use this data to focus your efforts on improving certain interactions and
retraining the AI models for improved performance. Use the feedback to retrain the models, such as LUIS, in your
system.
Testing. Testing a bot involves unit tests, integration tests, regression tests, and functional tests. For testing, we
recommend recording real HTTP responses from external services, such as Azure Search or QnA Maker, so they
can be played back during unit testing without needing to make real network calls to external services.

NOTE
To jump-start your development in these areas, look at the Botbuilder Utils for JavaScript. This repo contains sample utility
code for bots built with Microsoft Bot Framework v4 and running Node.js. It includes the following packages:
Cosmos DB Logging Store. Shows how to store and query bot logs in Cosmos DB.
Application Insights Logging Store. Shows how to store and query bot logs in Application Insights.
Feedback Collection Middleware. Sample middleware that provides a bot user feedback-request mechanism.
Http Test Recorder. Records HTTP traffic from services external to the bot. It comes pre-built with support for LUIS,
Azure Search, and QnAMaker, but extensions are available to support any service. This helps you automate bot testing.
These packages are provided as utility sample code, and come with no guarantee of support or updates.

Availability considerations
As you roll out new features or bug fixes to your bot, it's best to use multiple deployment environments, such as
staging and production. Using App Service deployment slots allows you to do this with zero downtime.
You can test your latest upgrades in the staging environment before swapping them to the production
environment. In terms of handling load, App Service is designed to scale up or out manually or automatically.
Because your bot is hosted in Microsoft's global datacenter infrastructure, the App Service SLA promises high
availability.

Security considerations
As with any other application, the bot might handle sensitive data. Therefore, restrict who can sign
in and use the bot. Also limit which data can be accessed, based on the user's identity or role. Use Azure AD for
identity and access control and Key Vault to manage keys and secrets.
DevOps considerations
Monitoring and reporting
Once your bot is running in production, you will need a DevOps team to keep it that way. Continually monitor
the system to ensure the bot operates at peak performance. Use the logs sent to Application Insights or Cosmos
DB to create monitoring dashboards, either using Application Insights itself, Power BI, or a custom web app
dashboard. Send alerts to the DevOps team if critical errors occur or performance falls below an acceptable
threshold.
Automated resource deployment
The bot itself is only part of a larger system that provides it with the latest data and ensures its proper operation.
All of these other Azure resources — data orchestration services such as Data Factory, storage services such as
Cosmos DB, and so forth — must be deployed. Azure Resource Manager provides a consistent management
layer that you can access through the Azure portal, PowerShell, or the Azure CLI. For speed and consistency, it's
best to automate your deployment using one of these approaches.
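For illustration, the following sketch deploys an ARM template from Python with the azure-identity and
azure-mgmt-resource packages; the subscription ID, resource group, template file, and parameter names are
placeholders, and the same result can be achieved with the Azure CLI, PowerShell, or a CI/CD pipeline.

# Sketch: deploy an ARM template with the Azure SDK for Python.
# Subscription ID, resource group, template path, and parameters are placeholders.
import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "bot-prod-rg"

client = ResourceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
client.resource_groups.create_or_update(RESOURCE_GROUP, {"location": "eastus"})

with open("bot-resources.template.json") as f:
    template = json.load(f)  # ARM template describing App Service, Cosmos DB, Data Factory, and so on

poller = client.deployments.begin_create_or_update(
    RESOURCE_GROUP,
    "bot-resources-deployment",
    {
        "properties": {
            "mode": "Incremental",
            "template": template,
            # Parameter names depend entirely on your template.
            "parameters": {"appServicePlanSku": {"value": "S1"}},
        }
    },
)
poller.wait()  # Block until the deployment completes.
print(poller.result().properties.provisioning_state)
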
Continuous bot deployment
You can deploy the bot logic directly from your IDE or from a command line, such as the Azure CLI. As your bot
matures, however, it's best to use a continuous deployment process with a CI/CD solution such as Azure DevOps,
as described in the article Set up continuous deployment. This is a good way to ease the friction of testing new
features and fixes in your bot in a near-production environment. It's also a good idea to have multiple
deployment environments, typically at least staging and production. Azure DevOps supports this approach.

Cost considerations
Use the Azure pricing calculator to estimate costs. Here are some other considerations.
Bot application
In this architecture, the main cost driver is the Azure App Service in which the bot application logic is hosted.
Choose an App Service plan tier that best suits your needs. Here are some recommendations:
Use the Free and Shared (preview) tiers for testing purposes only, because the shared resources cannot scale out.
Run your production workload on the Basic, Standard, and Premium tiers, because the app runs on dedicated
virtual machine instances and has allocated resources that can scale out. App Service plans are billed on a per-
second basis.
You are charged for the instances in the App Service plan even when the app is stopped. Delete plans that you
don't intend to use long term, such as test deployments.
For more information, see How much does my App Service plan cost?.
Data ingestion
Azure Data Factory
In this architecture, Data Factory automates the data ingestion pipeline. Explore a range of data integration
capabilities to fit your budget needs, from managed SQL Server Integration Services for seamless
migration of SQL Server projects to the cloud (a cost-effective option), to large-scale, serverless data
pipelines for integrating data of all shapes and sizes.
For an example, see Azure Data Factory - example cost analysis.
Azure Functions
In this reference architecture, Azure Functions is billed using the Consumption plan. You are charged
based on per-second resource consumption and each time an event triggers the execution of the function.
Processing several events in a single execution, or in batches, can reduce cost.
Azure scales the infrastructure required to run functions as needed. When the workload is low, the
infrastructure scales down to zero with no associated cost. Whenever the workload grows, Azure
uses enough capacity to serve all the demand. Because you pay only for actual use, you can manage the exact
cost of each component.
Logic Apps
Logic Apps has a pay-as-you-go pricing model. Triggers, actions, and connector executions are metered each time a
logic app runs. All successful and unsuccessful actions, including triggers, are counted as executions.
For instance, suppose your logic app processes 1,000 messages a day from Azure Service Bus. A workflow of five
actions costs less than $6 per month. For more information, see Logic Apps pricing.
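As a rough back-of-the-envelope check, the cost of such a workflow can be estimated as follows. The per-execution
price used here is an illustrative placeholder, not a published rate; take the current prices from the Logic Apps
pricing page.

# Rough Logic Apps (Consumption) cost estimate. The per-execution price below is an
# illustrative placeholder; use the current rates from the Logic Apps pricing page.
messages_per_day = 1000
actions_per_run = 5              # actions in the workflow
trigger_executions_per_run = 1   # the trigger that fires for each message
price_per_execution = 0.000025   # assumed price per action/trigger execution (USD)

executions_per_month = messages_per_day * 30 * (actions_per_run + trigger_executions_per_run)
estimated_monthly_cost = executions_per_month * price_per_execution
print(f"{executions_per_month} executions/month -> about ${estimated_monthly_cost:.2f}")
# 180,000 executions/month at the assumed rate is about $4.50, consistent with "less than $6".
# Standard connector executions are priced higher, so the real figure depends on the connectors used.
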
For other cost considerations, see the Cost section in Microsoft Azure Well-Architected Framework.

Next Steps
Review the Virtual Assistant template to quickly get started building conversational bots.
Enterprise Productivity Chatbot

Solution Idea

Azure Bot Service can be easily combined with Language Understanding to build powerful enterprise productivity
bots, allowing organizations to streamline common work activities by integrating external systems, such as
Microsoft 365 calendar, customer cases stored in Dynamics CRM and much more.

Architecture

Download an SVG of this architecture.

Data Flow
1. An employee accesses the Enterprise Productivity Bot.
2. Azure Active Directory validates the employee's identity.
3. The bot can query the employee's Microsoft 365 calendar via the Microsoft Graph API.
4. Using data gathered from the calendar, the bot accesses case information in Dynamics CRM.
5. Information is returned to the employee, who can filter down the data without leaving the bot.
6. Application Insights gathers runtime telemetry to help development understand bot performance and usage.
FAQ Chatbot with data champion model

Solution Idea

The QnA Maker tool makes it easy for content owners to maintain their knowledge base of questions and
answers. Combined with Bot Service and Language Understanding, it becomes simple to set up an FAQ chatbot
that responds from different knowledge bases depending on the intent of the query.

Architecture

Download an SVG of this architecture.

Data Flow
1. An employee accesses the FAQ bot.
2. Azure Active Directory validates the employee's identity.
3. The query is sent to a LUIS model to determine the intent of the query.
4. Based on the intent, the query is redirected to the appropriate knowledge base.
5. QnA Maker returns the best match for the incoming query.
6. The result is shown to the employee.
7. Data champions manage and update their QnA Maker knowledge base based on the feedback from user traffic.
Content Research

This architecture describes how knowledge mining can be used for content research.
When organizations task employees with reviewing and researching technical data, it can be tedious to read page after
page of dense text. Knowledge mining helps employees quickly review these dense materials. In industries where
bidding competition is fierce, or when the diagnosis of a problem must be quick or in near real time, companies
can use knowledge mining to avoid costly mistakes.

Data Flow
There are three steps: ingest, enrich, and explore. First, the unstructured and structured data is ingested. The data
is then enriched with AI to extract information. Finally, the newly structured data is explored through search,
existing business applications, or analytics solutions.
1. The user can ingest different types of technical content, like product manuals, user guides, engineering standards
documents, patent records, medical journals, and pharmaceutical filings.
2. This content is enriched by using optical character recognition, key phrase extraction, entity recognition,
language translation, customized models to extract industry-specific terms such as product names or
engineering standards, customized models to flag potential risks or other essential information, and customized
models for HIPAA compliance.
3. Finally, the user can integrate the search index into a searchable directory or an existing business application.
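To illustrate the last step, this hedged sketch queries the enriched index with the azure-search-documents Python
package so results can be surfaced in a directory or business application. The endpoint, key, index name, query
text, and field names are placeholders that depend on your own skillset output.

# Sketch: query the enriched search index from a business application.
# Endpoint, key, index, query, and field names are placeholders for your own skillset output.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="technical-content-index",
    credential=AzureKeyCredential("<query-key>"),
)

results = client.search(
    search_text="thermal tolerance of copper alloys",
    select=["title", "summary", "keyPhrases"],
    top=5,
)
for doc in results:
    # Each result is a dictionary whose keys match the index fields.
    print(doc["title"], "-", ", ".join(doc.get("keyPhrases", [])))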

Components
Key technologies used to implement tools for technical content review and research
Azure Cognitive Search
Microsoft Text Analytics API
Microsoft Translator Text API
Microsoft Form Recognizer
Web API custom skill interface

Next steps
Using the knowledge mining solution accelerator to build an initial knowledge mining prototype with Azure
Cognitive Search.
Building custom skills with Microsoft's Custom Web API
Contract Management

Many companies create products for multiple sectors, so business opportunities with different vendors and
buyers increase exponentially. Knowledge mining can help organizations scour thousands of pages of sources
to create an accurate bid. Minor details in the bidding process can make the difference between a healthy profit and a
lost opportunity on a project.

Data Flow
There are three steps: ingest, enrich, and explore.
First, the unstructured and structured data is ingested. The data is then enriched with AI to extract information.
Finally, the newly structured data is explored through search, existing business applications, or analytics
solutions.
1. The user can ingest different types of content, like user guides, forms, product manuals, product pricing
proposals, cost sheets, and project reports.
2. This content is enriched by using key phrase extraction, optical character recognition, entity recognition, and
customized models to flag potential risks or essential information.
3. Finally, the user can integrate the search index into the portal to expand the knowledge base as users share
more information.

Components
Key technologies used to implement tools for contract management
Azure Cognitive Search
Microsoft Text Analytics API
Microsoft Translator Text API
Microsoft Form Recognizer
Web API custom skill interface
Next steps
Using the knowledge mining solution accelerator to build an initial knowledge mining prototype with Azure
Cognitive Search.
Building custom skills with Microsoft's Custom Web API
Customer Churn Prediction

Solution Idea

Customer Churn Prediction uses Cortana Intelligence Suite components to predict churn probability and helps find
patterns in existing data associated with the predicted churn rate.

Architecture

Download an SVG of this architecture.

Description
For more details on how this solution is built, visit the solution guide in GitHub.
Keeping existing customers is five times cheaper than acquiring new ones. For this reason, marketing
executives often find themselves trying to estimate the likelihood of customer churn and finding the necessary
actions to minimize the churn rate.
Customer Churn Prediction uses Azure Machine Learning to predict churn probability and helps find patterns in
existing data associated with the predicted churn rate. This information empowers businesses with actionable
intelligence to improve customer retention and profit margins.
The objective of this guide is to demonstrate predictive data pipelines for retailers to predict customer churn.
Retailers can use these predictions to prevent customer churn by using their domain knowledge and proper
marketing strategies to address at-risk customers. The guide also shows how customer churn models can be
retrained to leverage additional data as it becomes available.

What's Under the Hood


The end-to-end solution is implemented in the cloud, using Microsoft Azure. The solution is composed of several
Azure components, including data ingest, data storage, data movement, advanced analytics and visualization. The
advanced analytics are implemented in Azure Machine Learning, where you can use the Python or R language to build
data science models (or reuse existing in-house or third-party libraries). With data ingest, the solution can make
predictions based on data that is transferred to Azure from an on-premises environment.

Solution Dashboard
The snapshot below shows an example Power BI dashboard that gives insights into the predicted churn rates
across the customer base.
Customer Feedback and Analytics

For many companies, customer support is costly and inefficient. Knowledge mining can help customer support
teams quickly find the right answer for a customer inquiry or assess customer sentiment at scale.
Every company is looking to enhance the customer experience. Knowledge mining can aggregate and analyze data
to discover trends about what customers are saying, and use that information to improve products and services.

Data Flow
There are three steps: ingest, enrich, and explore.
First, the unstructured and structured data is ingested. The data is then enriched with AI to extract information.
Finally, the newly structured data is explored through search, existing business applications, or analytics
solutions.
1. The user can ingest different types of content, like customer support tickets, chat logs, call transcriptions,
customer emails, customer payment history, product reviews, social media feeds, online comments, feedback
forms, and surveys.
2. This content is enriched by using key phrase extraction, sentiment analysis, language translation, bot services,
and custom models to focus on specific products or company policies.
3. Finally, the user compiles enriched documents in the knowledge store and projects them into tabular or object
stores, then surfaces trends in an analytics dashboard, such as frequent issues and popular products, or integrates
the search index into a customer service support application.

Components
Key technologies used to implement tools for customer feedback analysis
Azure Cognitive Search
Microsoft Text Analytics API
Microsoft Translator Text API
Web API custom skill interface

Next steps
Using the knowledge mining solution accelerator to build an initial knowledge mining prototype with Azure
Cognitive Search.
Building custom skills with Microsoft's Custom Web API
Defect prevention with predictive maintenance

Solution Idea

Learn how to use Azure Machine Learning to predict failures before they happen with real-time assembly line data.
This solution is built on the Azure managed services: Azure Stream Analytics, Event Hubs, Azure Machine Learning,
Azure Synapse Analytics and Power BI. These services run in a high-availability environment, patched and
supported, allowing you to focus on your solution instead of the environment they run in.

Architecture

Download an SVG of this architecture.

Components
Azure Stream Analytics: Stream Analytics provides near real-time analytics on the input stream from Azure
Event Hubs. Input data is filtered and passed to a Machine Learning endpoint, and the results are then sent to the
Power BI dashboard.
Event Hubs ingests raw assembly-line data and passes it on to Stream Analytics.
Azure Machine Learning: Machine Learning predicts potential failures based on real-time assembly-line data
from Stream Analytics.
Azure Synapse Analytics: Synapse Analytics stores assembly-line data along with failure predictions.
Power BI visualizes real-time assembly-line data from Stream Analytics and the predicted failures and alerts
from Synapse Analytics.
Next steps
Learn more about Stream Analytics
Learn more about Event Hubs
Learn more about Azure Machine Learning
Learn more about Synapse Analytics
Learn more about Power BI
Distributed training of deep learning models on
Azure

This reference architecture shows how to conduct distributed training of deep learning models across clusters of
GPU-enabled VMs. The scenario is image classification, but the solution can be generalized for other deep learning
scenarios such as segmentation and object detection.
A reference implementation for this architecture is available on GitHub.

Scenario: Classifying images is a widely applied technique in computer vision, often tackled by training a
convolutional neural network (CNN). For particularly large models with large datasets, the training process can
take weeks or months on a single GPU. In some situations, the models are so large that it's not possible to fit
reasonable batch sizes onto the GPU. Using distributed training in these situations can shorten the training time.
In this specific scenario, a ResNet50 CNN model is trained using Horovod on the ImageNet dataset and on
synthetic data. The reference implementation shows how to accomplish this task using TensorFlow.
There are several ways to train a deep learning model in a distributed fashion, including data-parallel and model-
parallel approaches based on synchronous or asynchronous updates. Currently the most common scenario is data
parallel with synchronous updates. This approach is the easiest to implement and is sufficient for most use cases.
In data-parallel distributed training with synchronous updates, the model is replicated across n hardware devices.
A mini-batch of training samples is divided into n micro-batches. Each device performs the forward and backward
passes for a micro-batch. When a device finishes the process, it shares the updates with the other devices. These
values are used to calculate the updated weights of the entire mini-batch, and the weights are synchronized across
the models. This scenario is covered in the GitHub repository.
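The following abbreviated sketch shows the typical Horovod pattern for data-parallel synchronous training with
TensorFlow/Keras: each process is pinned to one GPU, the learning rate is scaled by the number of workers, and a
distributed optimizer averages gradients across processes. It is illustrative only; the full training script lives in
the reference implementation on GitHub, and the model, learning rate, and batch settings here are placeholders.

# Abbreviated data-parallel training sketch with Horovod and TensorFlow/Keras.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each process to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.applications.ResNet50(weights=None, classes=1000)

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged across all processes on every step.
opt = hvd.DistributedOptimizer(
    tf.keras.optimizers.SGD(learning_rate=0.1 * hvd.size(), momentum=0.9))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

callbacks = [
    # Broadcast the initial weights from rank 0 so all workers start in sync.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# `dataset` is assumed to be a per-worker tf.data pipeline (see the data format section).
# model.fit(dataset, epochs=90, callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)
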
This architecture can also be used for model-parallel and asynchronous updates. In model-parallel distributed
training, the model is divided across n hardware devices, with each device holding a part of the model. In the
simplest implementation, each device may hold a layer of the network, and information is passed between devices
during the forward and backwards pass. Larger neural networks can be trained this way, but at the cost of
performance, since devices are constantly waiting for each other to complete either the forward or backwards
pass. Some advanced techniques try to partially alleviate this issue by using synthetic gradients.
The steps for training are:
1. Create scripts that will run on the cluster and train your model, then transfer them to file storage.
2. Write the data to Premium Blob Storage.
3. Create an Azure Machine Learning workspace. This will also create an Azure Container Registry to host your
Docker Images.
4. Create an Azure Machine Learning GPU Cluster.
5. Submit jobs. For each job with unique dependencies, a new Docker image is built and pushed to your
container registry. During execution, the appropriate Docker image runs and executes your script.
6. All the results and logs will be written to Blob storage.
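For reference, here is a hedged sketch of steps 3 through 5 using the Azure Machine Learning Python SDK (v1). The
workspace configuration, cluster name, VM size, curated environment name, and script paths are placeholders and
may differ from the reference implementation.

# Sketch: provision a GPU cluster and submit a distributed training job with the
# Azure Machine Learning SDK (v1). Names and sizes below are placeholders.
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.runconfig import MpiConfiguration

ws = Workspace.from_config()  # reads config.json downloaded from the Azure portal

# Create (or reuse) a GPU cluster that autoscales between 0 and 4 nodes.
cluster = ComputeTarget.create(
    ws, "gpu-cluster",
    AmlCompute.provisioning_configuration(vm_size="Standard_NC24rs_v3", min_nodes=0, max_nodes=4),
)
cluster.wait_for_completion(show_output=True)

# Curated environment with TensorFlow, Horovod, and Open MPI preinstalled
# (the exact name varies by SDK version and region).
env = Environment.get(ws, name="AzureML-TensorFlow-2.4-GPU")

config = ScriptRunConfig(
    source_directory="./src",
    script="train_model.py",
    compute_target=cluster,
    environment=env,
    distributed_job_config=MpiConfiguration(process_count_per_node=4, node_count=2),
)

run = Experiment(ws, "distributed-image-classification").submit(config)
run.wait_for_completion(show_output=True)  # stdout/stderr also sync to workspace Blob storage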

Architecture
This architecture consists of the following components.
Azure Machine Learning Compute plays the central role in this architecture by scaling resources up and down
according to need. Azure ML Compute is a service that helps provision and manage clusters of VMs, schedule jobs,
gather results, scale resources, and handle failures. It supports GPU-enabled VMs for deep learning workloads.
Blob storage is used to store the logs and results.
Premium Blob storage is used to store the training data. The premium blob storage is mounted in the nodes
using blobfuse. Premium blob storage offers better performance than standard blob storage and is recommended for
distributed training scenarios. When mounted, blobfuse uses caching so that the data is saved locally. This means that
after the first epoch, subsequent epochs read from the local storage, which is the most performant option.
Container Registry is used to store the Docker image that Azure Machine Learning Compute uses to run the
training.

Performance considerations
Azure provides four GPU-enabled VM types suitable for training deep learning models. They range in price and
speed from low to high as follows:
AZURE VM SERIES    NVIDIA GPU
NC                 K80
ND                 P40
NCv2               P100
NCv3               V100

We recommended scaling up your training before scaling out. For example, try a single V100 before trying a
cluster of K80s.
The following graph shows the performance differences for different GPU types based on benchmarking tests
carried out using TensorFlow and Horovod. The graph shows the throughput of 32-GPU clusters across various
models, on different GPU types and MPI versions. Models were implemented in TensorFlow 1.9.

Each VM series shown in the previous table includes a configuration with InfiniBand. Use the InfiniBand
configurations when you run distributed training, for faster communication between nodes. InfiniBand also
increases the scaling efficiency of the training for the frameworks that can take advantage of it. For details, see the
Infiniband benchmark comparison.
Although Azure Machine Learning Compute can mount Blob storage using the blobfuse adapter, we don't
recommend using Blob Storage this way for distributed training, because the performance isn't good enough for
the majority of cases to handle the necessary throughput. Use Premium Blob for the data as shown in the
architecture diagram.

Scalability considerations
The scaling efficiency of distributed training is always less than 100 percent due to network overhead — syncing
the entire model between devices becomes a bottleneck. Therefore, distributed training is most suited for large
models that cannot be trained using a reasonable batch size on a single GPU, or for problems that cannot be
addressed by distributing the model in a simple, parallel way.
Distributed training is not recommended for running hyperparameter searches. The scaling efficiency affects
performance and makes a distributed approach less efficient than training multiple model configurations
separately.
One way to increase scaling efficiency is to increase the batch size. That must be done carefully, however, because
increasing the batch size without adjusting the other parameters can hurt the model's final performance.
Storage considerations
When training deep learning models, an often-overlooked aspect is where the data is stored. If the storage is too
slow to keep up with the demands of the GPUs, training performance can degrade.
Azure Machine Learning Compute supports many storage options. For best performance, it is advisable to
download the data locally to each node. However, this can be cumbersome, because all the nodes must download
the data from Blob Storage, and with the ImageNet dataset, this can take a considerable amount of time. By default,
Azure ML mounts storage such that it caches the data locally. This means that, in practice, after the first epoch the
data is read from local storage. Combined with Premium Blob Storage, this offers a good compromise between
ease of use and performance.

Data format
With large datasets, it is often advisable to use data formats such as TFRecords and Parquet, which provide better
I/O performance than multiple small image files.
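As a small hedged example, reading a sharded TFRecord dataset with tf.data looks like the following; the file
pattern, feature keys, and image size are assumptions that must match however the records were written.

# Sketch: read sharded TFRecord files with tf.data. The file pattern, feature keys,
# and image size are assumptions and must match the schema used when the records were written.
import tensorflow as tf

feature_spec = {
    "image/encoded": tf.io.FixedLenFeature([], tf.string),
    "image/class/label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    example = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_jpeg(example["image/encoded"], channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, example["image/class/label"]

dataset = (
    tf.data.Dataset.list_files("/mnt/imagenet/train-*.tfrecord")
    .interleave(tf.data.TFRecordDataset, num_parallel_calls=tf.data.AUTOTUNE)
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10_000)
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)
)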

Security considerations
Encrypt data at rest and in motion
In scenarios that use sensitive data, encrypt the data at rest — that is, the data in storage. Each time data moves
from one location to the next, use SSL to secure the data transfer. For more information, see the Azure Storage
security guide.
Secure data in a virtual network
For production deployments, consider deploying the Azure Machine Learning cluster into a subnet of a virtual
network that you specify. This allows the compute nodes in the cluster to communicate securely with other virtual
machines or with an on-premises network. You can also use service endpoints with blob storage to grant access
from a virtual network.

Monitoring considerations
While running your job, it's important to monitor the progress and make sure that things are working as expected.
However, it can be a challenge to monitor across a cluster of active nodes.
Azure Machine Learning offers many ways to instrument your experiments. The stdout/stderr from your scripts
are automatically logged. These logs are automatically synced to your workspace Blob storage. You can either view
these files through the Azure portal, or download or stream them using the Python SDK or Azure Machine
Learning CLI. If you log your experiments using Tensorboard, these logs are automatically synced and you can
access them directly or use the Azure Machine Learning SDK to stream them to a Tensorboard session.

Cost considerations
Use the Azure pricing calculator to estimate costs. Here are some other tips for how to plan and manage costs.
Premium Blob Storage
Premium Blob Storage has a higher data storage cost, but the transaction cost is lower compared to data
stored in the regular Hot tier. So, Premium Blob Storage can be less expensive for workloads with high transaction
rates. For more information, see the pricing page.
Azure Container Registry
Azure Container Registry offers the Basic, Standard, and Premium tiers. Choose a tier depending on the storage you
need. Choose Premium if you need geo-replication or enhanced throughput for Docker pulls across
concurrent nodes. In addition, standard networking charges apply. For more information, see Azure Container
Registry pricing.
Azure Machine Learning Compute
In this architecture, Azure ML Compute is one of the main cost drivers. The implementation needs a cluster of GPU
compute nodes and selected VM sizes can impact cost. For more information on the VM sizes that include GPUs,
see GPU-optimized virtual machine sizes and Azure Virtual Machines Pricing.
For more information, see the Cost section in Microsoft Azure Well-Architected Framework.

Deployment
The reference implementation of this architecture is available on GitHub. Follow the steps described there to
conduct distributed training of deep learning models across clusters of GPU-enabled VMs.

Next steps
The output from this architecture is a trained model that is saved to blob storage. You can operationalize this
model for either real-time scoring or batch scoring. For more information, see the following reference
architectures:
Real-time scoring of Python scikit-learn and deep learning models on Azure
Batch scoring on Azure for deep learning models
Digital Asset Management

This architecture describes digital asset management for knowledge mining.


Given the amount of unstructured data created daily, many companies are struggling to make use of or find
information within their files. Knowledge mining through a search index makes it easy for end customers and
employees to locate what they are looking for faster.

Data Flow
There are three steps: ingest, enrich, and explore. First, the unstructured and structured data is ingested. The data
is then enriched with AI to extract information. Finally, the newly structured data is explored through search,
existing business applications, or analytics solutions.
1. The user can ingest different types of content, like article and image archives, photos and videos,
internal documents, marketing assets, and brochures.
2. This content is enriched by using automatic image captioning and object detection with computer vision,
celebrity recognition, language translation, and entity recognition.
3. Finally, the user can integrate the search index into a website.

Components
Key technologies used to implement tools for digital asset management
Azure Cognitive Search
Microsoft Computer Vision API
Microsoft Face API
Web API custom skill interface

Next steps
Using the knowledge mining solution accelerator to build an initial knowledge mining prototype with Azure
Cognitive Search.
Building custom skills with Microsoft's Custom Web API
Energy Supply Optimization

Solution Idea

In an energy grid, energy consumers are engaged with various types of energy supplying, trading, and storage
components, such as substations, batteries, wind farms and solar panels, micro-turbines, and demand
response bids, to meet their respective demands and minimize the cost of energy commitment. To do so, the grid
operator must determine how much energy each type of resource should commit over a time frame, given the
prices of soliciting different types of resources and their capacities and physical characteristics.
This solution is built upon the Cortana Intelligence Suite and external open-source tools, and it computes the optimal
energy unit commitments from various types of energy resources. This solution demonstrates the ability of the
Cortana Intelligence Suite to accommodate external tools that solve parallelized numerical optimization problems
over an Azure Batch pool of Azure Virtual Machines.

Architecture

Download an SVG of this architecture.

Description
Estimated Daily Cost: $12
For more details on how this solution is built, visit the solution guide in GitHub.
An energy grid consists of energy consumers, as well as various types of energy supplying, trading, and storage
components: substations accept power load or export excess power; batteries may discharge energy or store
it for future use; wind farms and solar panels (self-scheduled generators), micro-turbines (dispatchable generators),
and demand response bids can all be engaged to satisfy the demand from the consumers within the grid. The
costs of soliciting different types of resources vary, while the capacities and the physical characteristics of each
resource type limit the dispatch of the resource. Given all these constraints, a central challenge the smart grid
operator faces is how much energy each type of resource should commit over a time frame, so that the
forecasted energy demand from the grid is satisfied.
This solution provides an Azure-based smart solution, leveraging external open-source tools, that determines the
optimal energy unit commitments from various types of energy resources for an energy grid. The goal is to
minimize the overall cost incurred by these commitments while satisfying the energy demand. This solution
demonstrates the ability of Azure to accommodate external tools, such as Pyomo and CBC, to solve large-scale
numerical optimization problems such as mixed-integer linear programming, parallelizing multiple optimization
tasks over an Azure Batch pool of Azure Virtual Machines. Other involved products include Azure Blob Storage, Azure
Queue Storage, Azure Web App, Azure SQL Database, and Power BI.
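To make the optimization step concrete, here is a deliberately tiny Pyomo sketch of a single-period commitment
problem solved with CBC: choose how much energy to draw from each resource, at its own unit cost and capacity, so
that total demand is met at minimum cost. The resource names, costs, capacities, and demand are made-up
placeholders, and the real solution solves far larger mixed-integer problems in parallel on Azure Batch.

# Tiny single-period energy commitment sketch with Pyomo and the CBC solver.
# Resource names, costs, capacities, and demand are illustrative placeholders.
from pyomo.environ import (ConcreteModel, Var, Objective, Constraint,
                           NonNegativeReals, SolverFactory, minimize, value)

resources = ["substation", "battery", "wind", "micro_turbine"]
cost = {"substation": 42.0, "battery": 35.0, "wind": 8.0, "micro_turbine": 60.0}        # $/MWh
capacity = {"substation": 100.0, "battery": 20.0, "wind": 35.0, "micro_turbine": 25.0}  # MWh
demand = 120.0  # MWh to satisfy in this period

model = ConcreteModel()
model.dispatch = Var(resources, domain=NonNegativeReals)

model.total_cost = Objective(
    expr=sum(cost[r] * model.dispatch[r] for r in resources), sense=minimize)
model.meet_demand = Constraint(
    expr=sum(model.dispatch[r] for r in resources) >= demand)
model.capacity_limits = Constraint(
    resources, rule=lambda m, r: m.dispatch[r] <= capacity[r])

SolverFactory("cbc").solve(model)
for r in resources:
    print(f"{r}: {value(model.dispatch[r]):.1f} MWh")
print(f"total cost: ${value(model.total_cost):.2f}")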

Technical details and workflow


1. The sample data is streamed by newly deployed Azure Web Jobs. The web job uses resource-related data from
Azure SQL Database to generate the simulated data.
2. The data simulator feeds this simulated data into Azure Storage and writes a message to a Storage queue that
will be used in the rest of the solution flow.
3. Another web job monitors the storage queue and initiates an Azure Batch job once a message is available in the
queue.
4. The Azure Batch service together with Data Science Virtual Machines is used to optimize the energy supply from
a particular resource type given the inputs received.
5. Azure SQL Database is used to store the optimization results received from the Azure Batch service. These
results are then consumed in the Power BI dashboard.
6. Finally, Power BI is used for results visualization.
Forecast Energy and Power Demand

Solution Idea

Learn how Microsoft Azure can help accurately forecast spikes in demand for energy products and services to give
your company a competitive advantage.
This solution is built on the Azure managed services: Azure Stream Analytics, Event Hubs, Azure Machine Learning,
Azure SQL Database, Data Factory and Power BI. These services run in a high-availability environment, patched and
supported, allowing you to focus on your solution instead of the environment they run in.

Architecture

Download an SVG of this architecture.

Components
Azure Stream Analytics: Stream Analytics aggregates energy consumption data in near real-time to write to
Power BI.
Event Hubs ingests raw energy consumption data and passes it on to Stream Analytics.
Azure Machine Learning: Machine Learning forecasts the energy demand of a particular region given the inputs
received.
Azure SQL Database: SQL Database stores the prediction results received from Azure Machine Learning. These
results are then consumed in the Power BI dashboard.
Data Factory handles orchestration and scheduling of the hourly model retraining.
Power BI visualizes energy consumption data from Stream Analytics as well as predicted energy demand from
SQL Database.

Next steps
Learn more about Stream Analytics
Learn more about Event Hubs
Learn more about Azure Machine Learning
Learn more about SQL Database
Learn more about Data Factory
Learn more about Power BI
Image classification for insurance claims on Azure

This scenario is relevant for businesses that need to process images.


Potential applications include classifying images for a fashion website, analyzing text and images for insurance
claims, or understanding telemetry data from game screenshots. Traditionally, companies would need to develop
expertise in machine learning models, train the models, and finally run the images through their custom process to
get the data out of the images.
By using Azure services such as the Computer Vision API and Azure Functions, companies can eliminate the need
to manage individual servers, while reducing costs and leveraging the expertise that Microsoft has already
developed around processing images with Cognitive Services. This example scenario specifically addresses an
image-processing use case. If you have different AI needs, consider the full suite of Cognitive Services.

Relevant use cases


Other relevant use cases include:
Classifying images on a fashion website.
Classifying telemetry data from screenshots of games.

Architecture

This scenario covers the back-end components of a web or mobile application. Data flows through the scenario as
follows:
1. The API layer is built using Azure Functions. These APIs enable the application to upload images and retrieve
data from Cosmos DB.
2. When an image is uploaded via an API call, it's stored in Blob storage.
3. Adding new files to Blob storage triggers an Event Grid notification to be sent to an Azure Function.
4. Azure Functions sends a link to the newly uploaded file to the Computer Vision API for analysis.
5. Once the data has been returned from the Computer Vision API, Azure Functions makes an entry in Cosmos DB
to persist the results of the analysis along with the image metadata.
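The following hedged sketch shows steps 3 through 5 as a Python Azure Function: an Event Grid event for a new blob
triggers analysis by the Computer Vision API, and the result is written to Cosmos DB. The environment variable
names, database and container names, and the Event Grid binding (declared in the function's function.json) are
placeholders; it also assumes the function passes a URL that the Vision service can read, for example one that
includes a SAS token.

# Sketch of an Event Grid-triggered Azure Function (Python) that analyzes a newly
# uploaded image with the Computer Vision API and stores the result in Cosmos DB.
# Environment variable, database, and container names are placeholders.
import os
import uuid

import azure.functions as func
from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from azure.cognitiveservices.vision.computervision.models import VisualFeatureTypes
from azure.cosmos import CosmosClient
from msrest.authentication import CognitiveServicesCredentials

vision = ComputerVisionClient(
    os.environ["VISION_ENDPOINT"], CognitiveServicesCredentials(os.environ["VISION_KEY"]))
container = (CosmosClient(os.environ["COSMOS_ENDPOINT"], credential=os.environ["COSMOS_KEY"])
             .get_database_client("claims")
             .get_container_client("image-analysis"))


def main(event: func.EventGridEvent) -> None:
    # The Blob Created event payload contains the URL of the uploaded image.
    blob_url = event.get_json()["url"]

    analysis = vision.analyze_image(
        blob_url, visual_features=[VisualFeatureTypes.description, VisualFeatureTypes.tags])

    container.upsert_item({
        "id": str(uuid.uuid4()),
        "imageUrl": blob_url,
        "caption": analysis.description.captions[0].text if analysis.description.captions else None,
        "tags": [tag.name for tag in analysis.tags],
    })
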
Components
Computer Vision API is part of the Cognitive Services suite and is used to retrieve information about each
image.
Azure Functions provides the back-end API for the web application, as well as the event processing for uploaded
images.
Event Grid triggers an event when a new image is uploaded to Blob storage. The image is then processed with
Azure Functions.
Blob storage stores all of the image files that are uploaded into the web application, as well as any static files that
the web application consumes.
Cosmos DB stores metadata about each image that is uploaded, including the results of the processing from
Computer Vision API.

Alternatives
Custom Vision Service. The Computer Vision API returns a set of taxonomy-based categories. If you need to
process information that isn't returned by the Computer Vision API, consider the Custom Vision Service, which
lets you build custom image classifiers.
Azure Search. If your use case involves querying the metadata to find images that meet specific criteria, consider
using Azure Search. Cognitive search, currently in preview, seamlessly integrates this workflow.

Considerations
Scalability
The majority of the components used in this example scenario are managed services that will automatically scale. A
couple of notable exceptions: Azure Functions has a limit of a maximum of 200 instances. If you need to scale beyond
this limit, consider multiple regions or app plans.
Cosmos DB doesn't autoscale in terms of provisioned request units (RUs). For guidance on estimating your
requirements, see request units in our documentation. To fully take advantage of the scaling in Cosmos DB,
understand how partition keys work in Cosmos DB.
NoSQL databases frequently trade consistency (in the sense of the CAP theorem) for availability, scalability, and
partitioning. In this example scenario, a key-value data model is used and transaction consistency is rarely needed
as most operations are by definition atomic. Additional guidance to Choose the right data store is available in the
Azure Architecture Center. If your implementation requires high consistency, you can choose your consistency level
in Cosmos DB.
For general guidance on designing scalable solutions, see the performance efficiency checklist in the Azure
Architecture Center.
Security
Managed identities for Azure resources are used to provide access to other resources internal to your account and
then assigned to your Azure Functions. Only allow access to the requisite resources in those identities to ensure
that nothing extra is exposed to your functions (and potentially to your customers).
For general guidance on designing secure solutions, see the Azure Security Documentation.
Resiliency
All of the components in this scenario are managed, so at a regional level they are all resilient automatically.
For general guidance on designing resilient solutions, see Designing resilient applications for Azure.
Pricing
To explore the cost of running this scenario, all of the services are pre-configured in the cost calculator. To see how
the pricing would change for your particular use case, change the appropriate variables to match your expected
traffic.
We have provided three sample cost profiles based on amount of traffic (we assume all images are 100 kb in size):
Small: this pricing example correlates to processing < 5000 images a month.
Medium: this pricing example correlates to processing 500,000 images a month.
Large: this pricing example correlates to processing 50 million images a month.

Related resources
For a guided learning path, see Build a serverless web app in Azure.
Before deploying this example scenario in a production environment, review recommended practices for
optimizing the performance and reliability of Azure Functions.
Image classification with Convolutional Neural
Networks

Solution Idea

Lean manufacturing, cost control, and waste reduction are imperative for manufacturing to remain competitive. In
circuit-board manufacturing, faulty boards can cost manufacturers money and productivity. Assembly lines rely on
human operators to quickly review and validate boards flagged as potentially faulty by assembly-line test
machines.
This solution analyzes electronic component images generated by assembly-line cameras in a circuit-board
manufacturing plant and detects their error status. The goal is to minimize or remove the need for human
intervention. The solution builds an image classification system using a convolutional neural network with 50
hidden layers, pretrained on 350,000 images in an ImageNet dataset, to generate visual features of the images by
removing the last network layer. These features are then used to train a boosted decision tree to classify the image
as "pass" or "fail", and final scoring is conducted on edge machines at the plant. The classification performance results
are good (time-based cross-validation AUC > 0.90), which indicates the solution is suitable for drastically minimizing
human intervention for electronic-component failure detection in assembled circuit boards.
Using this solution to automate failure detection instead of relying solely on human operators helps improve the
identification of faulty electronic components and boost productivity.
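A simplified sketch of the featurize-then-classify pattern described above follows, using a pretrained ResNet50 with
its classification head removed as a featurizer and scikit-learn's gradient-boosted trees as a stand-in for the
boosted decision tree. The data arrays, image size, and hyperparameters are placeholders and differ from the
production solution.

# Simplified featurize-then-classify sketch: a pretrained ResNet50 (top layer removed)
# generates visual features, and a gradient-boosted tree classifies pass/fail.
# The data arrays and hyperparameters below are placeholders.
import numpy as np
import tensorflow as tf
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Pretrained CNN with the classification head removed; global average pooling
# turns each image into a fixed-length feature vector.
featurizer = tf.keras.applications.ResNet50(weights="imagenet", include_top=False, pooling="avg")

# images: (n, 224, 224, 3) preprocessed component photos; labels: 1 = fail, 0 = pass.
images = np.random.rand(64, 224, 224, 3).astype("float32")  # placeholder data
labels = np.random.randint(0, 2, size=64)                    # placeholder labels

features = featurizer.predict(tf.keras.applications.resnet50.preprocess_input(images * 255.0))

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels)
clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1)
clf.fit(X_train, y_train)

print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))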

Architecture
Download an SVG of this architecture.

Components
Azure Blob Storage: Data is ingested and stored in Azure Blob Storage.
GPU based Azure Data Science Virtual Machine: The core development environment is the Azure Ubuntu-based
GPU DSVM. The data is pulled from blob onto an Azure virtual hard disk (VHD) attached to the DSVM. On that
VHD, the data is processed, the images are featurized using a Deep Neural Network, and a Boosted Tree model is
trained. DSVM IPython Notebook server is used for solution development.
Microsoft Machine Learning for Apache Spark HDInsight Spark Cluster: As an alternative to DSVM-based
training, for big datasets, we use MMLSpark to build a highly scalable training solution.
Azure Container Registry: The model and web application are packaged into a Docker image and written to
Azure Container Registry.
Azure Machine Learning Model Management Service: Azure Machine Learning Model Management service is
used to deploy and manage the final model on a VM and to scale out using Azure Kubernetes Service to a
Kubernetes managed Azure cluster. A predictive web service and a Java ETL service are also written onto the VM,
each in its own container.
Azure Kubernetes Service (AKS): Deployment for this solution uses Azure Kubernetes Service running a
Kubernetes-managed cluster. The containers are deployed from images stored in Azure Container Registry.

Next steps
Learn more about Azure Blob Storage
Browse on GitHub
Learn more about Azure Container Registry
Learn more about Model Management
Learn more about Azure Kubernetes Service
Information discovery with deep learning and NLP

Solution Idea

Social sites, forums, and other text-heavy Q&A services rely heavily on tagging, which enables indexing and user
search. Without appropriate tagging, these sites are far less effective. Often, however, tagging is left to the users'
discretion. And since users don't have lists of commonly searched terms or a deep understanding of the
categorization or information architecture of a site, posts are frequently mislabeled. This makes it difficult or
impossible to find that content when it's needed later.
By combining deep learning and natural language processing (NLP) with data on site-specific search terms, this
solution helps greatly improve tagging accuracy on your site. As your user types their post, it offers highly used
terms as suggested tags, making it easier for others to find the information they're providing.

Architecture

Download an SVG of this architecture.

Components
Microsoft SQL Server: Data is stored, structured, and indexed using Microsoft SQL Server.
GPU based Azure Data Science Virtual Machine: The core development environment is the Microsoft Windows
Server 2016 GPU DSVM NC24.
Azure Machine Learning Workbench: The Workbench is used for data cleaning and transformation, and it serves
as the primary interface to the Experimentation and Model Management services.
Azure Machine Learning Experimentation Service: The Experimentation Service is used for model training,
including hyperparameter tuning.
Azure Machine Learning Model Management Service: The Model Management service is used for deployment
of the final model, including scaling out to a Kubernetes-managed Azure cluster.
Jupyter Notebooks on Azure Data Science VM: Jupyter Notebooks is used as the base IDE for the model, which
was developed in Python.
Azure Container Registry: The Model Management Service creates and packages real-time web services as
Docker containers. These containers are uploaded and registered via Azure Container Registry.
Azure Kubernetes Service (AKS): Deployment for this solution uses Azure Kubernetes Service running a
Kubernetes-managed cluster. The containers are deployed from images stored in Azure Container Registry.

Next steps
Learn more about Microsoft SQL Server
Learn more about Azure Machine Learning Workbench
Learn more about Azure Machine Learning sample experiments
Learn More about Azure Machine Learning Model Management
Learn more about Jupyter Notebooks
Learn more about Azure Container Registry
Learn more about Azure Kubernetes Service
Interactive Voice Response Bot
11/2/2020 • 2 minutes to read

Solution Idea

This solution creates an intelligent interactive voice response (IVR) application that processes customer order
requests for bicycles and bicycle accessories. Businesses with no existing IVR solution can easily get started
automating requests, or, where existing human-operated systems exist, this solution can be extended to incorporate
existing functionality and workflows.

Architecture

Download an SVG of this architecture.

Description
Estimated Daily Cost: $11.08
For more details on this solution, including source code and data customization, please fork or explore this project
on GitHub.
Automated customer interaction is essential to a business of any size. In fact, 61% of consumers prefer to
communicate via speech, and most of them prefer self-service. Because customer satisfaction is a priority for all
businesses, self-service is a critical facet of any customer-facing communications strategy.
In an era when mobile phones are dominant and keypads are not always readily accessible, interactive voice
response (IVR) systems provide an intuitive, simple, and convenient method for customers to convey their requests.
This solution provides an intelligent and natural self-service experience that can be repurposed across all customer
channels.
The Cortana Intelligence Suite provides advanced analytics tools backed by Microsoft's Azure cloud platform: data
ingestion, data storage, data processing, and advanced analytics components, all of the essential elements for
building an automated IVR solution.
This solution deploys a fully integrated service stack that manages speech requests via Skype and the Microsoft Bot
Framework.
The target use case of this solution is for a fictitious company, Contoso LLC, which sells bicycles and bicycle
equipment to its customers. Contoso currently processes new product orders and queries through human
operators. This automated solution allows Contoso to seamlessly scale up to handle large call volumes while
maintaining zero wait times and freeing up staff to manage other tasks.
The "Deploy" button on this page will launch a workflow that deploys an instance of this solution within an Azure
subscription of your choosing. The solution includes multiple Azure services (described below) and provides a few
short instructions after the deployment that will help you set up communications with your new bot.

AI and NLP Azure Services


Bing Speech transcribes raw speech data into text form.
LUIS (Language Understanding Intelligent Service) identifies caller intent (place an order, find an order, etc.) and
spoken entities (product categories, dates, times, etc.) from the transcribed text.
Azure Cognitive Search matches and ranks products from the inventory against the caller's description, and
applies advanced text analysis for common synonyms.
Microsoft Bot Framework processes conversation results from the caller and manages the call workflow.

Supporting Azure Services


Azure Web App hosts the core logic of the bot as a Node.js web app.
Cosmos DB stores shared session state for every conversation, allowing the web application to scale out in a
stateless architecture.
Azure SQL stores product inventory and order status.
Keyword search/speech-to-text/OCR digital media
11/2/2020 • 2 minutes to read

Solution Idea

A speech-to-text solution allows you to identify speech in static video files so you can manage it as standard
content, such as allowing employees to search within training videos for spoken words or phrases, and then
enabling them to quickly navigate to the specific moment in the video. This solution allows you to upload static
videos to an Azure website. The Azure Media Indexer uses the Speech API to index the speech within the videos and
stores it in SQL Azure. You can search for words or phrases by using Azure Web Apps and retrieve a list of results.
Selecting a result enables you to see where in the video the word or phrase is mentioned.
This solution is built on the Azure managed services: Content Delivery Network and Azure Cognitive Search. These
services run in a high-availability environment, patched and supported, allowing you to focus on your solution
instead of the environment they run in.

Architecture

Download an SVG of this architecture.

Components
Azure Blob Storage: Stores large amounts of unstructured data, such as text or binary data, that can be accessed from anywhere in
the world via HTTP or HTTPS. You can use Blob storage to expose data publicly to the world, or to store
application data privately.
Azure encoding: Encoding jobs are one of the most common processing operations in Media Services. You
create encoding jobs to convert media files from one encoding to another.
Azure streaming endpoint: Represents a streaming service that can deliver content directly to a client player
application, or to a content delivery network (CDN) for further distribution.
Content Delivery Network: Provides secure, reliable content delivery with broad global reach and a rich feature
set.
Azure Media Player: Uses industry standards, such as HTML5 (MSE/EME) to provide an enriched adaptive
streaming experience. Regardless of the playback technology used, developers have a unified JavaScript
interface to access APIs.
Azure Cognitive Search: Delegates search-as-a-service server and infrastructure management to Microsoft,
leaving you with a ready-to-use service that you can populate with your data, and then use to add search to
your web or mobile application.
Web Apps: Hosts the website or web application.
Azure Media Indexer: Enables you to make the content of your media files searchable and to generate a full-text
transcript for closed-captioning and keywords. You can process one media file or multiple media files in a batch.

Next steps
How to use Azure Blob storage
How to encode an asset using Media Encoder
How to manage streaming endpoints
Using Azure Content Delivery Network
Develop video player applications
Create an Azure Cognitive Search service
Run Web Apps in the cloud
Indexing media files
Machine teaching
11/2/2020 • 25 minutes to read

Artificial intelligence (AI) and machine learning (ML) offer unique opportunities and challenges for operations that
span the virtual and physical worlds. AI and ML can recognize correlations between real-world input data and
outcomes and make decisions that automate complex physical industrial systems. But AI and ML systems
can't perform higher-level cognitive functions like exploration, improvisation, creative thinking, or determination of
causation.
Machine teaching is a new paradigm for machine learning systems that:
Infuses subject matter expertise into automated AI systems models.
Uses deep reinforcement learning to identify patterns in the learning process and adopt positive behaviors in its
own methods.
Leverages simulated environments to generate large amounts of synthetic data for domain-specific use cases
and scenarios.
Machine learning focuses on developing novel learning algorithms or improving existing algorithms. Machine
teaching focuses on the efficacy of the teachers themselves. Abstracting away AI complexity to focus on subject
matter expertise and real-world conditions creates powerful AI and ML models that turn automated control
systems into autonomous systems.

History of automation
Humans have been designing physical tools and machines to perform tasks more efficiently for thousands of
years. These technologies aim to achieve output more consistently, at a lower cost, and with less direct manual
labor.
The First Industrial Revolution, late 1700s to mid-1800s, introduced machines to replace hand production methods
in manufacturing. The Industrial Revolution increased production efficiency through automation with steam power,
and through consolidation by moving production from homes to organized factories. The Second Industrial
Revolution, mid-1800s to early 1900s, advanced production capability through electrification and production lines.
World Wars I and II brought major advances in information theory, communications, and signal processing.
Development of the transistor allowed information theory to be easily applied to controlling physical systems. This
Third Industrial Revolution allowed computer systems to make inroads in the hard-coded control of physical
systems like production, transport, and healthcare. The benefits of programmed automation included consistency,
reliability, and security.
The Fourth Industrial Revolution introduced the notion of cyber-physical systems and the industrial Internet of
Things (IoT). The systems humans wish to control have become too large and complex to write fully prescribed
rules. Artificial intelligence lets smart machines perform tasks that typically required human intelligence. Machine
learning lets machines automatically learn and improve from experience without being explicitly programmed.
AI and ML
AI and ML aren't new concepts, and many of the theories have been unchanged for decades, but recent
technological advances in storage, bandwidth, and computing enable more accurate and useful algorithm
predictions. Increased device processing capacity, miniaturization, storage capacity, and network capacity allow
further automation of systems and equipment. These advances also allow collection and collation of vast quantities
of real-time sensor data.
Cognitive automation is the application of software and AI to information-intensive processes and systems.
Cognitive AI can augment manual workers for increased productivity, replace human workers in monotonous or
hazardous fields, and allow for new insights because of the huge volumes of data it can process. Cognitive
technologies like computer vision, natural language processing, chatbots, and robotics can perform tasks that
previously only humans could do.
Many current production systems automate and accomplish impressive feats of engineering and manufacturing
using industrial robots. The use and evolution of industrial automation in manufacturing industries produces
higher-quality, safer products with more efficient use of energy and raw materials. However, in most cases, robots
can only operate in highly structured environments. They're typically inflexible to change and highly specialized
toward immediate tasks. Robots may also be expensive to develop because of the hardware and software rules
that govern their behaviors.
The paradox of automation claims that the more efficient an automated system becomes, the more crucial the
human component is to operations. The role of humans changes from mundane per-unit-of-work labor to
improving and managing the automated system and contributing essential domain expertise. While an automated
system can produce output more efficiently, it can also create waste and problems if it's poorly designed or
running incorrectly. Efficiently using automation makes humans more important, not less.
AI use cases

In the preceding diagram, the Control and Optimization categories relate to automation of work. On this side of
the AI spectrum, there are practically infinite problem spaces. Subject matter expertise is a strategic asset,
simulations are the most doable training option, and explainability is paramount.
Orchestrators include smart manufacturing and the Bonsai machine teaching platform. Use cases include
advanced robotics, autonomous vehicles, dark factories, smart sensors, route optimization, inventory optimization,
and virtual assistants.

Reinforcement learning
Machine teaching depends on reinforcement learning (RL) to train models, and to identify patterns in the learning
process and adopt positive behaviors in its own methods. Deep reinforcement learning (DRL) applies
reinforcement learning to complex deep learning neural networks.
RL in machine learning concerns how software agents learn to maximize rewards and desired outcomes in their
environments. RL is one of the three basic machine learning paradigms:
Supervised learning generalizes from tagged or structured data.
Unsupervised learning compresses unlabeled or unstructured data.
Reinforcement learning acts through trial and error.
While supervised learning is learning by example, reinforcement learning is learning from experience. Unlike
supervised learning, which focuses on finding and labeling suitable datasets, RL concentrates on designing models
of how to perform tasks.
The key RL components are:
Agent: the entity that can make a decision to change the current environment.
Environment: the physical or simulated world in which the agent operates.
State: the current situation of the agent and its environment.
Action: an interaction by the agent on its environment.
Reward: the feedback from the environment, following from an action of the agent.
Policy: the method or function to map the current state of the agent and its environment to actions.
RL uses reward functions and policies to assess agent actions and provide feedback. Through sequential decision
making based on the current environment, agents learn to maximize reward over time and to predict the best
possible actions in specific situations.
RL teaches the agent to complete an objective by rewarding desired behavior and not rewarding undesired
behavior. The following diagram illustrates the conceptual flow of RL and how the key components interact:

1. An agent, in this case a robot, takes an action in an environment, in this case a smart manufacturing line.
2. The action causes the environment to change state, and return its changed state to the agent.
3. An assessment mechanism applies a policy to determine what consequence to deliver to the agent.
4. The reward mechanism encourages beneficial actions by delivering a positive reward, and may discourage
negative actions by delivering a penalty.
5. Rewards cause desired actions to increase, while penalties cause undesired actions to decrease.
A problem can be stochastic (random) in nature, or deterministic. Although one agent is most common, there can
also be multiple agents in the environment. The agent senses the environment by observation. The environment
can be fully or partially observable, as determined by the agent's sensors, and observations can be discrete or
continuous.
Each observation is followed by an action, which causes the environment to change. This cycle repeats until a
terminal state is reached. Typically, the system has no memory, and the algorithm simply cares about the state it
comes from, the state it arrives at, and the reward it receives.
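A minimal code sketch of this observe-act-learn cycle, assuming a generic Gym-style environment interface (reset/step) and a placeholder agent, neither of which is prescribed here:

# Sketch of the RL interaction loop; `env` and `agent` are placeholders for a
# concrete simulator and learning algorithm.
def run_episode(env, agent, max_steps=1000):
    state = env.reset()                                  # initial observation
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                        # policy maps state -> action
        next_state, reward, done, _ = env.step(action)   # environment changes state
        agent.learn(state, action, reward, next_state)   # feedback updates the policy
        total_reward += reward
        state = next_state
        if done:                                         # terminal state reached
            break
    return total_reward
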
As the agent learns through trial and error, it needs large quantities of data to evaluate its actions. RL is most
applicable to domains that have large historical bodies of data, or that can easily produce simulated data.
Reward functions
A reward function determines how much and when to reward a particular action. The reward structure is normally
left to the system owner to define. Adjusting this parameter can significantly affect results.
The agent uses the reward function to learn about the physics and dynamics of the world around it. The
fundamental process by which an agent learns to maximize its reward, at least initially, is trial and error.
Exploration versus exploitation trade-off
Depending on the goal and reward function, the agent must balance exploration versus maximizing its reward. This
choice is called the exploration versus exploitation trade-off. As with many aspects of the real world, the agent
must balance the merits of further exploration of the environment, which may lead to better decisions in the
future, with exploitation of the environment, using all the knowledge the agent currently has about the world to
maximize reward. Taking different actions can offer a fresh perspective, especially if the actions haven't been tried
before.
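The article doesn't prescribe a particular balancing strategy, but one common illustration is an epsilon-greedy rule, sketched here with assumed names:

import random

def epsilon_greedy_action(q_values, actions, epsilon=0.1):
    """Explore with probability epsilon; otherwise exploit the best-known action.
    q_values maps each action to its current reward estimate."""
    if random.random() < epsilon:
        return random.choice(actions)                            # exploration
    return max(actions, key=lambda a: q_values.get(a, 0.0))      # exploitation
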
The following training dashboard shows the exploration versus exploitation trade-off. The chart shows both the
smoothed rewards and episode rewards, with the episode rewards on the y-axis and the training iterations on the
x-axis. The episode reward density rises to 400 in the first 50,000 iterations, then keeps steady until 400,000
iterations, when it rises to 1,500 and remains steady.

Cobra effect
Rewards are subject to what is known in economics as the cobra effect. During the British rule of colonial India, the
government decided to cull the large wild cobra snake population by offering a reward for every dead cobra.
Initially, this policy succeeded, as large numbers of the snakes were killed to claim the reward. However, it didn't
take long for people to start gaming the system, and breeding cobras deliberately to collect the reward. Eventually,
the authorities noticed this behavior and canceled the program. With no further incentive, the cobra breeders set
their snakes free, with the result that the wild cobra population actually increased compared to what it was at the
start of the incentive.
The well-intentioned incentivization had made the situation worse, not better. The learning from this is that the
agents learn the behavior you incentivize, which may not produce the result you intended.
Shaped rewards
Creating a reward function with a particular shape can allow the agent to learn an appropriate policy more easily
and quickly.
A step function is an example of a sparse reward function that doesn't tell the agent much about how good its
action was. In the following step reward function, only an action that brings the distance to between 0.0 and 0.1
generates the full reward of 1.0. When the distance is greater than 0.1, there is no reward.
In contrast, a shaped reward function gives the agent a sign of how close the action is to the desired response. The
following shaped reward function gives a greater reward the closer the response is to the desired distance of 0.0.
The curve of the function is a hyperbola. The reward is 1.0 for distance 0.0 and gradually drops to 0.0 as the
distance approaches 1.0.
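Because the reward-function figures aren't reproduced here, the following sketch restates the two shapes in code. The hyperbolic formula is an assumption chosen to match the described behavior (1.0 at distance 0.0, approaching 0.0 near distance 1.0), not the exact function from the original figure.

def step_reward(distance):
    """Sparse reward: full credit only when the distance is within 0.1."""
    return 1.0 if 0.0 <= distance <= 0.1 else 0.0

def shaped_reward(distance):
    """Shaped reward: hyperbolic falloff that gives partial credit for near misses."""
    return max(0.0, (1.0 - distance) / (1.0 + 9.0 * distance))
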
Shaping might discount the value of a future reward versus a more immediate reward, or encourage exploration
by shrinking the size of rewards around the goal.
Sometimes, a reward function might specify temporal as well as spatial considerations, to encourage ordered
sequences of actions. However, if a shaped reward function is becoming very large and complex, consider breaking
up the problem into smaller stages and using concept networks.
Concept networks
Concept networks allow specifying and reusing domain-specific knowledge and subject matter expertise to collect
a wanted ordering of behavior into a specific sequence of separate tasks. Concept networks help constrain the
search space within which the agent can operate and take actions.
In the following concept network for grasping and stacking objects, the Grasp and Stack box is the parent of two
gray boxes, Reach and Move, and three green boxes, Orient, Grasp, and Stack.
Concept networks often allow reward functions to be more easily defined. Each concept can use the most
appropriate approach for that task. The notion of concept networks helps with decomposability of the solution into
constituent pieces. Components can be replaced without retraining the whole system, allowing reuse of pre-
trained models and use of existing controllers or other existing ecosystem components. Especially in industrial
control systems, incremental piecemeal improvement can be more desirable than complete removal and
replacement.
Curriculum learning and apprenticeship learning
Dividing the problem into separate sequential tasks with concept networks allows splitting up the problem into
stages of difficulty and presenting it to the agent as a curriculum of increasing difficulty. This phased approach
starts with a simple problem, lets the agent practice, then challenges it more and more as its ability increases. The
reward function changes and evolves as the agent becomes more capable at its task. This curriculum learning
approach helps guide exploration and drastically reduces required training time.
You can also constrain the policy search space for the agent by instructing it to learn by mimicking the behavior of
an external expert. Apprenticeship learning uses expert-guided exemplars to constrain the state space the agent
explores. Apprenticeship learning trades off learning known solutions more quickly at the expense of not
discovering novel solutions.
An example of apprenticeship learning is teaching a self-driving car agent to mimic the actions of a human driver.
The agent learns how to drive, but also inherits any flaws and idiosyncrasies of the teacher.
Design of RL-based AI systems
The following strategy is a practical guide to constructing and building RL-based AI systems:
1. Formulate and iterate on states, terminal conditions, actions, and rewards.
2. Craft reward functions, shaping them as necessary.
3. Allocate rewards for specific subgoals.
4. Discount rewards aggressively if necessary.
5. Experiment with initial states.
6. Experiment with a sampling of examples for training.
7. Limit variation of simulation dynamics parameters during training.
8. Generalize during prediction and keep training as smooth as possible.
9. Introduce some physically relevant noise to accommodate noise in real machines.

Simulations
AI systems are data-hungry and require exposure to many scenarios to ensure they're trained to make appropriate
decisions. The systems often require expensive prototypes that risk damage in real-world environments. The cost
of collecting and manually labeling high fidelity training data is high, both in terms of time and direct labor. Using
simulators and densely labeled training data generated by simulators is a powerful means of addressing much of
this data deficit.
The curse of dimensionality refers to the phenomena that arise when dealing with large quantities of data in high-
dimensional spaces. Accurately modeling certain scenarios and problem sets requires the use of deep neural
networks. These networks themselves are highly dimensional, with many parameters that need fitting. As
dimensionality increases, the volume of the space increases at such a rate that available real-world data becomes
sparse, making it challenging to collect enough data to make statistically significant correlations. Without enough
data, training results in a model that underfits the data and doesn't generalize well to new data, which defeats the
purpose of a model.
The problem is twofold:
The training algorithm has a large learning capacity to accurately model the problem, but needs more data to
prevent underfitting.
Collecting and labeling this large amount of data, if doable, is difficult, expensive, and error-prone.
Simulations offer an alternative to having to collect huge amounts of real-world training data, by virtually
modeling systems in their intended physical environments. Simulations allow training in hazardous environments,
or in conditions difficult to reproduce in the real world, like various types of weather conditions. Artificially
simulated data sidesteps the difficulty in data collection, and keeps algorithms appropriately fed with example
scenarios that allow them to accurately generalize to the real world.
Simulations are the ideal training source for DRL, because they're:
Flexible to custom environments.
Safe and cost-effective for data generation.
Parallelizable, allowing for faster training times.
Simulations are available across a broad range of industries and systems, including mechanical and electrical
engineering, autonomous vehicles, security and networking, transportation and logistics, and robotics.
Simulation tools include:
Simulink, a graphical programming tool developed by MathWorks for modeling, simulating, and analyzing
dynamic systems.
Gazebo, a tool to allow accurate simulation of populations of robots in complex indoor and outdoor
environments.
Microsoft AirSim, an open-source robotics simulation platform.

Machine teaching paradigm


Machine teaching offers a new paradigm for building ML systems that moves the focus away from algorithms and
onto successful model generation and deployment. Machine teaching identifies patterns in the learning process
itself, and adopts positive behavior into its own method. Much of the activity in machine learning is focused on
improving existing algorithms or developing novel learning algorithms. In contrast, machine teaching focuses on
the efficacy of the teachers themselves.
Machine teaching:
Combines subject matter expertise from human domain experts with AI and ML.
Automates generation and management of deep reinforcement learning algorithms and models.
Integrates simulations for model optimization and scalability.
Provides for greater explainability of the behavior of resulting models.
The state of machine learning has largely been determined by a few algorithm experts. These experts have a deep
understanding of ML and can change an ML algorithm or architecture to meet required performance or accuracy
metrics. The number of ML experts globally can be estimated in the tens of thousands, which slows down the
adoption of ML solutions. The overbearing complexity of the models puts ML capabilities outside the reach of
many.
While ML experts are few, subject-matter experts are plentiful. Globally, there are tens of millions of domain
experts. Machine teaching taps into this larger pool of experts who understand the semantics of problems and can
provide examples, but need not be aware of the intricacies of ML. Machine teaching is the fundamental abstraction
needed to efficiently program subject matter expertise by codifying what to teach and how to teach it. Subject
matter experts with no AI background can break down their expertise into steps and tasks, criteria, and desired
outcomes.
For engineers, machine teaching raises the bar of abstraction beyond AI algorithm selection and hyper-parameter
tweaking to focus on more valuable application domain problems. Engineers building autonomous systems can
create accurate, detailed models of systems and environments, and make them intelligent using methods like deep
learning, imitation learning, and reinforcement learning. Another welcome outcome of machine teaching is a faster
time to model deployment, by reducing or eliminating the need for manual intervention from machine learning
experts during development.
Machine teaching streamlines the process of building ML solutions by looking at common ML practices and
adopting beneficial strategies into its own methods. With instruction and configuration from the developer, Bonsai,
the machine teaching service in the Microsoft Autonomous Systems Platform, can automate the development of AI
models into an AI system.
Bonsai provides an easily understandable central dashboard that tracks the current state of each project with
version control tools. Using this machine-teaching infrastructure ensures that model results can be reproduced,
and lets developers easily refresh AI systems with future AI algorithmic breakthroughs.

A perspective shift to a machine teaching methodology promotes ML adoption with a more streamlined and
accessible process for generating and deploying ML models. Machine teaching offers a way for domain experts to
apply the power of DRL as a tool. Machine teaching moves the technology of AI from a focus on ML algorithms
and techniques to the application of these algorithms by domain experts to real-world problems.
Machine teaching process
Machine teaching development and deployment has three phases: Build, Train, and Deploy.
1. The Build phase consists of writing the machine teaching program and connecting to a domain-specific training
simulator. Simulators generate sufficient training data for experiments and machine practice.
2. In the Train phase, the training engine automates DRL model generation and training by combining high-level
domain models with appropriate DRL algorithms and neural networks.
3. The Deploy phase deploys the trained model to the target application in the cloud, on-premises, or embedded
on site. Specific SDKs and deployment APIs deploy trained AI systems to various target applications, perform
machine tuning, and control the physical systems.
Simulated environments generate large amounts of synthetic data covering many use cases and scenarios.
Simulations provide safe and cost-effective data generation for model algorithm training, and faster training times
with simulation parallelization. Simulations help train the models across different kinds of environmental
conditions and scenarios, much faster and safer than is feasible in the real world.
Subject-matter experts can supervise agents as they work to solve problems in simulated environments, and
provide feedback and guidance that lets the agents dynamically adapt within the simulation. After training is
complete, engineers deploy the trained agents onto actual hardware, where they can use their knowledge to power
autonomous systems in the real world.
Machine learning and machine teaching
Machine teaching and machine learning are complementary and can evolve independently. Machine learning
research focuses on making the learner better by improving machine-learning algorithms. Machine teaching
research focuses on making the teacher more productive at building the machine learning models. Machine
teaching solutions require several machine learning algorithms to produce and test models throughout the
teaching process.
The following diagram shows a representative pipeline for building a machine learning model:
1. The problem owner collects and labels datasets, or assembles a label guideline so the labeling task can be
outsourced.
2. The problem owner reviews the labels until their quality is satisfactory.
3. Machine-learning experts select an algorithm, model architecture, objective function, regularizers, and cross-
validation sets.
4. Engineers train the model cyclically, adjusting the features or creating new features to improve model accuracy
and speed.
5. The model is tested on a small sample. If the system doesn't do well in the test, the preceding steps are
repeated.
6. Model performance is monitored in the field. If performance falls below a critical level, the model is modified by
repeating the preceding steps.
Machine teaching automates the creation of such models, easing the need for manual intervention in the learning
process to improve feature selection or examples, or tweaking of hyper-parameters. In effect, machine teaching
introduces a level of abstraction into the AI elements of the model, allowing the developer to focus on the domain
knowledge. This abstraction also allows the AI algorithm to be replaced by new more innovative algorithms in
time, without requiring a respecification of the problem.
The role of the teacher is to optimize the transfer of knowledge to the learning algorithm so it can generate a
useful model. Teachers also play a central role in data collection and labeling. Teachers can filter unlabeled data to
select specific examples, or look at the available example data and guess its label based on their own intuition or
biases. Similarly, given two features on a large unlabeled set, teachers can conjecture that one is better than the
other.
The following image shows the high-level process of machine teaching:
1. The teacher first questions whether a training set is realizable.
2. If the training set isn't realizable, the teacher determines whether the issue is due to inadequate labeling or
feature deficiencies. After correcting the labeling or adding features, the teacher again assesses whether the
training set is realizable.
3. If the training set is realizable, the teacher assesses whether training quality criteria are being met.
4. If quality criteria aren't being met, the teacher finds the test errors and adds the fixes to the training set, then
repeats the assessment steps.
5. Once the training set is realizable and quality criteria are met, the process finishes.
The process is a pair of indefinite loops, ending only when the model and the training itself are of sufficient quality.
The learning capacity of the model increases on demand. There's no need for traditional regularization, because the
teacher controls the capacity of the learning system by adding features only when necessary.
Machine teaching and traditional programming
Machine teaching is a form of programming. The goal of both programming and machine teaching is to create a
function. The steps to create a stateless target function that returns the value Y given an input X are similar for both
processes:

1. Specify the target function.
2. Decompose the target function into subfunctions if applicable.
3. Debug and test the functions and subfunctions.
4. Document the functions.
5. Share the functions.
6. Deploy the functions.
7. Maintain the functions with scheduled and unscheduled debug cycles.
Debugging or evaluating the performance of the solution has different attributes between the two processes. In
programming, debugging involves manually editing and recompiling code. In machine teaching, debugging
includes adding knowledge labels and features, and further training.
Building a target classification function that returns class Y given input X involves a machine learning algorithm,
while the process for machine teaching is like the set of programming steps above.
The following table illustrates some conceptual similarities between traditional programming and machine
teaching:

PROGRAMMING | MACHINE TEACHING
Compiler | Machine-learning algorithms, support vector machines (SVMs), neural networks, training engine
Operating systems, services, integrated development environments (IDEs) | Training, sampling, feature selection, machine training service
Frameworks | ImageNet, word2vec
Programming languages like Python and C# | Inkling, labels, features, schemas
Programming expertise | Teaching expertise
Version control | Version control
Development processes like specifications, unit tests, deployment, monitoring | Teaching processes like data collection, testing, publishing

A powerful concept that allows software engineers to write systems that solve complex problems is
decomposition. Decomposition uses simpler concepts to express more complex ones. Machine teachers can learn
to decompose complex machine learning problems with the right tools and experiences. The machine teaching
discipline can bring the expectations of success for teaching a machine to a level comparable to that of
programming.

Machine teaching projects


Prerequisites:
Some experience with gathering, exploring, cleaning, preparing, and analyzing data
Familiarity with basic ML concepts like objective functions, training, cross-validation, and regularization
When building a machine teaching project, start with a true to life but relatively simple model, to allow for fast
iteration and formulation. Then, iteratively improve the fidelity of the model, and make the model more
generalizable through better scenario coverage.
The following diagram shows the phases of iterative machine teaching model development. Each successive step
requires a larger number of training samples.
1. Set up initial rough simulation model for approach validation.
2. Enhance simulation fidelity by modeling environment dynamics concurrent with iterative teaching refinements.
3. Generalize model with dynamic randomization and expansion.
4. Measure real system dynamics and test trained model on real equipment.
5. Expand model to cover more equipment variation.
Defining exact parameters for machine teaching projects requires quite a bit of experimentation and empirical
exploration. A machine teaching platform like Bonsai in the Microsoft Autonomous Systems platform uses DRL
innovations and simulations to help simplify AI model development.
Example project
An example of an autonomous systems AI project is the use case of a manufacturing process optimization. The
goal is to optimize the thickness tolerance of a steel beam being manufactured on a production line. Rollers
provide pressure across a piece of steel to shape it into the designed thickness.
The machine state inputs to the AI system are the rolling force, roller error, and roller noise. The control actions
from the AI system are actuator commands to control the operation and motion of the rollers, and optimize the
thickness tolerance of the steel beam.
First, find or develop a simulator that can simulate agents, sensors, and the environment. The following Matlab
simulation model provides an accurate training environment for this AI system:
Use the Bonsai machine teaching service in the Microsoft Autonomous Systems Platform to build a machine
teaching plan into a model, train the model against the simulator, and deploy the trained AI system to the real
production facility.
Inkling is a purpose-specific language to formally describe machine teaching plans. In Bonsai, you can use Inkling
to deconstruct the problem into schema:

Then, define key concepts and create a curriculum to teach the AI system, specifying the reward function for the
simulation state:
The AI system learns by practicing the optimization task in simulation, following the concepts of machine teaching.
You can upload the simulation into Bonsai, where it provides visualizations of training progress as it runs.
After building and training the model or brain, you can export it to deploy to the production facility, where optimal
actuator commands stream from the AI engine to support operator decisions in real time.
Other example applications
The following machine teaching examples create policies to control the motions of physical systems. In both cases,
manually creating a policy for the agent is infeasible or very difficult. Allowing the agent to explore the space in
simulation and guiding it to make choices through reward functions produces accurate solutions.
Cartpole
In the sample Cartpole project in Bonsai, the goal is to teach a pole to remain upright on a moving cart. The pole is
attached by an unactuated joint to the cart, which moves along a frictionless track. The available sensor
information includes the cart position and velocity, and the pole angle and angular velocity.
Applying a force to the cart controls the system. The supported agent actions are to push the cart to the left or the
right. The program provides a positive reward for every time step that the pole remains upright. The episode ends
when the pole is over 15 degrees from vertical, or the cart moves more than a predefined number of units from
the center.
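As a concrete illustration of the reward and terminal conditions just described, here is a minimal sketch. The 15-degree limit comes from the text; the 2.4-unit cart limit is an assumption borrowed from the classic Cartpole formulation, since the text only says "a predefined number of units."

import math

POLE_LIMIT_RADIANS = math.radians(15)   # episode ends beyond 15 degrees from vertical
CART_LIMIT_UNITS = 2.4                  # assumed value for "a predefined number of units"

def cartpole_step_feedback(cart_position, pole_angle):
    """Return (reward, done) for one time step of the Cartpole task."""
    done = abs(pole_angle) > POLE_LIMIT_RADIANS or abs(cart_position) > CART_LIMIT_UNITS
    reward = 0.0 if done else 1.0       # +1 for every step the pole stays upright
    return reward, done
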
The example uses Inkling language to write the machine teaching program, and the provided Cartpole simulator to
speed and improve the training.
The following Bonsai screenshot shows a Cartpole training, with Goal satisfaction on the y-axis and Training
iterations on the x-axis. The Bonsai dashboard also shows the percentage of goal satisfaction and the total
training time.

For more information about the Cartpole example or to try it yourself, see:
Quickstart: Balance a pole with AI (Cartpole)
Learn how you can teach an AI agent to balance a pole
Oil drilling
The Horizontal Oil Drilling application is a motion controller to automate oil rigs that drill horizontally
underground. An operator controls the drill underground with a joystick to keep the drill inside the oil shale while
avoiding obstacles. The drill makes as few steering actions as possible, for faster drilling. The goal is to use
reinforcement learning to automate control of the horizontal oil drill.
The available sensor information includes the direction of drill bit force, weight of the drill bit, side force, and
drilling angle. The supported agent actions are to move the drill bit up, down, left, or right. The program provides a
positive reward when the drill is within the tolerance distance of the chamber walls. The model learns to adapt to
different well plans, drill starting positions, and sensor inaccuracies.
For more information and a demo of this solution, see Motion control: Horizontal oil drilling.

Next steps
Machine teaching for autonomous systems
Autonomous systems with Microsoft AI
Autonomy for industrial control systems
Machine teaching: How people's expertise makes AI even more powerful
Microsoft broadens availability of autonomous systems tools for engineers and developers
Innovation space: Autonomous systems (Video)
Microsoft The AI Blog
Bonsai documentation
Microsoft AirSim on GitHub
Aerial Informatics and Robotics Platform (AirSim)
Gazebo
Simulink

Related resources
"Bonsai, AI for Everyone," 2016 March 2
"AI use cases: innovations solving more than just toy problems," 2017 March 2
Patrice Y. Simard, Saleema Amershi, David M. Chickering, et al., "Machine Teaching: A New Paradigm for Building
Machine Learning Systems," 2017
Carlos E. Perez, "Deep Teaching: The Sexiest Job of the Future," 2017 July 29
Tambet Matiisen, "Demystifying deep reinforcement learning," 2015 December 19
Andrej Karpathy, "Deep Reinforcement Learning: Pong from Pixels," 2016 May 31
David Kestenbaum, "Pop Quiz: How Do You Stop Sea Captains From Killing Their Passengers?" 2010 September
10
Machine learning operationalization (MLOps) for
Python models using Azure Machine Learning
11/2/2020 • 8 minutes to read

This reference architecture shows how to implement a continuous integration (CI), continuous delivery (CD), and
retraining pipeline for an AI application using Azure DevOps and Azure Machine Learning. The solution is built on
the scikit-learn diabetes dataset but can be easily adapted for any AI scenario and other popular build systems
such as Jenkins and Travis.

A reference implementation for this architecture is available on GitHub.

Architecture
This architecture consists of the following components:
Azure Pipelines. This build and test system is based on Azure DevOps and used for the build and release
pipelines. Azure Pipelines breaks these pipelines into logical steps called tasks. For example, the Azure CLI task
makes it easier to work with Azure resources.
Azure Machine Learning. Azure Machine Learning is a cloud service for training, scoring, deploying, and
managing machine learning models at scale. This architecture uses the Azure Machine Learning SDK for Python 3
to create a workspace, compute resources, the machine learning pipeline, and the scoring image. An Azure
Machine Learning workspace provides the space in which to experiment, train, and deploy machine learning
models.
Azure Machine Learning Compute is a cluster of on-demand virtual machines with automatic scaling and both
CPU and GPU node options. The training job is executed on this cluster.
Azure Machine Learning pipelines provide machine learning workflows that can be reused across
scenarios. Training, model evaluation, model registration, and image creation occur in distinct steps within these
pipelines for this use case. The pipeline is published or updated at the end of the build phase and gets triggered on
new data arrival.
Azure Blob Storage. Blob containers are used to store the logs from the scoring service. In this case, both the
input data and the model prediction are collected. After some transformation, these logs can be used for model
retraining.
Azure Container Registry. The scoring Python script is packaged as a Docker image and versioned in the
registry.
Azure Container Instances. As part of the release pipeline, the QA and staging environment is mimicked by
deploying the scoring web service image to Container Instances, which provides an easy, serverless way to run a
container.
Azure Kubernetes Service. Once the scoring web service image is thoroughly tested in the QA environment, it is
deployed to the production environment on a managed Kubernetes cluster.
Azure Application Insights. This monitoring service is used to detect performance anomalies.

MLOps Pipeline
This solution demonstrates an end-to-end automation of various stages of an AI project using tools that are
already familiar to software engineers. The machine learning problem is simple to keep the focus on the DevOps
pipeline. The solution uses the scikit-learn diabetes dataset and builds a ridge linear regression model to predict
the likelihood of diabetes. See Training of Python scikit-learn models for details.
This solution is based on the following three pipelines:
Build pipeline. Builds the code and runs a suite of tests.
Retraining pipeline. Retrains the model on a schedule or when new data becomes available.
Release pipeline. Operationalizes the scoring image and promotes it safely across different environments.
The next sections describe each of these pipelines.
Build pipeline
The CI pipeline gets triggered every time code is checked in. It publishes an updated Azure Machine Learning
pipeline after building the code and running a suite of tests. The build pipeline consists of the following tasks:
Code quality. These tests ensure that the code conforms to the standards of the team.
Unit test. These tests make sure the code works, has adequate code coverage, and is stable.
Data test. These tests verify that the data samples conform to the expected schema and distribution.
Customize this test for other use cases and run it as a separate data sanity pipeline that gets triggered as
new data arrives. For example, move the data test task to a data ingestion pipeline so you can test it earlier.

[NOTE] You should consider enabling DevOps practices for the data used to train the machine learning models,
but this is not covered in this article. For more information about the architecture and best practices for CI/CD
of a data ingestion pipeline, see DevOps for a data ingestion pipeline.

The following one-time tasks occur when setting up the infrastructure for Azure Machine Learning and the Python
SDK; a minimal SDK sketch follows this list:
Create the workspace that hosts all Azure Machine Learning-related resources.
Create the compute resources that run the training job.
Create the machine learning pipeline with the updated training script.
Publish the machine learning pipeline as a REST endpoint to orchestrate the training workflow. The next
section describes this step.
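A minimal sketch of these one-time setup tasks, assuming the Azure Machine Learning SDK for Python (v1) that this architecture uses; the subscription, workspace, cluster, script, and pipeline names are placeholders rather than values from the reference implementation.

# One-time infrastructure setup sketch using the Azure ML Python SDK (v1).
# All names and IDs below are placeholders.
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

# 1. Create (or reuse) the workspace that hosts all Azure ML resources.
ws = Workspace.create(name="mlops-ws",
                      subscription_id="<subscription-id>",
                      resource_group="mlops-rg",
                      location="eastus",
                      exist_ok=True)

# 2. Create the compute cluster that runs the training job.
compute_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_DS3_V2",
                                                        min_nodes=0, max_nodes=4)
compute_target = ComputeTarget.create(ws, "train-cluster", compute_config)
compute_target.wait_for_completion(show_output=True)

# 3. Define the machine learning pipeline around the training script.
train_step = PythonScriptStep(name="train",
                              script_name="train.py",
                              source_directory="./src",
                              compute_target=compute_target)
pipeline = Pipeline(workspace=ws, steps=[train_step])

# 4. Publish the pipeline as a REST endpoint to orchestrate retraining.
published = pipeline.publish(name="diabetes-retraining",
                             description="Retraining pipeline")
print("Pipeline endpoint:", published.endpoint)
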
Retraining pipeline
The machine learning pipeline orchestrates the process of retraining the model in an asynchronous manner.
Retraining can be triggered on a schedule or when new data becomes available by calling the published pipeline
REST endpoint from the previous step.
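Calling the published endpoint can be sketched as follows; the endpoint URL and experiment name are placeholders, and interactive authentication is shown only for illustration (automation would typically use a service principal).

# Sketch: trigger the published retraining pipeline through its REST endpoint.
import requests
from azureml.core.authentication import InteractiveLoginAuthentication

auth = InteractiveLoginAuthentication()
headers = auth.get_authentication_header()      # {"Authorization": "Bearer <token>"}

endpoint = "<published-pipeline-rest-endpoint>"  # copy from the published pipeline
response = requests.post(endpoint,
                         headers=headers,
                         json={"ExperimentName": "retraining"})
response.raise_for_status()
print("Submitted pipeline run:", response.json().get("Id"))
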
This pipeline covers the following steps:
Train model. The training Python script is executed on the Azure Machine Learning Compute resource to
get a new model file which is stored in the run history. Since training is the most compute-intensive task in
an AI project, the solution uses Azure Machine Learning Compute.
Evaluate model. A simple evaluation test compares the new model with the existing model. Only when
the new model is better does it get promoted. Otherwise, the model is not registered and the pipeline is
canceled.
Register model. The retrained model is registered with the Azure ML Model registry. This service provides
version control for the models along with metadata tags so they can be easily reproduced.
Release pipeline
This pipeline shows how to operationalize the scoring image and promote it safely across different environments.
This pipeline is subdivided into two environments, QA and production:
QA environment
Model Artifact trigger. Release pipelines get triggered every time a new artifact is available. A new model
registered to Azure Machine Learning Model Management is treated as a release artifact. In this case, a
pipeline is triggered for each new model that is registered.
Create scoring image. The registered model is packaged together with the scoring script and Python
dependencies (Conda YAML file) into an operationalization Docker image. The image automatically gets
versioned through Azure Container Registry.
Deploy on Container Instances. This service is used to create a non-production environment. The
scoring image is also deployed here, and this is mostly used for testing. Container Instances provides an
easy and quick way to test the Docker image.
Test web service. A simple API test makes sure the image is successfully deployed.
Production environment
Deploy on Azure Kubernetes Service. This service is used for deploying the scoring image as a web service
at scale in a production environment.
Test web service. A simple API test makes sure the image is successfully deployed.

Scalability considerations
A build pipeline on Azure DevOps can be scaled for applications of any size. Build pipelines have a maximum
timeout that varies depending on the agent they are run on. Builds can run forever on self-hosted agents (private
agents). For Microsoft-hosted agents for a public project, builds can run for six hours. For private projects, the limit
is 30 minutes.
To use the maximum timeout, set the following property in your Azure Pipelines YAML file:

jobs:
- job: <job_name>
  timeoutInMinutes: 0

Ideally, have your build pipeline finish quickly and execute only unit tests and a subset of other tests. This allows
you to validate the changes quickly and fix them if issues arise. Run long-running tests during off hours.
The release pipeline publishes a real-time scoring web service. A release to the QA environment is done using
Container Instances for convenience, but you can use another Kubernetes cluster running in the QA/staging
environment.
Scale the production environment according to the size of your Azure Kubernetes Service cluster. The size of the
cluster depends on the load you expect for the deployed scoring web service. For real-time scoring architectures,
throughput is a key optimization metric. For non-deep learning scenarios, CPU should be sufficient to handle the
load; however, for deep learning workloads, when speed is a bottleneck, GPUs generally provide better
performance compared to CPUs. To match GPU performance using CPUs, a cluster with a large number of CPUs is
usually needed. Kubernetes Service supports both CPU and GPU node types, which is why this solution uses
Kubernetes Service for image deployment. For more information, see GPUs vs CPUs for deployment of deep
learning models (blog post).
Scale the retraining pipeline up and down depending on the number of nodes in your Azure Machine Learning
Compute resource, and use the autoscaling option to manage the cluster. This architecture uses CPUs. For deep
learning workloads, GPUs are a better choice and are supported by Azure Machine Learning Compute.

Management considerations
Monitor retraining job. Machine learning pipelines orchestrate retraining across a cluster of machines
and provide an easy way to monitor them. In the Azure portal, go to the machine learning
workspace and look under the pipelines section for the logs. Alternatively, these logs are also written to Blob
Storage and can be read from there using tools such as Azure Storage Explorer.
Logging. Azure Machine Learning provides an easy way to log at each step of the machine learning life
cycle. The logs are stored in a blob container. For more information, see Enable logging in Azure Machine
Learning. For richer monitoring, configure Application Insights to use the logs.
Security. All secrets and credentials are stored in Azure Key Vault and accessed in Azure Pipelines using
variable groups.

Cost considerations
Azure DevOps is free for open-source projects and small projects with up to five users. For larger teams, purchase
a plan based on the number of users.
Compute is the biggest cost driver in this architecture and varies depending on the use case. For experimentation
and training, Azure Machine Learning is free; you pay only for the compute used. When deploying models to a
Kubernetes Service cluster, Azure Machine Learning adds a small surcharge on top of the Kubernetes Service
compute cost. You can control this cost through the expected web service load and the defined autoscaling
option. You can use either the Azure CLI or the Azure Machine Learning Python SDK to programmatically set an
autoscaling configuration.
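For example, with the Azure ML Python SDK (v1), the autoscaling bounds can be set in the AKS deployment configuration; the replica counts and utilization target below are placeholders.

# Sketch: autoscaling settings for the AKS-hosted scoring web service (SDK v1).
from azureml.core.webservice import AksWebservice

aks_config = AksWebservice.deploy_configuration(
    autoscale_enabled=True,
    autoscale_min_replicas=1,
    autoscale_max_replicas=10,
    autoscale_target_utilization=70,   # scale out above 70% average utilization
    cpu_cores=1,
    memory_gb=2,
)
# Pass aks_config to Model.deploy(...) when deploying the scoring image to AKS.
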
The retraining pipeline also requires a form of compute. This architecture uses Azure Machine Learning Compute,
but other options are available. As mentioned earlier, training models do not incur the machine learning service
surcharge; you only pay the compute cost. The compute cost depends on the cluster size, node type, and number
of nodes. You can estimate the cost for Machine Learning and other services using the Azure pricing calculator.

Deploy the solution


To deploy this reference architecture, follow the steps described in the Getting Started guide in the GitHub repo.
Design a machine learning operations (MLOps)
framework to upscale an Azure Machine Learning
lifecycle
11/13/2020 • 30 minutes to read

This client engagement (or project) focused on helping a Fortune 500 food company that ships products directly to
multiple retail outlets to improve their demand forecasting. The improvement helped them optimize the stocking
of their products in different stores across several regions of the US. To achieve this, Microsoft’s Commercial
Software Engineering (CSE) team worked with the client’s data scientists to develop customized machine learning
models for the selected regions for a pilot study. These models considered variables that had to do with:
Shopper demographics
Historical and forecasted weather
Past shipments
Product returns
Special events
This specific goal represented a major component of the project and the client realized a significant sales lift in the
early field trials. Also, the team saw a 40% reduction in forecasting Mean Absolute Percentage Error (MAPE) when
compared with a historical average baseline model.
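For reference, MAPE measures the average absolute forecast error as a percentage of the actual value; a minimal computation (with illustrative variable names) looks like this:

import numpy as np

def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent; assumes no zero actuals."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100.0)

# Example: mape([100, 80, 120], [90, 85, 130]) is roughly 8.2 (percent).
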
A key part of the project involved how to scale up the data science workflow from the pilot study to a production
level. This production-level workflow required the CSE team to:
Develop models for many regions.
Continuously update and monitor the models' performance.
Enable data and engineering teams to collaborate on all of these aspects.
The traditional data science workflow, as it exists today, is closer to a one-off lab environment than a production
workflow. This lab environment inherently has requirements that include the need to:
Prepare the data.
Experiment with different models.
Tune hyper-parameters.
Create a build/test/evaluate/refine cycle.
Most existing tools are used only for specific purposes and aren't well suited to automation. In a production-level
machine learning operation, more consideration must be given to application lifecycle management and DevOps.
The CSE team did extensive work in this area to help the client scale up the operation to production levels. The
team made the process more effective by implementing Continuous Integration/Continuous Delivery (CI/CD)
capabilities along with observability and integration with Azure services. In doing so, the team uncovered gaps in
existing MLOps guidance. Those gaps needed further refinement so MLOps could be better understood and applied
at scale. Understanding MLOps practices helps organizations ensure that the machine learning models the system
produces are production-quality models that improve business performance. The organization no longer has to
focus the majority of its time on the low-level infrastructure and engineering work required to develop and run
machine learning models for production-level operations. MLOps also helps bring the data science and software
engineering communities together; team members learn how to work together to deliver a production-ready system.
The CSE team used this project to address machine learning community needs, such as developing an MLOps
maturity model (perhaps leading to the creation of "MLOps in a Box"). These efforts target improving MLOps
adoption by understanding the common challenges facing key players in the MLOps process.

Engagement and Technical scenarios


The Engagement scenario discusses the real-world challenges that the CSE team had to solve. The Technical
scenario defines the requirements to create an MLOps lifecycle that is as reliable as the established DevOps
lifecycle.
Engagement scenario
The client delivers products directly to retail market outlets on a regular schedule. Each retail outlet varies in its
product usage patterns, so product inventory needs to vary in each weekly delivery. The client's demand forecasting
methodologies aim to maximize sales while minimizing product returns and lost sales opportunities. The project
focused on improving that forecasting by applying machine learning to the problem.
The CSE team divided the project into two phases. Phase 1 focused on developing working machine learning
models to support a field-based pilot study on the effectiveness of machine learning forecasting in a selected sales
region. The success of Phase 1 led to Phase 2. The team upscaled the initial pilot study from a minimal group of
models that supported a single geographic region to a set of sustainable production-level models that they could
apply to all of the client’s sales regions. A primary consideration for the upscaled solution was the need to
accommodate the large number of geographic regions and their local retail outlets. The team dedicated the
machine learning models to both large and small retail outlets in each region.
The pilot study determined that a model dedicated to one region’s retail outlets could use local sales history, local
demographics, weather, and special events to optimize the demand forecast for each sales outlet. Four ensemble
machine learning forecasting models serviced market outlets in a single region. The models processed data in
weekly batches. Also, the team developed two baseline models using historical data for comparison.
In the initial requirement for the upscaled solution, the CSE team selected 14 geographic regions, including small
and large market outlets, to participate. They used more than 50 machine learning forecasting models. The team
expected further system growth and continued refinement of the machine learning models. It quickly became clear
this wider-scaled machine learning solution could only be sustainable if the team based it on the best practice
principles of DevOps for the machine learning environment.

| Environment | Market region | Format | Models | Model subdivision | Model description |
| --- | --- | --- | --- | --- | --- |
| Dev | Each geographic market/region (for example, North Texas) | Large format stores (supermarkets, big box stores, and so on) | 2 ensemble models | Slow moving products | Slow and Fast each has an ensemble of a Least Absolute Shrinkage and Selection Operator (LASSO) linear regression model and a neural network with categorical embeddings |
| | | | | Fast moving products | Slow and Fast each has an ensemble of a LASSO linear regression model and a neural network with categorical embeddings |
| | | | 1 ensemble model | N/A | Historical average |
| | | Small format stores (pharmacies, convenience stores, and so on) | 2 ensemble models | Slow moving products | Slow and Fast each has an ensemble of a LASSO linear regression model and a neural network with categorical embeddings |
| | | | | Fast moving products | Slow and Fast each has an ensemble of a LASSO linear regression model and a neural network with categorical embeddings |
| | | | 1 ensemble model | N/A | Historical average |
| | Same as above for an additional 13 geographic regions | | | | |
| Prod | Same as above for the Dev environment | | | | |

The MLOps process would provide a framework for the upscaled system that addressed the full lifecycle of the
machine learning models. This framework includes development, testing, deployment, and monitoring that fulfills
the needs of a classic CI/CD process and operation of the deployed machine learning system. However, because of
its relative immaturity compared to DevOps, it became evident that existing MLOps guidance had gaps. The project
teams worked to fill in some of those gaps. They wanted to provide a functional process model that ensures the
viability of the upscaled machine learning solution.
The MLOps process developed from this project made a significant real-world step to move MLOps to a higher
level of maturity and viability. The new process is directly applicable to other machine learning projects. The CSE
team has even taken their collective learnings to build a draft of an MLOps maturity model that anyone can apply
to other machine learning projects.
Technical scenario
MLOps, also known as DevOps for machine learning, is an umbrella term. It encompasses a wide range of
philosophies, practices, and technologies involved in moving machine learning lifecycles into a production
environment. It's still a relatively new concept. There are many definitions trying to address the question, "What is
MLOps?" Many people have wondered if MLOps can cover the whole space from how data scientists prepare data
to how they ultimately deliver, monitor, and evaluate machine learning results. While DevOps has had years to
stabilize and include a set of fundamental practices, MLOps is still early in its existence. As MLOps concepts mature,
we often discover the challenges in bringing two disciplines together that often operate with different skillsets and
priorities: Software/Ops Engineering and Data Science.
Implementing MLOps in real-world production environments has unique challenges that must be overcome. Teams
can use Azure's capabilities to support MLOps patterns. Azure can also provide clients with asset management and
orchestration services to enable effective machine learning lifecycle management. Azure services are the
foundation for the MLOps solution discussed in this technical paper.

Machine learning model requirements


Much of the work during the pilot field study involved creating the machine learning models that the CSE team
would apply to the large and small retail stores in a single study region. Notable requirements for the model
included:
Use of Azure Machine Learning Services.
Initial experimental models developed in Jupyter Notebook and implemented in Python.

NOTE
Teams would use the same machine learning approach for large and small stores, but the training and scoring data
was different depending on the size of the store.

Data requires preparation for model consumption.


Data processed on a batch basis rather than in real time.
Model training done when code changes, data changes, or there is evidence that the model has become
stale.
Model performance must be viewable in Power BI dashboards.
Model performance in scoring is considered significant when MAPE <= 45% compared with a historical
average baseline model.

MLOps requirements
The team had to meet several key requirements to scale up the solution from the pilot field study. Originally, only a
few models were developed for a single sales region. They had to deploy a wider-scale implementation that
enabled the development of custom machine learning models for all sales regions that included:
Weekly batch processing for large and small stores in each region including retraining of each model with
new datasets.
Continuous refinement of the machine learning models.
Integration of the development/test/package/test/deploy process common to CI/CD in a DevOps-like
processing environment for MLOps.

NOTE
This represents a shift in how data scientists and data engineers have commonly worked in the past.

A unique model that represented each region for large and small stores based on the stores' history,
demographics, and other key variables. The model had to process the whole dataset to minimize the risk of
processing error.
The ability to initially upscale to support 14 sales regions with plans to upscale further.
Plans for additional models for longer term forecasting for regions and other store clusters.

Machine learning model solution


The machine learning lifecycle, also known as the Data Science lifecycle, fits roughly into the following high-level
process flow:

Deploy Model here can represent any operational use of the validated machine learning model. Compared to
DevOps, MLOps presents the additional challenge of how to integrate this machine learning lifecycle into the
typical CI/CD process.
This data science lifecycle doesn't follow the typical software development lifecycle. The team had to preserve the
basic process whether they did it manually or via CI/CD automation, so the Azure Machine Learning Service
solution had to satisfy two functional tasks:
Train the model.
Run scoring to obtain the forecast results.
Ultimately, batch processing of data is the basis of the architecture. Two Azure Machine Learning pipelines are
central to the process, one for training and the other for scoring. This diagram shows the data science
methodology used for the initial phase of the client project:

The team tested several algorithms. They ultimately chose an ensemble design of a LASSO linear regression model
and a Neural Network with categorical embeddings. The team used the same model, defined by the level of
product that the client could store on site, for both large and small stores. The team further subdivided the model
into fast-moving and slow-moving products.
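The exact models are the client's, but the general ensemble idea can be sketched with scikit-learn: train a LASSO regressor and a small neural network on the same features and average their predictions. Here MLPRegressor stands in for the categorical-embedding network, and the training-data arrays are illustrative only.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_ensemble(X_train, y_train):
    """Train the two ensemble members on the same training data."""
    lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
    nn = make_pipeline(StandardScaler(),
                       MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500))
    lasso.fit(X_train, y_train)
    nn.fit(X_train, y_train)
    return lasso, nn

def predict_ensemble(models, X):
    """Average the member predictions to produce the ensemble forecast."""
    return np.mean([m.predict(X) for m in models], axis=0)
```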
Data scientists train machine learning models when the team releases new code and when new data is available.
Training typically happens on a weekly cadence. Consequently, each processing run involves a large amount of
data. Because the team collects that data from many sources in different formats, it requires conditioning to
put the data into a consumable format before data scientists can process it. This data conditioning required
significant manual effort, and the CSE team identified it as a primary candidate for automation.
As mentioned, the data scientists developed and applied the experimental Azure Machine Learning models to a
single sales region in a pilot field study to evaluate the usefulness of this forecasting approach. The CSE team
considered the observed sales lift for the stores in the pilot study significant. The team considered the pilot study a
success, justifying the need to upscale the solution to full production levels starting with 14 geographic regions
and thousands of stores. The team could then use the same pattern to add additional regions.
The pilot model served as the basis for the upscaled solution, but the CSE team knew that the model would need
further refinement on a continuing basis to improve its performance.

MLOps solution
As MLOps concepts mature, teams often discover challenges in bringing together two disciplines whose principal
players, software engineers and data scientists, operate with different skillsets and priorities. And yet, there are
similarities to build on.
MLOps, like DevOps, is a development process implemented by a toolchain. The toolchain includes such things as:
Version control
Code analysis
Build automation
Continuous Integration
Testing frameworks and automation
Compliance policies integrated into CI/CD pipelines
Deployment automation
Monitoring
Disaster recovery/High availability
Package/Container management
As noted above, the solution takes advantage of existing DevOps guidance, but it's augmented to create a more
mature MLOps implementation that meets the needs of the client and the Data Science community. MLOps builds
on DevOps guidance with these additional requirements:
Data/model versioning isn't the same as code versioning : Datasets must be versioned as the schema and
origin data change.
Digital audit trail requirements : Track all changes when dealing with code and client data.
Generalization : Models differ from code for reuse, because data scientists must tune models based on the
input data and scenario. To reuse a model for a new scenario, you may need to fine-tune it or apply transfer
learning, and you need the training pipeline to do so.
Stale models : Models tend to decay over time, and you need the ability to retrain them on demand to ensure
they remain relevant in production.

MLOps challenges
Immature MLOps standard
The standard pattern for MLOps is still evolving. It's typically built from scratch and made to fit the needs of a
particular client or user. The CSE team recognized this gap and sought to use DevOps best practices in this project.
They augmented the DevOps process to fit the additional requirements of MLOps. The process the team developed
is a viable example of what an MLOps standard pattern should look like.
Differences in skillsets
Software engineers and data scientists bring unique skillsets to the team. These different skillsets can make finding
a solution that fits everyone's needs difficult. Building a well-understood workflow for model delivery from
experimentation to production is important. Each team member must share an understanding of how they can
integrate changes into the system without breaking the MLOps process.
Managing multiple models
The team may need multiple models to solve for many difficult machine learning scenarios. One of the challenges
of MLOps is managing these models, including:
A coherent versioning scheme.
A continuous evaluation and monitoring of all models.
Traceable code and data lineage is also needed to diagnose model issues and create reproducible models. Investing
in custom dashboards can help the team make sense of how deployed models are performing and decide when to
intervene. The team created such dashboards for this project.
Need for data conditioning
Data used with these models comes from many private and public sources. Because the original data is
disorganized, the machine learning model can't consume it in its raw state. The data scientists must condition the
data into a standard format for machine learning model consumption.
Much of the pilot field test focused on conditioning the raw data so that the machine learning model could process
it. In an MLOps system, the team should automate this process and track the outputs.
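As a rough illustration of the kind of conditioning involved (the actual sources, columns, and rules are specific to the client and not documented here), a conditioning step might normalize column names, parse dates, drop unusable rows, and write a standardized CSV for the training pipeline:

```python
import pandas as pd

def condition_raw_shipments(raw_path: str, out_path: str) -> None:
    """Convert one raw shipment extract into a standardized CSV schema (illustrative columns)."""
    df = pd.read_csv(raw_path)
    # Consistent, lowercase column names regardless of the source system.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    df["ship_date"] = pd.to_datetime(df["ship_date"], errors="coerce")  # hypothetical date column
    df = df.dropna(subset=["ship_date", "store_id"])                    # drop rows the model can't use
    df.to_csv(out_path, index=False)
```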

MLOps maturity model


The purpose of the MLOps maturity model is to help clarify the principles and practices and identify gaps in an
existing organization's attempt to implement MLOps. It's also a way to show a client how to incrementally grow
their MLOps capability instead of overwhelming them with everything at once. The client should use it as a guide
to:
Estimate the scope of the work for the project.
Establish success criteria.
Identify deliverables that the team hands over at the conclusion of the project.
The team built the MLOps maturity model on five levels of technical capability:

| Level | Description |
| --- | --- |
| 0 | No Ops |
| 1 | DevOps but no MLOps |
| 2 | Automated Training |
| 3 | Automated Model Deployment |
| 4 | Automated Operations (full MLOps) |

For the current version of the MLOps maturity model, see the MLOps maturity model article.

MLOps process definition


MLOps includes the whole space between acquiring raw data and delivering model output, also known as scoring:
Data conditioning
Model training
Model testing and evaluation
Build definition and pipeline
Release pipeline
Deployment
Scoring
Multiple pipelines were ultimately used to meet all process requirements. See the MLOps Process Decision Tree
article.

Basic machine learning process


The basic machine learning process is somewhat like traditional software development, but there are significant
differences. This diagram illustrates the major steps in the machine learning process:
The Experiment phase is unique to the Data Science Lifecycle. How data scientists traditionally do their work is the
basis of this lifecycle. It isn't based on how code developers do their job. The following diagram illustrates this
lifecycle in more detail.

Integrating this data development process into MLOps poses a challenge. Here, you see the pattern that the team
used to integrate the process into a form supportable by MLOps:
The role of MLOps is to create a coordinated process that can efficiently support larger scale CI/CD environments
common in production level systems. Conceptually, the MLOps model must include all process requirements from
experimentation to scoring.
The CSE team refined the MLOps process developed for this project to fit the client’s specific needs. The most
notable was the need for batch processing instead of real-time processing. As the team developed the upscaled
system, they identified and resolved some shortcomings. The most significant of these shortcomings resulted in
the development of a bridge between Azure Data Factory and Azure Machine Learning, which the team
incorporated as a built-in connector in Azure Data Factory. They created this component set to facilitate the
triggering and status monitoring necessary to make the process automation work.
Another fundamental change is that the data scientist needed the capability to export experimental code from
Jupyter notebooks into the MLOps deployment process rather than trigger training and scoring directly.
Here is the final MLOps process model concept:
IMPORTANT
Scoring is the final step. The process runs the machine learning model against data that the system uses to make predictions
based on the data used to train the model. This addresses the basic business use case requirement for demand forecasting.
The team rates the quality of the predictions using MAPE, which is short for mean absolute percentage error. It's a
measure of the prediction accuracy of statistical forecasting methods and a loss function for regression problems in
machine learning. In this project, the team considered a MAPE of <= 45% significant.
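For reference, a minimal sketch of the MAPE calculation, with zero-valued actuals excluded to avoid division by zero:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, ignoring records where the actual value is zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask]))
```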

MLOps process flow


The following diagram describes how to apply CI/CD development and release workflows to the machine learning
lifecycle:

When a PR is created from a Feature branch, the pipeline runs code validation tests to validate the quality of
the code via unit tests and code quality tests. To validate quality upstream, the pipeline also runs basic model
validation tests to validate the end-to-end training and scoring steps with a sample set of mocked data.
When the PR is merged into the master branch, the CI pipeline runs the same code validation tests and
basic model validation tests with an increased number of epochs. The pipeline then packages the artifacts,
which include the code and binaries, to run in the machine learning environment.
After the artifacts are available, a model validation CD pipeline is triggered. It runs end-to-end
validation on the development machine learning environment. A scoring mechanism is published. For a
batch scoring scenario, a scoring pipeline is published to the machine learning environment and triggered
to produce results. For a real-time scoring scenario, you can publish a web app or deploy a
container instead.
Once a milestone is created and merged into the Release branch, the same CI pipeline and model validation
CD pipeline are triggered. This time, they run against the code based in the Release branch.
You can consider the MLOps process data flow shown above as an archetype framework for projects that might
make similar architectural choices.
Code validation tests
Code validation tests for machine learning focus on validating the quality of the code base. It's the same concept as
any engineering project involving code quality tests (linting), unit tests, and determining code coverage.
Basic model validation tests
Model validation typically refers to validating the full end-to-end process steps required to produce a valid
machine learning model, including steps like:
Data validation : Ensures that the input data is valid.
Training validation : Ensures that the model can be successfully trained.
Scoring validation : Ensures that the team can successfully use the trained model for scoring with the
input data.
Running this full set of steps on the machine learning environment is expensive and time consuming. As a result,
the team designed the basic model validation tests to run all of the steps above locally on a development machine
by using the following:
Local testing dataset : Check in a small and often obfuscated dataset to the repository and consume it as
the input data source.
Local flag : A flag or argument in the model's code that indicates the code should run against the local
dataset. The flag allows the code to bypass any required call to the machine learning environment.
The goal of these validation tests isn't to evaluate the performance of the trained model. The goal is to validate that
the code for the end-to-end process is of good quality. It enables pushing quality code upstream, such as
incorporating model validation tests in the PR and CI builds, and it also lets engineers and data scientists put
breakpoints into the code for debugging purposes.
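A minimal sketch of the local-flag pattern, assuming a hypothetical `--local-run` argument, a small obfuscated dataset checked into the repository, and a registered dataset name that are all placeholders:

```python
import argparse

import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--local-run", action="store_true",
                    help="Run against the small checked-in dataset and skip Azure ML calls")
args = parser.parse_args()

if args.local_run:
    # Small, obfuscated sample committed to the repo; lets tests and breakpoints run on a dev machine.
    data = pd.read_csv("tests/data/sample_training_data.csv")
else:
    # In the machine learning environment, load the registered dataset passed to the run instead.
    from azureml.core import Run
    run = Run.get_context()
    data = run.input_datasets["training_data"].to_pandas_dataframe()
```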
Model validation CD pipeline
The goal of the model validation pipeline is to validate the end-to-end model training and scoring steps on the
machine learning environment with actual data. Any trained model that is produced is added to the model registry
and tagged, awaiting promotion once validation completes. For batch prediction, promotion can mean publishing a
scoring pipeline that uses this version of the model. For real-time scoring, the model can be tagged to indicate that
it has been promoted.
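For example, registering and tagging a trained model with the Azure Machine Learning SDK might look like the following sketch; the model name, path, and tag values are placeholders:

```python
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()

model = Model.register(
    workspace=ws,
    model_path="outputs/model.pkl",           # artifact produced by the training run
    model_name="demand-forecast-large-fast",  # hypothetical region/format/subdivision naming convention
    tags={"validated": "true", "region": "north-texas", "git_sha": "<commit>"},
)
print(model.name, model.version)
```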
Scoring CD pipeline
The scoring CD pipeline applies to the batch inference scenario, where the same model orchestrator used for
model validation triggers the published scoring pipeline.
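Publishing a batch scoring pipeline so that the orchestrator can trigger it through a REST endpoint might look like the following sketch; the compute target, script, and pipeline names are placeholders and the scoring script itself is omitted:

```python
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
compute = ComputeTarget(workspace=ws, name="cpu-cluster")  # hypothetical existing compute target

score_step = PythonScriptStep(
    name="batch-score",
    script_name="score_batch.py",   # hypothetical batch scoring script
    compute_target=compute,
    source_directory="src",
)

pipeline = Pipeline(workspace=ws, steps=[score_step])
published = pipeline.publish(
    name="batch-scoring-pipeline",
    description="Scores weekly batches with the promoted model version",
)
print(published.endpoint)  # REST endpoint the orchestrator (for example, Data Factory) can invoke
```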
Development vs. production environments
It's a good practice to separate the development (dev) and the production (prod) environments. It allows the
system to trigger the model validation CD pipeline and scoring CD pipeline at a different cadence. For the
described MLOps flow, pipelines targeting the master branch will run on the dev environment, and the pipeline
targeting the Release branch will run on the prod environment.
Code changes vs. data changes
The previous sections have mostly dealt with how to handle code changes from development to release. However,
data changes should also follow the same rigor as code changes to provide the same validation quality and
consistency in production. With a data change trigger or a timer trigger, the system should trigger the model
validation CD pipeline and the scoring CD pipeline from the model orchestrator to run the same process as code
changes in the Release branch prod environment.

MLOps personas and roles


A key requirement for any MLOps process is that it meets the needs of the many users of the process. Consider
these users as individual personas for design purposes. For this project, the team identified the user personas like
this:
Data scientist : Creates the machine learning model and its algorithms.
Engineer
Data engineer : Handles data conditioning.
Software engineer : Handles model integration into the asset package and the CI/CD workflow.
Operations or IT : Oversees system operations.
Business stakeholder : Concerned with the predictions made by the machine learning model and how well
they help the business.
Data end user : Consumes model output in some way that aids in making business decisions.
The team had to address three key findings from the persona and role studies:
Data scientists and engineers have a mismatch of approach and skills in their work. Making it easy for the
data scientist and the engineer to work collaboratively becomes a major consideration in the MLOps process
flow. It does require new skill acquisitions by all team members.
There's a need to unify all of the principal personas without alienating anyone. A way to do this is to:
Make sure they share a common understanding of the conceptual model for MLOps.
Agree on clear expectations of how all the team members will work together.
Discuss working guidelines to achieve common goals.
The business stakeholder and data end user may require a way to interact with the data output from the
models. A user-friendly UI is the most common solution.
Other teams will certainly come across similar issues in other machine learning projects as they scale up for
production use.

MLOps solution architecture


Logical architecture

What isn't shown in this diagram is the data conditioning that must occur before data is inserted into the Azure
Data Lake. Conditioning is necessary to convert the datasets that clients get from many sources, using
microservices operating as Azure Functions. The clients customize the microservices to fit their data sources and
transform them into a standardized CSV format that the training and scoring pipelines can consume.
System architecture
There were many design options available for the system architecture. The following diagram shows the end result
of the decision-making process described in Azure Machine Learning decision guide for optimal tool selection.

Batch processing architecture


The team devised the architectural design to support a batch data processing scheme. Alternate architectures are
possible but must support MLOps processes. Full use of available Azure services was a design requirement. The
following diagram shows the architecture:

Solution overview
Azure Data Factory:
1. Triggers an Azure Function to start data ingestion and the Azure Machine Learning pipeline run, passing the
required parameters.
2. Launches a Durable Function to poll the Azure Machine Learning pipeline for completion.
Custom dashboards in Power BI display the results. Other Azure dashboards, connected to SQL Azure, Azure
Monitor, and App Insights via OpenCensus Python SDK, track Azure resources. These dashboards provide
information about the health of the machine learning system. They also yield data the client used for product order
forecasting.
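For example, wiring Python application logs into Application Insights with the OpenCensus Azure exporter follows this pattern; the connection string and the custom dimensions are placeholders:

```python
import logging

from opencensus.ext.azure.log_exporter import AzureLogHandler

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logger.addHandler(AzureLogHandler(connection_string="InstrumentationKey=<your-key>"))

# Custom dimensions show up as queryable properties in Application Insights dashboards.
logger.info("Scoring run finished",
            extra={"custom_dimensions": {"region": "north-texas", "mape": 0.31}})
```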
Model orchestration
Model orchestration follows these steps:
1. When a pull request is submitted, DevOps triggers a code validation pipeline.
2. That pipeline runs unit tests, code quality tests, and model validation tests.
3. When merged into the master branch, the same code validation tests are run and DevOps packages the
artifacts.
4. After DevOps packages the artifacts, Azure Machine Learning is triggered to begin:
a. Data validation.
b. Training validation.
c. Scoring validation.
5. Once the validation completes, the final scoring pipeline is run.
6. When data changes and a new pull request is submitted, it retriggers the validation pipeline and then the
final scoring pipeline.
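Conceptually, the triggering and polling that the orchestration steps above rely on reduce to submitting a published Azure Machine Learning pipeline and waiting for it to finish, which the SDK supports directly. In this sketch the pipeline ID and experiment name are placeholders:

```python
from azureml.core import Workspace
from azureml.pipeline.core import PublishedPipeline

ws = Workspace.from_config()
published = PublishedPipeline.get(ws, id="<published-pipeline-id>")  # placeholder pipeline ID

run = published.submit(ws, experiment_name="weekly-scoring")  # hypothetical experiment name
run.wait_for_completion(show_output=True)  # the solution polls with a Durable Function instead of blocking
print(run.get_status())
```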
Enable experimentation
As mentioned, the traditional data science machine learning lifecycle doesn't support the MLOps process without
modification. It uses manual tools and manual experimentation, validation, packaging, and model hand-off that
can't easily scale to an effective CI/CD process. MLOps demands a high level of process automation. Whether a new
machine learning model is being developed or an old one is modified, the machine learning model's lifecycle must
be automated. In the Phase 2 project, the team used Azure DevOps to orchestrate and republish Azure Machine
Learning pipelines for training tasks. The long-running master branch performs basic testing of models, and stable
releases are pushed through the long-running release branch.
Source control becomes an important part of this process. Git is the version control system used to track notebook
and model code. It also supports process automation. The basic workflow implemented for using source control
applies the following principles:
Use formal code and dataset versioning.
New code development must use a branch until the code is fully developed and validated.
When you've validated the new code, you can merge it into the master branch.
Handle releases by establishing a permanent versioned branch separate from the master branch.
Version and use source control for the datasets that you conditioned for training or consumption so that
you can preserve the integrity of each dataset.
Use source control to track your Jupyter Notebook experiments.
Integration with data sources
Data scientists use many raw data sources and processed datasets to experiment with different machine learning
models. The volume of data in a production environment can be overwhelming. For the data scientists to
experiment with different models, they needed to use management tools like Azure Data Lake. The requirement for
formal identification and version control applies to all raw data, prepared datasets, and to machine learning
models.
In the project, the data scientists conditioned historical weekly shipment data since January 2017, historical and
forecasted daily weather data for each zip code, and shopper data for each store ID for input into the model.
Integration with source control
As part of getting data scientists to apply engineering best practices, it's necessary to easily integrate the tools they
use with source control systems like GitHub. It allows for machine learning model versioning, collaboration
between team members, and disaster recovery should the teams come across a loss of data or a disruption of any
systems.
Model ensemble support
The model design in this project was an ensemble model. That is, data scientists used many algorithms in the final
model design. In this case, the models used the same basic algorithm design. The only difference was that they
used different training data and scoring data. The models used the combination of a LASSO linear regression
algorithm and a neural network.
While not used in this project, the team explored an option to carry the process forward to the point where it
would support having many real-time models running in production servicing a given request. It could have
enabled the use of ensemble models in A/B testing and interleaved experiments.
End-user interfaces
The team developed end-user UIs for observability, monitoring, and instrumentation. As mentioned, dashboards
visually display the machine learning model data. These dashboards show the following data in a user-friendly
format:
Pipeline steps, including pre-processing of the input data.
The health of the machine learning model processing:
What metrics do you collect from your deployed model?
MAPE : Mean Absolute Percentage Error (the key metric to track for overall performance.
Target a MAPE value of <= 0.45 for each model.)
RMSE 0 : Root Mean Squared Error (RMSE) when the actual target value = 0
RMSE All : RMSE on the entire dataset
How do you evaluate if your model is performing as expected in production?
Is there a way to tell if production data is deviating too much from expected values?
Is your model performing poorly in production?
Do you have a failover state?
The quality of the processed data.
The scoring/predictions produced by the machine learning model.
The application populates the dashboards according to the nature of the data and how it processes and analyzes
the data. As such, the team must design the exact layout of the dashboards for each use case. Here are two sample
dashboards:
The overarching criteria for the design of these dashboards is that they provide readily usable information for
consumption by the end user of the machine learning model predictions.

NOTE
Stale models are scoring runs where the model used for scoring was trained more than 60 days before the scoring
took place. The Scoring page of the ML Monitor dashboard displays this health metric.

Components
Azure Machine Learning Service
Azure Machine Learning Compute
Azure Machine Learning Pipelines
Azure Machine Learning Model Registry
Azure Blob Storage
Azure Data Lake Storage
Azure DevOps Pipelines
Azure Data Factory (ADF)
Azure Functions for Python
Azure Monitor
Logs
Application Insights
Azure SQL Database
Azure Dashboards
Power BI

NOTE
To learn how to select the correct pipeline services to use for a specific process, see Azure Machine Learning decision guide
for optimal tool selection.

Considerations
Here you'll find a list of considerations to explore. They're based on the lessons the CSE team learned during the
project.
Environment Considerations
Data scientists develop most machine learning models using Python, often starting in Jupyter Notebooks. It
can be a challenge to implement these notebooks as production code. Jupyter notebooks are more of an
experimental tool, while Python scripts are more appropriate for production. Teams often need to spend
time refactoring model creation code into Python scripts.
If the client is new to DevOps and machine learning, be aware that experimentation and production require
different rigor. It's a good practice to separate the two.
Tools like the Azure Machine Learning Visual Designer or AutoML can be effective in getting basic models
off the ground while the client ramps up on standard DevOps practices around the rest of their solution.
Azure DevOps has plug-ins that can integrate with the Azure Machine Learning service to help trigger
pipeline steps. The MLOpsPython repo has a few examples of such pipelines.
Machine learning often requires powerful GPU machines for training. If the client doesn't already have such
hardware available, Azure Machine Learning Compute clusters are an effective path for quickly provisioning
cost-effective, powerful hardware that autoscales (see the sketch after this list). If a client has advanced
security or monitoring needs, they may wish to explore other options, such as standard VMs, Databricks, or
local compute.
For a client to be successful, their model building teams (data scientists) and deployment teams (DevOps
engineers) need to have a strong communication channel. They can accomplish this with daily stand-ups
or a formal online chat service. Either would help with the standard integration of their development
efforts in an MLOps framework.
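As referenced above, a minimal sketch of provisioning an autoscaling GPU cluster with the Azure Machine Learning SDK; the VM size, node counts, and cluster name are illustrative choices, not recommendations:

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_NC6",  # GPU-enabled size; pick per workload
    min_nodes=0,             # scale to zero when idle to save cost
    max_nodes=4,
)
cluster = ComputeTarget.create(ws, "gpu-cluster", config)
cluster.wait_for_completion(show_output=True)
```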
Data Preparation Considerations
The simplest solution for using Azure Machine Learning is to store data in a supported data storage
solution. Tools such as Azure Data Factory are effective for piping data to and from those locations on a
schedule.
It's important for clients to set up a cadence for capturing additional retraining data to keep their models up
to date. If they don't already have a data pipeline, creating one will be an important part of the overall
solution. Using a solution such as Datasets in Azure Machine Learning can be useful for versioning data to
help with traceability of models.
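Registering a dataset with versioning in Azure Machine Learning can be as simple as the following sketch; the datastore name and path are placeholders:

```python
from azureml.core import Dataset, Datastore, Workspace

ws = Workspace.from_config()
datastore = Datastore.get(ws, "training_data_store")  # hypothetical datastore name

dataset = Dataset.Tabular.from_delimited_files(path=(datastore, "conditioned/shipments/*.csv"))
dataset = dataset.register(
    workspace=ws,
    name="conditioned-shipments",
    create_new_version=True,  # each refresh becomes a traceable new version
)
print(dataset.name, dataset.version)
```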
Model Training and Evaluation Considerations
It's overwhelming for a client, who is just getting started in their machine learning journey, to jump straight
in to a full MLOps pipeline. If necessary, they can ease into it by using Azure Machine Learning to track
experiment runs and using Azure Machine Learning Compute as the training target. These options might
create a lower barrier of entry solution to begin integrating Azure services.
Going from a notebook experiment to repeatable scripts is currently still a rather rough transition for many
data scientists. The sooner we can get them writing their training code in Python scripts, the easier it will be
for them to begin versioning their training code and enabling retraining.
That isn't the only possible method. Databricks supports scheduling notebooks as jobs. But based on current
client experience, it's difficult to instrument with full DevOps practices because of testing limitations.
It's also important to understand what metrics are being used to consider a model a success. Accuracy alone
is often not good enough to determine the overall performance of one model versus another.
Compute Considerations
Customers should consider using containers to standardize their compute environments. Nearly all Azure
Machine Learning Compute targets support using Docker. Having a container handle the dependencies reduces
friction significantly, especially if the team uses many compute targets.
Model Serving Considerations
The Azure Machine Learning SDK provides an option to deploy directly to Azure Kubernetes Service from a
registered model, but doing so limits the security and metrics that are in place. It can be an easier way for clients
to test their model, but it's best to develop a more robust deployment to AKS for production workloads.

Related resources
MLOps maturity model
Learn more about MLOps
MLOps on Azure
Azure Monitor Visualizations
Machine Learning Lifecycle
Azure DevOps Machine Learning extension
Azure Machine Learning CLI
Trigger applications, processes, or CI/CD workflows based on Azure Machine Learning events
Set up model training and deployment with Azure DevOps
Set up MLOps with Azure Machine Learning and Databricks
Machine Learning Operations maturity model

The purpose of this maturity model is to help clarify the Machine Learning Operations (MLOps) principles and
practices. The maturity model shows the continuous improvement in the creation and operation of a production
level machine learning application environment. You can use it as a metric for establishing the progressive
requirements needed to measure the maturity of a machine learning production environment and its associated
processes.

Maturity model
The MLOps maturity model helps clarify the Development Operations (DevOps) principles and practices necessary
to run a successful MLOps environment. It's intended to identify gaps in an existing organization's attempt to
implement such an environment. It's also a way to show you how to grow your MLOps capability in increments
rather than overwhelm you with the requirements of a fully mature environment. Use it as a guide to:
Estimate the scope of the work for new engagements.
Establish realistic success criteria.
Identify deliverables you'll hand over at the conclusion of the engagement.
As with most maturity models, the MLOps maturity model qualitatively assesses people/culture,
processes/structures, and objects/technology. As the maturity level increases, the probability increases that
incidents or errors will lead to improvements in the quality of the development and production processes.
The MLOps maturity model encompasses five levels of technical capability:

| Level | Description | Highlights | Technology |
| --- | --- | --- | --- |
| 0 | No MLOps | Difficult to manage full machine learning model lifecycle. The teams are disparate and releases are painful. Most systems exist as "black boxes," little feedback during/post deployment. | Manual builds and deployments. Manual testing of model and application. No centralized tracking of model performance. Training of model is manual. |
| 1 | DevOps but no MLOps | Releases are less painful than No MLOps, but rely on Data Team for every new model. Still limited feedback on how well a model performs in production. Difficult to trace/reproduce results. | Automated builds. Automated tests for application code. |
| 2 | Automated Training | Training environment is fully managed and traceable. Easy to reproduce model. Releases are manual, but low friction. | Automated model training. Centralized tracking of model training performance. Model management. |
| 3 | Automated Model Deployment | Releases are low friction and automatic. Full traceability from deployment back to original data. Entire environment managed: train > test > production. | Integrated A/B testing of model performance for deployment. Automated tests for all code. Centralized tracking of model training performance. |
| 4 | Full MLOps Automated Operations | Full system automated and easily monitored. Production systems are providing information on how to improve and, in some cases, automatically improve with new models. Approaching a zero-downtime system. | Automated model training and testing. Verbose, centralized metrics from deployed model. |
The tables that follow identify the detailed characteristics for that level of process maturity. The model will
continue to evolve. This version was last updated in January 2020.

Level 0: No MLOps
| People | Model creation | Model release | Application integration |
| --- | --- | --- | --- |
| Data scientists: siloed, not in regular communications with the larger team. Data engineers (if exists): siloed, not in regular communications with the larger team. Software engineers: siloed, receive model remotely from the other team members. | Data gathered manually. Compute is likely not managed. Experiments aren't predictably tracked. End result may be a single model file manually handed off with inputs/outputs. | Manual process. Scoring script may be manually created well after experiments, not version controlled. Release handled by data scientist or data engineer alone. | Heavily reliant on data scientist expertise to implement. Manual releases each time. |
Level 1: DevOps no MLOps
| People | Model creation | Model release | Application integration |
| --- | --- | --- | --- |
| Data scientists: siloed, not in regular communications with the larger team. Data engineers (if exists): siloed, not in regular communication with the larger team. Software engineers: siloed, receive model remotely from the other team members. | Data pipeline gathers data automatically. Compute is or isn't managed. Experiments aren't predictably tracked. End result may be a single model file manually handed off with inputs/outputs. | Manual process. Scoring script may be manually created well after experiments, likely version controlled. Is handed off to software engineers. | Basic integration tests exist for the model. Heavily reliant on data scientist expertise to implement model. Releases automated. Application code has unit tests. |

Level 2: Automated Training


| People | Model creation | Model release | Application integration |
| --- | --- | --- | --- |
| Data scientists: working directly with data engineers to convert experimentation code into repeatable scripts/jobs. Data engineers: working with data scientists. Software engineers: siloed, receive model remotely from the other team members. | Data pipeline gathers data automatically. Compute managed. Experiment results tracked. Both training code and resulting models are version controlled. | Manual release. Scoring script is version controlled with tests. Release managed by software engineering team. | Basic integration tests exist for the model. Heavily reliant on data scientist expertise to implement model. Application code has unit tests. |

Level 3: Automated Model Deployment


| People | Model creation | Model release | Application integration |
| --- | --- | --- | --- |
| Data scientists: working directly with data engineers to convert experimentation code into repeatable scripts/jobs. Data engineers: working with data scientists and software engineers to manage inputs/outputs. Software engineers: working with data engineers to automate model integration into application code. | Data pipeline gathers data automatically. Compute managed. Experiment results tracked. Both training code and resulting models are version controlled. | Automatic release. Scoring script is version controlled with tests. Release managed by continuous delivery (CI/CD) pipeline. | Unit and integration tests for each model release. Less reliant on data scientist expertise to implement model. Application code has unit/integration tests. |

Level 4: Full MLOps Automated Retraining


| People | Model creation | Model release | Application integration |
| --- | --- | --- | --- |
| Data scientists: working directly with data engineers to convert experimentation code into repeatable scripts/jobs; working with software engineers to identify markers for data engineers. Data engineers: working with data scientists and software engineers to manage inputs/outputs. Software engineers: working with data engineers to automate model integration into application code; implementing post-deployment metrics gathering. | Data pipeline gathers data automatically. Retraining triggered automatically based on production metrics. Compute managed. Experiment results tracked. Both training code and resulting models are version controlled. | Automatic release. Scoring script is version controlled with tests. Release managed by continuous integration and CI/CD pipeline. | Unit and integration tests for each model release. Less reliant on data scientist expertise to implement model. Application code has unit/integration tests. |
Azure Machine Learning decision guide for optimal
tool selection

Microsoft Azure offers a myriad of services and capabilities. Building an end-to-end machine learning pipeline
from experimentation to deployment often requires bringing together a set of services from across Azure. While it
may be possible to have one pipeline do it all, there are tradeoffs when you don't use the services for what they're
best at.
So, then you have to ask the question: When is it worth it to adopt each service for your use case? The answer
often depends on different details that are not necessarily related to the functional requirements. The main factors
are:
The skillsets on your team.
How the Azure service plugs into your existing architecture.
The maintainability of the solution that you build using the service.
The cost of these services at scale.
This document focuses on Azure services that you can use to support data or machine learning workloads. While
not exhaustive, this document covers the most popular Azure service options for supporting the end-to-end
workflow:
1. Experimentation
2. Overall Orchestration/Scheduling
3. Data Transfer
4. Data Transformation
5. Model Training
6. Model Deployment
7. Monitoring

Services for each stage


Listed below are the service options that you should consider for each stage of an end-to-end machine learning
pipeline:
Service options
| Stage | Service options |
| --- | --- |
| Experimentation | Azure Machine Learning Notebook VMs; Databricks Notebooks; Azure Machine Learning Experiment for Python SDK; Azure Data Science Virtual Machine (DSVM) |
| Overall Orchestration / Scheduling | Azure Logic Apps; Azure Data Factory; Azure Machine Learning Pipelines; Azure DevOps |
| Data Transfer | Azure Data Factory Copy Activity; Azure Machine Learning DataTransferStep |
| Compute | Databricks; Azure Machine Learning Compute |
| Tracking / Versioning options | Experiment/hyper-tuning tracking: Azure Machine Learning Experiments, Databricks and MLflow Tracking. Data versioning/data drift: Azure Machine Learning Datasets. Model versioning: Azure Machine Learning Model Management Service, Databricks and MLflow Model Registry |
| Model Training | Azure Machine Learning Pipelines; Databricks |
| Model Deployment | Batch scoring in Azure Machine Learning Pipeline; real-time deployment in Azure Machine Learning Service: Azure Kubernetes Service (AKS), Azure Container Instance, Azure App Service, Azure Functions, IoT Edge, Azure Machine Learning Model Deployment |
| Monitoring | Azure Monitor; Application Insights; Azure Dashboards; Power BI |

Use No Code or Code implementation approach


The first decision you need to make is whether to use a No Code implementation approach or the traditional Code
implementation approach. Each has its own benefits and tradeoffs.
No Code
If you don't want to code your own solutions, a set of tools is available for building workflows without writing any
code:
For experimentation, use Azure Machine Learning Designer.
For overall orchestration/scheduling, use Logic Apps, especially if integrating to Microsoft 365 suite.
For data transfer and data transformation, use Data Factory Data Flows. If datasets are simple and on a
small scale, Azure Machine Learning Designer can also handle them.
For model training and model deployment, use Azure Machine Learning Designer. It supports both real-
time and batch deployments.
For monitoring, use Azure Monitor with Azure Dashboards, which lets you click to pin visuals and set up
alerts without code. For more configuration, you can use Power BI to create historical dashboards.
The primary issue you’ll come across here is that you must work within the constraints of the services. However, if
your use case fits within these limitations, these services could be a good solution for you. They're always evolving
and their capabilities will expand over time. So you should familiarize yourself with their latest features at the time
you consider them. This diagram summarizes the process for the No Code option.

Code
If you want to code or need the flexibility that a coded solution offers, all of the options described have a “code-
like” interface. The options also have a representation of processing logic that you can export to JSON or YAML
format and check in the exported files to a code repository. From there, deployment is handled through Azure
DevOps or scripts. This diagram summarizes the Code option process.

Experimentation: Notebooks vs. Python/R scripts


Depending on the skillsets or comfort level of your team’s data scientists/engineers with notebooks or plain
scripts, there are choices for experimentation that support both options.
Notebooks
| Type | Description |
| --- | --- |
| Azure Machine Learning Notebook VMs | Azure Machine Learning manages them. The data scientist only navigates to the link to interact with Jupyter notebooks. One VM backs them; you can stop and start the VM at will. The Notebook VM uses Azure Machine Learning SDKs to interact with data stored in Azure. This option allows you to pick the compute instance needed for experimentation based on memory, CPU, or GPU needs. |
| Databricks Notebooks | The Azure Databricks workspace stores them with the help of Git integration. To use them, set up a cluster to run Databricks Notebooks. They use built-in dbutils to access data stored in Azure. They cost more than Azure Machine Learning Notebook VMs. |
| Jupyter Notebook | See Azure Machine Learning Notebook VMs above. |

Python/R scripts
| Type | Description |
| --- | --- |
| Azure Machine Learning Experiment from Python SDK | If you're using Python or scripts, you can submit them directly to Azure Machine Learning as steps in a pipeline. You can also run a Databricks Notebook or another method of stepping through this method, but you still need to create the actual pipeline using scripts of some kind. If you go this route, you'll have to upskill your engineers in Azure Machine Learning Pipelines. A benefit is that you can leverage dataset connections to existing data in Azure. Note that pipeline startup cost can be prohibitive to iterating quickly. |
| DSVM | A catch-all for teams who want a GPU or non-GPU VM with standard machine learning frameworks pre-installed, but full flexibility in what tooling to use for coding. There's a low amount of upskilling needed. |
| Local experimentation | If you don't require compute power in the cloud, local experimentation is also an option. |

Overall Orchestration and Scheduling


Trigger Options
The next table lists which Azure services are best for supported trigger options.

| Triggered by | Service/system |
| --- | --- |
| Code | Azure DevOps |
| Schedule | Azure Machine Learning Pipelines (can only trigger itself) |
| Data/schedule | Azure Data Factory |
| Events/alerts/other non-Azure products | Azure Logic Apps |


Trigger options with scheduling
This table adds scheduling options to the triggers.

| | Azure DevOps | Azure Machine Learning Pipeline | Azure Data Factory | Logic Apps |
| --- | --- | --- | --- | --- |
| Schedule | Cron schedule (run at these hours on these days) | Recurrence-based | Recurrence-based, plus additional support for tumbling windows | Recurrence-based |
| Event-based trigger | Pull request, branch, and build completion triggers. Artifact triggers not available in new YAML builds. | None. | Blob creation and blob deletion events only. | Many triggers from Microsoft and non-Microsoft services: Twitter, Dropbox, SharePoint, and so on. |
| Manual intervention or approval based | Yes, limited. | No. | No. | No. |
| Integration with other orchestrators | Yes, limited. Supports deployment to most Azure services. Can call and wait for an Azure Machine Learning pipeline from an agentless task. | No built-in support for Azure DevOps, Azure Data Factory, or Logic Apps. | Yes, limited. Can run an Azure Machine Learning pipeline. | Yes, limited. Can trigger an Azure DevOps build. Can fire-and-forget trigger Azure Data Factory. No integration with Azure Machine Learning pipelines. |

Data Transfer Options


| Type | Description |
| --- | --- |
| Azure Data Factory Copy Activity | Large scale (GBs to TBs) with many options for sources and sinks. |
| Azure Machine Learning DataTransferStep | Small scale (MBs to GBs) with limited options for sources and sinks. |

Compute Options
| Type | Description |
| --- | --- |
| Azure Machine Learning Compute Instance | Scalable compute instances that work for GPU or non-GPU clusters. You run Python or R code in configurable Conda environments managed by Azure Machine Learning. It helps scale out multiple jobs but doesn't handle distributed data partitioning/execution except in unique cases. |
| Databricks | A scalable compute instance that handles distributed data partitioning/job execution on top of Spark. Big data jobs will likely execute faster on Databricks. You have to manage dependencies and environments. Compute for Databricks is more expensive. |
| Azure Synapse (preview) | Open-source Spark/RDD processing, distributed (big data analytics). |
| Big Data Cluster/SQL 2019 | Big data analytics. |

Tracking and Versioning options


| Type | Description |
| --- | --- |
| Experiment/hyper-tuning tracking | Azure Machine Learning Experiments; Databricks and MLflow Tracking |
| Data versioning/data drift | Azure Machine Learning Datasets |
| Model versioning | Azure Machine Learning Model Management Service; Databricks and MLflow Model Registry |

Model Training options


| Type | Description |
| --- | --- |
| Option 1 | Azure Machine Learning Pipelines |
| Option 2 | Databricks |

Model deployment options


| Type | Description |
| --- | --- |
| Batch scoring in Azure Machine Learning Pipeline | Batch deployment and scoring in an Azure Machine Learning pipeline. |
| Real-time deployment in Azure Machine Learning Service | Azure Machine Learning Service supports real-time deployment and scoring using: AKS, Azure Container Instance, Azure App Service, Azure Functions, IoT Edge, and more. |

Monitoring options
| Type | Description |
| --- | --- |
| Azure Monitor | Application Insights; Azure Dashboards |
| Power BI | Analytics and reports |

Resources
Technical Paper: MLOps framework to upscale machine learning lifecycle with Azure Machine Learning
MLOps maturity model
Machine Learning model training with AKS

Solution Idea

Training models with large datasets is a complex and resource-intensive task. Use familiar tools such as
TensorFlow and Kubeflow to simplify training of machine learning models. Your ML models run in AKS clusters
backed by GPU-enabled VMs.

Architecture

Download an SVG of this architecture.

Data Flow
1. Package the ML model into a container and publish it to Azure Container Registry (ACR).
2. Azure Blob storage hosts the training datasets and the trained model.
3. Use Kubeflow to deploy the training job to AKS; distributed training jobs on AKS include parameter servers and
worker nodes.
4. Serve the production model by using Kubeflow, promoting a consistent environment across test, control, and
production.
5. AKS supports GPU-enabled VMs.
6. Developers can build features that query the model running in the AKS cluster.
Movie recommendations on Azure
11/2/2020 • 4 minutes to read

This example scenario shows how a business can use machine learning to automate product recommendations for
their customers. An Azure Data Science Virtual Machine (DSVM) is used to train a model on Azure that
recommends movies to users based on ratings that have been given to movies.
Recommendations can be useful in various industries from retail to news to media. Potential applications include
providing product recommendations in a virtual store, providing news or post recommendations, or providing
music recommendations. Traditionally, businesses had to hire and train assistants to make personalized
recommendations to customers. Today, we can provide customized recommendations at scale by using Azure to
train models to understand customer preferences.

Relevant use cases


Consider this scenario for the following use cases:
Movie recommendations on a website.
Consumer product recommendations in a mobile app.
News recommendations on streaming media.

Architecture

This scenario covers the training and evaluating of the machine learning model using the Spark alternating least
squares (ALS) algorithm on a dataset of movie ratings. The steps for this scenario are:
1. The front-end website or app service collects historical data of user-movie interactions, which are
represented in a table of user, item, and numerical rating tuples.
2. The collected historical data is stored in a blob storage.
3. A DSVM is often used to experiment with or productize a Spark ALS recommender model. The ALS model is
trained using a training dataset, which is produced from the overall dataset by applying the appropriate data
splitting strategy. For example, the dataset can be split into sets randomly, chronologically, or stratified,
depending on the business requirement. Similar to other machine learning tasks, a recommender is
validated by using evaluation metrics (for example, precision@k, recall@k, MAP, nDCG@k).
4. Azure Machine Learning is used for coordinating the experimentation, such as hyperparameter sweeping
and model management.
5. A trained model is preserved on Azure Cosmos DB, which can then be applied for recommending the top k
movies for a given user.
6. The model is then deployed onto a web or app service by using Azure Container Instances or Azure
Kubernetes Service.
For an in-depth guide to building and scaling a recommender service, see Build a real-time recommendation API
on Azure.
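
To make step 3 concrete, here is a minimal PySpark sketch of training and evaluating an ALS model. The storage path and column names are illustrative assumptions; the Microsoft Recommenders notebooks referenced in this article cover the complete workflow.

from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.getOrCreate()

# Assumed schema: userId, movieId, rating (for example, a MovieLens export in Blob storage).
ratings = spark.read.csv("wasbs://data@<account>.blob.core.windows.net/ratings.csv",
                         header=True, inferSchema=True)
train, test = ratings.randomSplit([0.8, 0.2], seed=42)

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=10, regParam=0.05, coldStartStrategy="drop")
model = als.fit(train)

predictions = model.transform(test)
rmse = RegressionEvaluator(metricName="rmse", labelCol="rating",
                           predictionCol="prediction").evaluate(predictions)
print("RMSE:", rmse)

# Top-10 recommendations per user, ready to be written to Azure Cosmos DB.
top_k = model.recommendForAllUsers(10)
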
Components
Data Science Virtual Machine (DSVM) is an Azure virtual machine with deep learning frameworks and tools
for machine learning and data science. The DSVM has a standalone Spark environment that can be used to
run ALS.
Azure Blob storage stores the dataset for movie recommendations.
Azure Machine Learning is used to accelerate the building, managing, and deploying of machine learning
models.
Azure Cosmos DB enables globally distributed and multi-model database storage.
Azure Container Instances is used to deploy the trained models to web or app services, optionally using
Azure Kubernetes Service.
Alternatives
Azure Databricks is a managed Spark cluster where model training and evaluating is performed. You can set up a
managed Spark environment in minutes, and autoscale up and down to help reduce the resources and costs
associated with scaling clusters manually. Another resource-saving option is to configure inactive clusters to
terminate automatically.

Considerations
Availability
Machine-learning-built apps are split into two resource components: resources for training, and resources for
serving. Resources required for training generally do not need high availability, as live production requests do not
directly hit these resources. Resources required for serving need to have high availability to serve customer
requests.
For training, the DSVM is available in multiple regions around the globe and meets the service level agreement
(SLA) for virtual machines. For serving, Azure Kubernetes Service provides a highly available infrastructure. Agent
nodes also follow the SLA for virtual machines.
Scalability
If you have a large data size, you can scale your DSVM to shorten training time. You can scale a VM up or down by
changing the VM size. Choose a memory size large enough to fit your dataset in-memory and a higher vCPU count
in order to decrease the amount of time that training takes.
Security
This scenario can use Azure Active Directory to authenticate users for access to the DSVM, which contains your
code, models, and (in-memory) data. Data is stored in Azure Storage prior to being loaded on a DSVM, where it is
automatically encrypted using Storage Service Encryption. Permissions can be managed via Azure Active Directory
authentication or role-based access control.
Deploy this scenario
Prerequisites : You must have an existing Azure account. If you don't have an Azure subscription, create a free
account before you begin.
All the code for this scenario is available in the Microsoft Recommenders repository.
Follow these steps to run the ALS quickstart notebook:
1. Create a DSVM from the Azure portal.
2. Clone the repo in the Notebooks folder:

cd notebooks
git clone https://github.com/Microsoft/Recommenders

3. Install the conda dependencies following the steps described in the SETUP.md file.
4. In a browser, go to your jupyterlab VM and navigate to
notebooks/00_quick_start/als_pyspark_movielens.ipynb .

5. Execute the notebook.

Related resources
For an in-depth guide to building and scaling a recommender service, see Build a real-time recommendation API
on Azure. For tutorials and examples of recommendation systems, see Microsoft Recommenders repository.
Optimize Marketing with Machine Learning
11/13/2020 • 2 minutes to read

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Marketing campaigns are about more than the message being delivered; when and how that message is delivered
is just as important. Without a data-driven, analytical approach, campaigns can easily miss opportunities or
struggle to gain traction.
Through machine learning informed by historical campaign data, this solution helps predict customer responses
and recommends an optimized plan for connecting with your leads, including the best channel to use (email, SMS, a cold call, and so on), the best day of the week, and the best time of day.
Optimizing your campaigns with machine learning helps improve both sales leads and revenue generation and can
provide strong ROI for your marketing investment.
In this solution, SQL Server R Services brings the compute to the data by running R on the computer that hosts the
database.

Architecture

Download an SVG of this architecture.

Components
SQL Server R Services: SQL Server stores the campaign and lead data. R-based analytics train the predictive models and produce predicted results for consumption using R.
Azure Machine Learning: Machine Learning helps you design, test, operationalize, and manage predictive
analytics solutions in the cloud.
Power BI provides an interactive dashboard with visualization that uses data stored in SQL Server to drive
decisions on the predictions.

Next steps
Get started with SQL Server R Services
Learn more about Machine Learning
Learn more about Power BI
Personalized Offers
11/2/2020 • 3 minutes to read

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

In today's highly competitive and connected environment, modern businesses can no longer survive with generic,
static online content. Furthermore, marketing strategies using traditional tools are often expensive, hard to
implement, and do not produce the desired return on investment. These systems often fail to take full advantage of
the data collected to create a more personalized experience for the user.
Surfacing offers that are customized for the user has become essential to building customer loyalty and remaining
profitable. On a retail website, customers desire intelligent systems which provide offers and content based on their
unique interests and preferences. Today's digital marketing teams can build this intelligence using the data
generated from all types of user interactions. By analyzing massive amounts of data, marketers have the unique
opportunity to deliver highly relevant and personalized offers to each user. However, building a reliable and
scalable big data infrastructure, and developing sophisticated machine learning models that personalize to each
user is not trivial.

Architecture
Download an SVG of this architecture.

Description
Save time and let one of these trained SI partners help you with a proof of concept, deployment & integration of
this solution.
The Cortana Intelligence Suite provides advanced analytics tools through Microsoft Azure - data ingestion, data
storage, data processing and advanced analytics components - all of the essential elements for building a
personalized offer solution.
This solution combines several Azure services to provide powerful advantages. Event Hubs collects real-time
consumption data. Stream Analytics aggregates the streaming data and makes it available for visualization, as well
as updating the data used in making personalized offers to the customer. Azure DocumentDB stores the customer,
product and offer information. Azure Storage is used to manage the queues that simulate user interaction. Azure
Functions are used as a coordinator for the user simulation and as the central portion of the solution for generating
personalized offers. Azure Machine Learning implements and executes the user to product affinity scoring and
when no user history is available Azure Cache for Redis is used to provide pre-computed product affinities for the
customer. Power BI visualizes the real-time activity of the system and, using the data from DocumentDB, the behavior of the various offers.
The 'Deploy' button will launch a workflow that will deploy an instance of the solution within a Resource Group in
the Azure subscription you specify. The solution includes multiple Azure services (described above) and provides at
the end a few short instructions necessary to have a working end-to-end solution with simulated user behavior.
For post deployment instructions and more details on the technical implementation, please see the instructions
here.

Data Flow
1. User activity on the website is simulated with an Azure Function and a pair of Azure Storage Queues.
2. Personalized Offer Functionality is implemented as an Azure Function. This is the key function that ties
everything together to produce an offer and record activity. Data is read in from Azure Cache for Redis and Azure DocumentDB, and product affinity scores are computed by Azure Machine Learning (if no history exists for the user, pre-computed affinities are read in from Azure Cache for Redis).
3. Raw user activity data (Product and Offer Clicks), Offers made to users, and performance data (for Azure
Functions and Azure Machine Learning) are sent to Azure Event Hub.
4. The offer is returned to the User. In our simulation this is done by writing to an Azure Storage Queue and picked
up by an Azure Function in order to produce the next user action.
5. Azure Stream Analytics analyzes the data to provide near real-time analytics on the input stream from the Azure
Event Hub. The aggregated data is sent to Azure DocumentDB. The raw data is sent to Azure Data Lake Storage.
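
The personalized-offer function in step 2 follows a lookup-with-cold-start-fallback pattern. The Python sketch below only illustrates that pattern; the deployed solution uses its own triggers, bindings, and key names, and the cache keys, environment variables, and HTTP route shown here are assumptions.

import json
import os

import azure.functions as func
import redis

# Connection details come from application settings (hypothetical setting names).
cache = redis.Redis(host=os.environ["REDIS_HOST"], port=6380,
                    password=os.environ["REDIS_KEY"], ssl=True)

def main(req: func.HttpRequest) -> func.HttpResponse:
    user_id = req.params.get("userId")
    # Prefer per-user affinities; fall back to pre-computed cold-start affinities.
    affinities = cache.get(f"affinity:{user_id}") or cache.get("affinity:cold-start")
    offers = json.loads(affinities) if affinities else []
    best_offer = offers[0] if offers else {}
    return func.HttpResponse(json.dumps(best_offer), mimetype="application/json")
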
Personalized marketing solutions
11/13/2020 • 2 minutes to read

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Personalized marketing is essential for building customer loyalty and remaining profitable. Reaching customers
and getting them to engage is harder than ever, and generic offers are easily missed or ignored. Current marketing
systems fail to take advantage of data that can help solve this problem.
Marketers using intelligent systems and analyzing massive amounts of data can deliver highly relevant and
personalized offers to each user, cutting through the clutter and driving engagement. For example, retailers can
provide offers and content based on each customer's unique interests and preferences, putting products in front of
the people most likely to buy them.
By personalizing your offers, you'll deliver an individualized experience for every current or prospective customer,
boosting engagement and improving customer conversion, lifetime value, and retention.

Architecture

Download an SVG of this architecture.

Components
Event Hubs ingests raw click-stream data from Functions and passes it on to Stream Analytics.
Azure Stream Analytics: Stream Analytics aggregates clicks in near real-time by product, offer, and user to write
to Azure Cosmos DB and also archives raw click-stream data to Azure Storage.
Azure Cosmos DB stores aggregated data of clicks by user, product, and offer as well as user-profile information.
Storage Accounts: Azure Storage stores archived raw click-stream data from Stream Analytics.
Azure Functions takes in user clickstream data from website and reads existing user history from Azure Cosmos
DB. These data are then run through the Machine Learning web service or used along with the cold-start data in
Azure Cache for Redis to obtain product-affinity scores. Product-affinity scores are used with the personalized-
offer logic to determine the most relevant offer to present to the user.
Azure Machine Learning: Machine Learning helps you design, test, operationalize, and manage predictive
analytics solutions in the cloud.
Azure Cache for Redis stores pre-computed cold-start product affinity scores for users without history.
Power BI Visualizes user activity data as well as offers presented by reading in data from Cosmos DB.

Next steps
Learn more about Event Hubs
Learn more about Stream Analytics
Learn how to use Azure Cosmos DB
Learn more about Azure Storage
Learn how to create functions
Learn more about machine learning
Learn how to use Azure Cache for Redis
Learn about Power BI
Population Health Management for Healthcare
11/2/2020 • 3 minutes to read

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Population Health Management is an important tool that is increasingly being used by health care providers to
manage and control the escalating costs. The crux of Population Health Management is to use data to improve
health outcomes. Tracking, monitoring, and benchmarking are the three bastions of Population Health
Management, aimed at improving clinical and health outcomes while managing and reducing cost.
In this solution, we will be leveraging the clinical and socioeconomic in-patient data generated by hospitals for
population health reporting. As an example of a machine learning application with population health management,
a model is utilized to predict length of hospital stay. It is geared towards hospitals and health care providers to
manage and control the health care expenditure through disease prevention and management. You can learn about
the data used and the length of hospital stay model in the manual deployment guide for the solution. Hospitals can
use these results to optimize care management systems and focus their clinical resources on patients with more
urgent need. Understanding the communities they serve through population health reporting can help hospitals
transition from fee-for-service payments to value-based care while reducing costs and providing better care.

Architecture
Download an SVG of this architecture.

Description
Estimated Daily Cost: $156
Using Cortana Intelligence Suite you can put together and deploy from the ground up a Population Health
Management solution by following the instructions here. To see the entire Population Health Management solution
for Health care using Cortana Intelligence Suite in action without having to spin up and connect all the components
manually, you can use the automated deployment option available here.
The 'Deploy' button will launch a workflow that will deploy an instance of the solution within a Resource Group in
the Azure subscription you specify. The architecture diagram below shows the data flow and the end-to-end
pipeline for Population Health Management Solution for Healthcare. The solution includes multiple Azure services
and requires a few necessary manual steps to have a working end-to-end solution with simulated in-patient data
from hospitals.
The architecture diagram above shows the solution design for Population Health Management Solution for
Healthcare. The solution is composed of several Azure components that perform various tasks, viz. data ingestion,
data storage, data movement, advanced analytics and visualization. Azure Event Hub is the ingestion point of raw
records that will be processed in this solution. These are then pushed to Data Lake Store for storage and further
processing by Azure Stream Analytics. A second Stream Analytics job sends selected data to Power BI for near real-time visualizations. Azure Data Factory orchestrates, on a schedule, the scoring of the raw events from the Azure Stream Analytics job by utilizing Azure Data Lake Analytics for processing with both U-SQL and R. Results of the
scoring are then stored in Azure Data Lake Store and visualized using Power BI.

Post Deployment Steps


Once the solution is deployed to the subscription, you can see the various services deployed by clicking the
resource group name on the final deployment screen. Alternatively you can use Azure management portal to see
the resources provisioned in your resource group in your subscription. The source code of the solution as well as
manual deployment instructions can be found here. The post deployment steps constitute monitoring the health of
your deployment and visualizing the Population Health Report in real time as well as the results of the predictions
from length of stay model. The post deployment instructions can be found here.
Predict Length of Stay and Patient Flow
11/13/2020 • 2 minutes to read

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

For the people running a healthcare facility, length of stay (the number of days from patient admission to discharge)
matters. However, that number can vary across facilities and across disease conditions and specialties, even within
the same healthcare system, making it harder to track patient flow and plan accordingly.
This Azure solution helps hospital administrators use the power of machine learning to predict the length of stay
for in-hospital admissions, to improve capacity planning and resource utilization. A Chief Medical Information
Officer might use a predictive model to determine which facilities are overtaxed and which resources to bolster
within those facilities, and a Care Line Manager might use it to determine if there will be adequate staff resources to
handle the release of a patient.
Being able to predict length of stay at the time of admission helps hospitals provide higher quality care and
streamline their operational workload. It also helps accurately plan for discharges, lowering other quality measures
such as readmissions.

Architecture

Download an SVG of this architecture.

Components
SQL Server R Services: Stores the patient and hospital data. Trains the predictive models and provides predicted results for consumption using R.
Power BI provides an interactive dashboard with visualization that uses data stored in SQL Server to drive
decisions on the predictions.
Azure Machine Learning: Machine Learning helps you design, test, operationalize, and manage predictive
analytics solutions in the cloud.

Next steps
Learn more about SQL Server
Learn more about Power BI
Learn more about Machine Learning
Predictive Insights with Vehicle Telematics
11/13/2020 • 2 minutes to read

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Learn how car dealerships, manufacturers, and insurance companies can use Microsoft Azure to gain predictive
insights on vehicle health and driving habits.
This solution is built on the Azure managed services: Event Hubs, Azure Stream Analytics, Azure Machine Learning,
Storage Accounts, HDInsight, Data Factory, Azure SQL Database and Power BI. These services run in a high-
availability environment, patched and supported, allowing you to focus on your solution instead of the
environment they run in.

Architecture
Download an SVG of this architecture.

Components
Event Hubs ingests diagnostic events and passes them on to Stream Analytics and an Azure ML Web Service.
Azure Stream Analytics: Stream Analytics accepts the input stream from Event Hubs, calls an Azure ML Web
Service to do predictions, and sends the stream to Azure Storage and Power BI.
Azure Machine Learning: Machine Learning helps you design, test, operationalize, and manage predictive
analytics solutions in the cloud and deploy web services that can be called by Stream Analytics and Azure Data
Factory.
Storage Accounts: Azure Storage stores diagnostic events stream data from Stream Analytics.
Azure Data Factory uses HDInsight to run Hive queries to process the data and load it into Azure SQL Database.
Azure SQL Database: SQL Database stores the data processed by Data Factory and HDInsight and is accessed by Power BI for analysis of the telemetry data.
This solution uses Power BI, but Power BI Embedded can also be used to analyze the telemetry data.

Next steps
Learn more about Event Hubs
Learn more about Stream Analytics
Learn more about Azure Machine Learning
Learn more about Azure Storage
Learn more about HDInsight
Learn more about Azure Data Factory
Learn more about SQL Database
Learn more about Power BI
Predictive Maintenance
11/2/2020 • 2 minutes to read

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

This Predictive Maintenance solution monitors aircraft and predicts the remaining useful life of aircraft engine
components.

Architecture

Download an SVG of this architecture.

Overview
This solution demonstrates how to combine real-time data from sensors with advanced analytics to monitor
aircraft parts in real time and also predict the parts' remaining useful life.

Details
Air travel is central to modern life. However, aircraft engines are expensive, and keeping them up and running
requires frequent maintenance by highly skilled technicians. Modern aircraft engines are equipped with highly
sophisticated sensors to track the functioning of these machines. By combining the data from these sensors with
advanced analytics, it's possible to both monitor the aircraft in real time, as well as predict the remaining useful life
of an engine component so that maintenance can be scheduled in a timely manner to prevent mechanical failures.
This predictive maintenance solution monitors aircraft and predicts the remaining useful life of aircraft engine
components. It is an end-to-end solution that includes data ingestion, data storage, data processing and advanced
analytics - all essential for building a predictive maintenance solution. While it is customized for aircraft monitoring,
it can easily be generalized for other predictive maintenance scenarios. The data source of this solution is
comprised of or derived from publicly available data from the NASA data repository using the Turbofan Engine
Degradation Simulation Data Set.
The Deploy button will launch a workflow that will deploy an instance of the solution within a Resource Group in
the Azure subscription you specify. The solution includes multiple Azure services (described below) along with a
web job that simulates data so that immediately after deployment you have a working end-to-end demo.

Technical details and workflow


1. The simulation data is streamed by a newly deployed Azure Web Job, AeroDataGenerator.
2. This synthetic data feeds into the Azure Event Hubs service as data points.
3. Two Azure Stream Analytics jobs analyze the data to provide near real-time analytics on the input stream from
the event hub. One of the Stream Analytics jobs archives all raw incoming events to the Azure Storage service
for later processing by the Azure Data Factory service, and the other publishes results onto a Power BI
dashboard.
4. The HDInsight service is used to run Hive scripts (orchestrated by Azure Data Factory) to provide aggregations
on the raw events that were archived by the aforementioned Stream Analytics job.
5. Azure Machine Learning is used (orchestrated by Azure Data Factory) to make predictions on the remaining
useful life (RUL) of a particular aircraft engine, given the inputs received.
6. Azure SQL Database is used (managed by Azure Data Factory) to store the prediction results received from
Azure Machine Learning. These results are then consumed in the Power BI dashboard. A stored procedure is
deployed in the SQL Database and later invoked in Azure Data Factory pipeline to store the ML prediction
results into the scoring result table.
7. Azure Data Factory handles orchestration, scheduling, and monitoring of the batch processing pipeline.
8. Finally, Power BI is used for results visualization, so that aircraft technicians can monitor the sensor data from an
airplane or across the fleet in real time and use visualizations to schedule engine maintenance.
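
As an illustration of step 6, the stored procedure that persists scoring results could be invoked from Python with pyodbc as sketched below. The server, database, procedure name, and parameters are hypothetical; in the deployed solution this call is orchestrated by Azure Data Factory rather than run by hand.

import pyodbc

conn_str = ("Driver={ODBC Driver 17 for SQL Server};"
            "Server=tcp:<server>.database.windows.net,1433;"
            "Database=<database>;Uid=<user>;Pwd=<password>;Encrypt=yes;")

engine_id, cycle, predicted_rul = 7, 154, 112.3   # example scoring output

with pyodbc.connect(conn_str) as conn:
    cursor = conn.cursor()
    # Hypothetical stored procedure that writes one row to the scoring result table.
    cursor.execute("{CALL dbo.InsertScoringResult (?, ?, ?)}",
                   engine_id, cycle, predicted_rul)
    conn.commit()
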
Predictive Marketing with Machine Learning
11/13/2020 • 2 minutes to read

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Marketing campaigns are about more than the message being delivered; when and how that message is delivered
is just as important. Without a data-driven, analytical approach, campaigns can easily miss opportunities or
struggle to gain traction.
Through machine learning informed by historical campaign data, this solution architecture helps predict customer
responses and recommends an optimized plan for connecting with your leads, including the best channel to use (email, SMS, a cold call, and so on), the best day of the week, and the best time of day.
Optimizing your campaigns with predictive marketing helps improve both sales leads and revenue generation and
can provide strong ROI for your marketing investment.
This architecture enables efficient handling of big data on Spark with Microsoft R Server.

Architecture

Download an SVG of this architecture.

Components
Microsoft R Server on HDInsight Spark clusters provides distributed and scalable machine learning capabilities
for big data, combining the power of R Server and Apache Spark.
Power BI provides an interactive dashboard with visualization that uses data stored in SQL Server to drive
decisions on the predictions.
Storage Accounts: Azure Storage stores campaign and lead data.
Azure Machine Learning: Machine Learning helps you design, test, operationalize, and manage predictive
analytics solutions in the cloud.
Next steps
Learn more about Spark on HDInsight
Learn more about Power BI
Learn more about Azure storage
Learn more about Azure Machine Learning
Quality Assurance
11/2/2020 • 2 minutes to read

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Quality assurance systems allow businesses to prevent defects throughout their processes of delivering goods or
services to customers. Building such a system that collects data and identifies potential problems along a pipeline
can provide enormous advantages. For example, in digital manufacturing, quality assurance across the assembly
line is imperative. Identifying slowdowns and potential failures before they occur rather than after they are detected
can help companies reduce costs for scrap and rework while improving productivity.
This solution shows how to predict failures using the example of manufacturing pipelines (assembly lines). This is done by leveraging test systems already in place and failure data, specifically looking at returns and functional failures at the end of the assembly line. By combining these with domain knowledge and root cause analysis within a modular design that encapsulates the main processing steps, we provide a generic advanced analytics solution that uses machine learning to predict failures before they happen. Early prediction of future failures allows for less expensive repairs, or even discarding of parts, which is usually more cost-efficient than going through recall and warranty costs.

Architecture

Download an SVG of this architecture.

Description
Save time and let one of these trained SI partners help you with a proof of concept, deployment & integration of
this solution.
The Cortana Intelligence Suite provides advanced analytics tools through Microsoft Azure - data ingestion, data
storage, data processing and advanced analytics components - all of the essential elements for building a Quality
Assurance for Manufacturing solution. The solution is implemented in the cloud, using the flexible on-line Microsoft
Azure platform that decouples infrastructure components (data ingestion, storage, data movement, visualization)
from analytics engine that supports modern DS languages like R and Python. The solution modeling component
can thus be retrained as needed and be implemented using high performance Azure Machine Learning algorithms,
or open source (R/Python) libraries, or from a third-party solution vendor. The 'Deploy' button will launch a
workflow that will deploy an instance of the solution within a Resource Group in the Azure subscription you specify.
The solution includes multiple Azure services (described below) along with a web job that simulates data so that
immediately after deployment you have a working end-to-end solution. For post deployment instructions and
more details on the technical implementation, please see the instructions.

Technical details and workflow


1. The Manufacturing Assembly Line simulation data is streamed by the newly deployed Azure Web Jobs.
2. This synthetic data feeds into the Azure Event Hubs as data points/events, that will be consumed in the rest of
the solution flow and stored in Azure Synapse Analytics.
3. There are two Azure Stream Analytics jobs used in this pattern to provide near real-time analytics on the input stream from the Azure Event Hub. Both jobs filter through the input data and pass the data points along to an Azure Machine Learning endpoint, sending the results to a Power BI dashboard.
4. Finally, Power BI is used for results visualization.
Real-time scoring of Python scikit-learn and deep
learning models on Azure
11/2/2020 • 9 minutes to read

This reference architecture shows how to deploy Python models as web services to make real-time predictions
using Azure Machine Learning. Two scenarios are covered: deploying regular Python models, and the specific
requirements of deploying deep learning models. Both scenarios use the architecture shown.
Two reference implementations for this architecture are available on GitHub, one for regular Python models and
one for deep learning models.

Scenarios
The reference implementations demonstrate two scenarios using this architecture.
Scenario 1: FAQ matching . This scenario shows how to deploy a frequently asked questions (FAQ) matching
model as a web service to provide predictions for user questions. For this scenario, "Input Data" in the architecture
diagram refers to text strings containing user questions to match with a list of FAQs. This scenario is designed for
the scikit-learn machine learning library for Python, but can be generalized to any scenario that uses Python
models to make real-time predictions.
This scenario uses a subset of Stack Overflow question data that includes original questions tagged as JavaScript,
their duplicate questions, and their answers. It trains a scikit-learn pipeline to predict the match probability of a
duplicate question with each of the original questions. These predictions are made in real time using a REST API
endpoint.
The application flow for this architecture is as follows:
1. The trained model is registered to the machine learning model registry.
2. Machine Learning creates a Docker image that includes the model and scoring script.
3. Azure Machine Learning deploys the scoring image on Azure Kubernetes Service (AKS) as a web service.
4. The client sends an HTTP POST request with the encoded question data.
5. The web service created by Azure Machine Learning extracts the question from the request.
6. The question is sent to the Scikit-learn pipeline model for featurization and scoring.
7. The matching FAQ questions with their scores are returned to the client.
Here is a screenshot of the example app that consumes the results:
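
The behavior described in steps 5 through 7 is defined by an entry script that Azure Machine Learning packages into the scoring image. Below is a minimal sketch of such a script; the registered model name, input field, and pipeline calls are assumptions typical of a scikit-learn deployment, not the exact code of the reference implementation.

# score.py: entry script executed inside the scoring container.
import json

import joblib
from azureml.core.model import Model

def init():
    global pipeline
    # Load the registered scikit-learn pipeline when the container starts.
    model_path = Model.get_model_path("faq-matcher")   # illustrative model name
    pipeline = joblib.load(model_path)

def run(raw_data):
    # raw_data is the JSON body of the HTTP POST request.
    question = json.loads(raw_data)["question"]
    scores = pipeline.predict_proba([question])[0]
    return json.dumps({"scores": scores.tolist()})
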

Scenario 2: Image classification . This scenario shows how to deploy a Convolutional Neural Network (CNN)
model as a web service to provide predictions on images. For this scenario, "Input Data" in the architecture
diagram refers to image files. CNNs are very effective in computer vision for tasks such as image classification
and object detection. This scenario is designed for the frameworks TensorFlow, Keras (with the TensorFlow back
end), and PyTorch. However, it can be generalized to any scenario that uses deep learning models to make real-
time predictions.
This scenario uses a pre-trained ResNet-152 model trained on ImageNet-1K (1,000 classes) dataset to predict
which category (see figure below) an image belongs to. These predictions are made in real time using a REST API
endpoint.
The application flow for the deep learning model is as follows:
1. The deep learning model is registered to the machine learning model registry.
2. Azure Machine Learning creates a docker image including the model and scoring script.
3. Azure Machine Learning deploys the scoring image on Azure Kubernetes Service (AKS) as a web service.
4. The client sends an HTTP POST request with the encoded image data.
5. The web service created by Azure Machine Learning preprocesses the image data and sends it to the model for
scoring.
6. The predicted categories with their scores are returned to the client.
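
From the client's perspective, step 4 is a plain HTTP POST against the scoring endpoint. The sketch below assumes a JSON payload carrying a base64-encoded image and key-based authorization; the exact payload format, URI, and authentication scheme depend on the scoring script and service configuration.

import base64
import json

import requests

scoring_uri = "http://<service-address>/api/v1/service/image-classifier/score"  # illustrative
api_key = "<service-key>"

with open("test.jpg", "rb") as f:
    payload = json.dumps({"image": base64.b64encode(f.read()).decode("utf-8")})

headers = {"Content-Type": "application/json",
           "Authorization": f"Bearer {api_key}"}
response = requests.post(scoring_uri, data=payload, headers=headers)
print(response.json())
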

Architecture
This architecture consists of the following components.
Azure Machine Learning is a cloud service that is used to train, deploy, automate and manage machine
learning models, all at the broad scale that the cloud provides. It is used in this architecture to manage the
deployment of models as well as authentication, routing, and load balancing of the web service.
Virtual machine (VM). The VM is shown as an example of a device, local or in the cloud, that can send an
HTTP request.
Azure Kubernetes Service (AKS) is used to deploy the application on a Kubernetes cluster. AKS simplifies the
deployment and operations of Kubernetes. The cluster can be configured using CPU-only VMs for regular Python
models or GPU-enabled VMs for deep learning models.
Azure Container Registry enables storage of images for all types of Docker container deployments including
DC/OS, Docker Swarm and Kubernetes. The scoring images are deployed as containers on Azure Kubernetes
Service and used to run the scoring script. The image used here is created by Machine Learning from the trained
model and scoring script, and then is pushed to the Azure Container Registry.

Performance considerations
For real-time scoring architectures, throughput performance becomes a dominant consideration. For regular
Python models, it's generally accepted that CPUs are sufficient to handle the workload.
However, for deep learning workloads, when speed is a bottleneck, GPUs generally provide better performance compared to CPUs. To match GPU performance using CPUs, a cluster with a large number of CPUs is usually
needed.
You can use CPUs for this architecture in either scenario, but for deep learning models, GPUs provide significantly
higher throughput values compared to a CPU cluster of similar cost. AKS supports the use of GPUs, which is one
advantage of using AKS for this architecture. Also, deep learning deployments typically use models with a high
number of parameters. Using GPUs prevents contention for resources between the model and the web service,
which is an issue in CPU-only deployments.

Scalability considerations
For regular Python models, where the AKS cluster is provisioned with CPU-only VMs, take care when scaling out
the number of pods. The goal is to fully utilize the cluster. Scaling depends on the CPU requests and limits defined
for the pods. Machine Learning through Kubernetes also supports pod autoscaling based on CPU utilization or
other metrics. The cluster autoscaler (in preview) can scale agent nodes based on the pending pods.
For deep learning scenarios, using GPU-enabled VMs, resource limits on pods are such that one GPU is assigned
to one pod. Depending on the type of VM used, you must scale the nodes of the cluster to meet the demand for
the service. You can do this easily using the Azure CLI and kubectl.

Monitoring and logging considerations


AKS monitoring
For visibility into AKS performance, use the Azure Monitor for containers feature. It collects memory and
processor metrics from controllers, nodes, and containers that are available in Kubernetes through the Metrics
API.
While deploying your application, monitor the AKS cluster to make sure it's working as expected, all the nodes are
operational, and all pods are running. Although you can use the kubectl command-line tool to retrieve pod status,
Kubernetes also includes a web dashboard for basic monitoring of the cluster status and management.
To see the overall state of the cluster and nodes, go to the Nodes section of the Kubernetes dashboard. If a node is
inactive or has failed, you can display the error logs from that page. Similarly, go to the Pods and Deployments
sections for information about the number of pods and status of your deployment.
AKS logs
AKS automatically logs all stdout/stderr to the logs of the pods in the cluster. Use kubectl to see these and also
node-level events and logs. For details, see the deployment steps.
Use Azure Monitor for containers to collect metrics and logs through a containerized version of the Log Analytics
agent for Linux, which is stored in your Log Analytics workspace.

Security considerations
Use Azure Security Center to get a central view of the security state of your Azure resources. Security Center
monitors potential security issues and provides a comprehensive picture of the security health of your
deployment, although it doesn't monitor AKS agent nodes. Security Center is configured per Azure subscription.
Enable security data collection as described in Onboard your Azure subscription to Security Center Standard.
When data collection is enabled, Security Center automatically scans any VMs created under that subscription.
Operations. To sign in to an AKS cluster using your Azure Active Directory (Azure AD) authentication token,
configure AKS to use Azure AD for user authentication. Cluster administrators can also configure Kubernetes role-
based access control (RBAC) based on a user's identity or directory group membership.
Use RBAC to control access to the Azure resources that you deploy. RBAC lets you assign authorization roles to
members of your DevOps team. A user can be assigned to multiple roles, and you can create custom roles for
even more fine-grained permissions.
HTTPS. As a security best practice, the application should enforce HTTPS and redirect HTTP requests. Use an
ingress controller to deploy a reverse proxy that terminates SSL and redirects HTTP requests. For more
information, see Create an HTTPS ingress controller on Azure Kubernetes Service (AKS).
Authentication. This solution doesn't restrict access to the endpoints. To deploy the architecture in an enterprise
setting, secure the endpoints through API keys and add some form of user authentication to the client application.
Container registry. This solution uses Azure Container Registry to store the Docker image. The code that the
application depends on, and the model, are contained within this image. Enterprise applications should use a
private registry to help guard against running malicious code and to help keep the information inside the
container from being compromised.
DDoS protection. Consider enabling DDoS Protection Standard. Although basic DDoS protection is enabled as
part of the Azure platform, DDoS Protection Standard provides mitigation capabilities that are tuned specifically to
Azure virtual network resources.
Logging. Use best practices before storing log data, such as scrubbing user passwords and other information that
could be used to commit security fraud.

Cost considerations
Use the Azure pricing calculator to estimate costs. Here are some other considerations.
For more information, see the Cost section in Microsoft Azure Well-Architected Framework.
Azure Machine Learning
In this reference architecture, a large portion of cost is driven by compute resources. For the purposes of
experimentation and training, Azure Machine Learning is free. You are only charged for the compute used by the
web service. Use the Azure pricing calculator to estimate your compute costs.
Azure Container Registry
Azure Container Registry offers the Basic, Standard, and Premium tiers. Choose a tier depending on the storage you need. Choose Premium if you need geo-replication or enhanced throughput for Docker pulls across concurrent nodes. In addition, standard networking charges apply. For more information, see Azure Container Registry pricing.
Azure Kubernetes Service
You pay only for the virtual machine instances, storage, and networking resources consumed by your Kubernetes
cluster. To estimate the cost of the required resources, see the Container Services calculator.

DevOps considerations
In this architecture, the scoring images are created by Azure Machine Learning from the trained model and deployed as containers on AKS. You can integrate the entire architecture into a release pipeline for model management and operationalization, including DevOps tasks for data sanity testing, model training on different compute targets, model version management, model deployment as a real-time web service, staged deployment to QA/production environments, and integration and functional testing. The Machine learning operationalization (MLOps) for Python models
using Azure Machine Learning reference architecture shows how to implement a continuous integration (CI),
continuous delivery (CD), and retraining pipeline for an AI application using Azure DevOps and Azure Machine
Learning.

Deployment
To deploy this reference architecture, follow the steps described in the GitHub repos:
Regular Python models
Deep learning models
Real-time scoring of R machine learning models on
Azure
11/2/2020 • 5 minutes to read

This reference architecture shows how to implement a real-time (synchronous) prediction service in R using
Microsoft Machine Learning Server running in Azure Kubernetes Service (AKS). This architecture is intended to be
generic and suited for any predictive model built in R that you want to deploy as a real-time service. Deploy this
solution.

Architecture

This reference architecture takes a container-based approach. A Docker image is built containing R, as well as the
various artifacts needed to score new data. These include the model object itself and a scoring script. This image is
pushed to a Docker registry hosted in Azure, and then deployed to a Kubernetes cluster, also in Azure.
The architecture of this workflow includes the following components.
Azure Container Registry is used to store the images for this workflow. Registries created with Container
Registry can be managed via the standard Docker Registry V2 API and client.
Azure Kubernetes Service is used to host the deployment and service. Clusters created with AKS can be
managed using the standard Kubernetes API and client (kubectl).
Plumber is used to define the REST API for the service. Plumber is an open-source package for exposing R
code, such as predictive models, via a REST API.
Traefik provides a middleware layer for basic authentication and TLS encryption.

Performance considerations
Machine learning workloads tend to be compute-intensive, both when training and when scoring new data. As a
rule of thumb, try not to run more than one scoring process per core. Machine Learning Server lets you define the
number of R processes running in each container. The default is five processes. When creating a relatively simple
model, such as a linear regression with a small number of variables, or a small decision tree, you can increase the
number of processes. Monitor the CPU load on your cluster nodes to determine the appropriate limit on the
number of containers.
A GPU-enabled cluster can speed up some types of workloads, and deep learning models in particular. Not all
workloads can take advantage of GPUs — only those that make heavy use of matrix algebra. For example, tree-
based models, including random forests and boosting models, generally derive no advantage from GPUs.
Some model types such as random forests are massively parallelizable on CPUs. In these cases, speed up the
scoring of a single request by distributing the workload across multiple cores. However, doing so reduces your
capacity to handle multiple scoring requests given a fixed cluster size.
In general, open-source R models store all their data in memory, so ensure that your nodes have enough memory
to accommodate the processes you plan to run concurrently. If you are using Machine Learning Server to fit your
models, use the libraries that can process data on disk, rather than reading it all into memory. This can help reduce
memory requirements significantly. Regardless of whether you use Machine Learning Server or open-source R,
monitor your nodes to ensure that your scoring processes are not memory-starved.

Security considerations
Network encryption
In this reference architecture, HTTPS is enabled for communication with the cluster, and a staging certificate from
Let's Encrypt is used. For production purposes, substitute your own certificate from an appropriate signing
authority.
Authentication and authorization
In this architecture, access to the AKS cluster endpoint is secured using HTTP basic authentication. This allows
anybody who knows the username and password to access the endpoint. It's strongly recommended to provide an
additional layer of authentication on top of this, for example with Azure API Management.
Traffic between Container Registry and AKS is authenticated using role-based access control (RBAC) to limit access
privileges to only those needed.
Separate storage
This reference architecture bundles the application (R) and the data (model object and scoring script) into a single
image. In some cases, it may be beneficial to separate these. You can place the model data and code into Azure
blob or file storage, and retrieve them at container initialization. In this case, ensure that the storage account is set
to allow authenticated access only and require HTTPS.

Monitoring and logging considerations


Use the Kubernetes dashboard to monitor the overall status of your AKS cluster. See the cluster's overview blade in
Azure portal for more details. The GitHub resources also show how to bring up the dashboard from R.
Although the dashboard gives you a view of the overall health of your cluster, it's also important to track the status
of individual containers. To do this, enable Azure Monitor Insights from the cluster overview blade in Azure portal,
or consider Azure Monitor for containers (in preview).

Cost considerations
The main cost consideration in this architecture is the Kubernetes cluster's compute resources. The cluster must be
large enough to handle the expected request volume at peak times, but this approach leaves resources idle at other
times. To limit the impact of idle resources, enable the horizontal autoscaler for the cluster using the kubectl tool, or
use the AKS cluster autoscaler.
Software alternatives
An alternative to Plumber is Microsoft Machine Learning Server, which provides a model operationalization
feature for exposing predictive models via REST APIs. MMLS has the advantage of providing built-in authentication,
which may reduce or eliminate the need for a separate authentication layer. Note however that MMLS is
commercial software, licensed on a per-core basis. If you are an enterprise Machine Learning Server or Microsoft
SQL Server customer, contact your Microsoft representative for pricing details.
Nginx and Cert-Manager can be used rather than Traefik to provide the middleware layer.
This architecture provides a pure R experience. It doesn't use the Python-oriented Azure Machine Learning
(AzureML SDK), which is a mature cloud service for developing AI solutions at scale. AzureML SDK provides an
easy path for developing and deploying containerized scoring scripts. We provide an alternative solution that
shows how to use Conda to install R and R packages to leverage AzureML SDK via the rpy2 Python package. The
value of using this alternative to operationalize R models is that you don't need to know Flask, and you can reuse
AzureML SDK expertise, which can be useful for teams that are comfortable using both R and Python languages
for their data science projects. The implementation of this alternative architecture is available on GitHub.

Deploy the solution


The reference implementation of this architecture is available on GitHub. Follow the steps described there to
deploy a simple predictive model as a service.
Remote Patient Monitoring Solutions
11/13/2020 • 2 minutes to read

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Predict and prevent future incidents by combining IoT and intelligence to optimize treatments, using Azure to
remotely monitor patients and analyze the massive amounts of data generated by medical devices.

Architecture

Download an SVG of this architecture.

Data Flow
1. Securely ingest medical sensor and device data using Azure IoT Hub.
2. Securely store sensor and device data in Cosmos DB.
3. Analyze sensor and device data using a pre-trained Cognitive Services API or a custom developed Machine
Learning model.
4. Store Artificial Intelligence (AI) and Machine Learning results in Cosmos DB.
5. Interact with AI and Machine Learning results using Power BI, while preserving role-based access control (RBAC).
6. Integrate data insights with backend systems and processes using Logic Apps.
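
For step 1, a medical device or gateway typically sends telemetry with the Azure IoT device SDK. The following minimal Python sketch illustrates the idea; the device connection string and message fields are placeholders.

import json

from azure.iot.device import IoTHubDeviceClient, Message

CONNECTION_STRING = "<device connection string>"   # from the IoT Hub device registry

client = IoTHubDeviceClient.create_from_connection_string(CONNECTION_STRING)
client.connect()

reading = {"deviceId": "wearable-01", "heartRate": 72, "spo2": 98}  # illustrative payload
client.send_message(Message(json.dumps(reading)))
client.disconnect()
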

Components
Azure IoT Hub: Connect, monitor and manage billions of IoT assets
Security Center: Unify security management and enable advanced threat protection across hybrid cloud
workloads
Cognitive Services: Get started with quickstarts, samples, and tutorials
Key Vault: Safeguard and maintain control of keys and other secrets
Logic Apps: Automate the access and use of data across clouds without writing code
Power BI Embedded: Embed fully interactive, stunning data visualizations in your applications
Azure Cosmos DB: Globally distributed, multi-model database for any scale
Application Insights: Detect, triage, and diagnose issues in your web apps and services
Azure Machine Learning: Build, deploy, and manage predictive analytics solutions
Azure Monitor: Full observability into your applications, infrastructure, and network

Next steps
IoT Hub Documentation
Azure Security Center Documentation
Get Started with Azure
What is Azure Key Vault?
Azure Logic Apps Documentation
Power BI Embedded Documentation
Azure Cosmos DB Documentation
Application Insights Documentation
Azure Machine Learning Documentation
Azure Monitor Documentation
Retail Assistant with Visual Capabilities using
Cognitive Services
11/2/2020 • 2 minutes to read

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

The retail assistant or vacation planner can help your customers interact with your business bot and provides suggestions based on visual information.

Architecture

Download an SVG of this architecture.

Data Flow
1. Users interact with your business assistant
2. Assistant understands context from LUIS
3. Assistant passes visual context to the Bing Visual Search API
4. Optionally Bot gets more information for user queries exclusively from your domain using the Bing Custom
Search API
Retail product recommendations
11/2/2020 • 2 minutes to read

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

A deep understanding of customer interests and purchasing patterns is a critical component of any retail
business intelligence operation. This solution implements a process of aggregating customer data into a complete
profile, and uses advanced machine learning models backed by the reliability and processing power of Azure to
provide predictive insights on simulated customers.

Architecture

Download an SVG of this architecture.

Description
For more details on how this solution is built, visit the solution guide in GitHub.
A typical retail business collects customer data through a variety of channels, including web-browsing patterns,
purchase behaviors, demographics, and other session-based web data. Some of the data originates from core
business operations, but other data must be pulled and joined from external sources like partners, manufacturers,
public domain, etc.
Many businesses leverage only a small portion of the available data, but in order to maximize ROI, a business must
integrate relevant data from all sources. Traditionally, the integration of external, heterogeneous data sources into a
shared data processing engine has required significant effort and resources to set up. This solution describes a
simple, scalable approach to integrating analytics and machine learning to predict customer purchasing activity.
This solution addresses the above problems by:
Uniformly accessing data from multiple data sources while minimizing data movement and system complexity
in order to boost performance.
Performing ETL and feature engineering needed to use a predictive Machine Learning model.
Creating a comprehensive customer 360 profile enriched by predictive analytics running across a distributed
system backed by Microsoft R Server and Azure HDInsight.

Data Flow
1. A Data Generator pipes simulated customer events to an Event Hub
2. A Stream Analytics job reads from the EventHub, performs aggregations
3. Stream Analytics persists time-grouped data to an Azure Storage Blob
4. A Spark job running in HDInsight merges the latest customer browsing data with historical purchase and
demographic data to build a consolidated user profile
5. A second Spark job scores each customer profile against a machine learning model to predict future purchasing
patterns (in other words, is a given customer likely to make a purchase in the next 30 days, and if so, in which
product category?)
6. Predictions and other profile data are visualized and shared as charts and tables in Power BI Online
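
The published solution implements steps 4 and 5 with Microsoft R Server on HDInsight. Purely as an illustration of the same idea, the PySpark sketch below joins recent browsing aggregates with historical profile data and scores the result with a previously trained Spark ML pipeline; all paths and column names are assumptions.

from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("customer-360-scoring").getOrCreate()

# Hypothetical storage paths for aggregated browsing events and historical profiles.
browsing = spark.read.parquet("wasbs://events@<account>.blob.core.windows.net/browsing/")
history = spark.read.parquet("wasbs://profiles@<account>.blob.core.windows.net/history/")

profiles = (browsing.groupBy("customerId")
                    .agg(F.count("*").alias("sessions_30d"))
                    .join(history, "customerId", "left"))

# Score each consolidated profile with a trained pipeline (features + classifier).
model = PipelineModel.load("wasbs://models@<account>.blob.core.windows.net/purchase-propensity")
scored = model.transform(profiles).select("customerId", "prediction", "probability")
scored.write.mode("overwrite").parquet("wasbs://output@<account>.blob.core.windows.net/scored/")
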
Scalable personalization on Azure
11/2/2020 • 7 minutes to read

Recommendations are a main revenue driver for many businesses and are used in different kinds of industries,
including retail, news, and media. With the availability of large amounts of data, you can now provide highly
relevant recommendations using machine learning.
There are two main types of recommendation systems: collaborative filtering and content-based. Collaborative
filtering identifies similar patterns in customer behavior and recommends items that other similar customers have
interacted with. Content-based recommendation uses information about the items to learn customer preferences
and recommends items that share properties with items that a customer has previously interacted with. The
approach described in this document focuses on the content-based recommendation system.
This example scenario shows how your business can use machine learning to automate content-based
personalization for your customers. At a high level, we use Azure Databricks to train a model that predicts the
probability a user will engage with an item. That model is deployed to production as a prediction service using
Azure Kubernetes Service. In turn, you can use this prediction to create personalized recommendations by ranking
items based on the content that a user is most likely to consume.

Relevant use cases


This scenario is relevant to the following use cases:
Content recommendations on a website or in a mobile application.
Product recommendation on an e-commerce site.
Displayed ad recommendation on a website.

Architecture

This scenario covers the training, evaluation, and deployment of a machine learning model for content-based
personalization on Apache Spark using Azure Databricks. In this case, a model is trained with a supervised
classification algorithm on a dataset containing user and item features. The label for each example is a binary value
indicating that the user engaged with (for example, clicked) an item. This scenario covers a subset of the steps
required for a full end-to-end recommendation system workload. The broader context of this scenario is based on a
generic e-commerce website with a front end that serves rapidly changing content to its users. This website uses
cookies and user profiles to personalize the content for that user. Along with user profiles, the website may have
information about every item it serves to each user. Once that data is available, the following steps are executed to
build and operationalize a recommendation system:
1. The sets of distinct user and item data are preprocessed and joined, which results in a mixture of numeric and
categorical features to be used for predicting user-item interactions (clicks). This table is uploaded to Azure Blob
Storage. For demonstration purposes, the Criteo display advertising challenge dataset is used. This dataset
matches the described anonymized table, as it contains a binary label for observed user clicks, 13 numerical
features, and an additional 26 categorical features.
2. The MMLSpark library provides the ability to train a LightGBM classifier on Azure Databricks to predict the click
probability as a function of the numeric and categorical features that were created in the previous step.
LightGBM is a highly efficient machine learning algorithm, and MMLSpark enables distributed training of
LightGBM models over large datasets.
3. The trained classifier is serialized and stored in the Azure Model Registry. With Azure Model Registry, you can
store and organize different versions of the model (for example, based on newer data or different
hyperparameters) within an Azure Machine Learning (Azure ML) Workspace.
4. A serving script is defined using the MML Spark Serving library to provide predictions from the trained model.
5. Azure ML is used to create a Docker image in the Azure Container Registry that holds the image with the scoring
script and all necessary dependencies for serving predictions.
6. Azure ML is also used to provision the compute for serving predictions. A Kubernetes cluster is configured using
Azure Kubernetes Service (AKS) with the number of nodes needed to handle expected load. The virtual machine
size can be adjusted based on the model's computation and memory requirements.
7. The scoring service is deployed as a web service on the AKS cluster. The service provides an endpoint where
user and item features can be sent to receive the predicted probability of a click for that user and item.
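As a rough illustration of step 2, the following hedged Python sketch trains a LightGBM click-probability model with MMLSpark on Azure Databricks. The import path reflects MMLSpark releases current at the time of writing, and the input DataFrame, storage path, and column names are assumptions rather than the exact code of the reference implementation.

from mmlspark.lightgbm import LightGBMClassifier  # import path varies by MMLSpark release

# Assumed: a preprocessed DataFrame with a "features" vector column and a binary "label" column.
feature_df = spark.read.parquet("dbfs:/mnt/criteo/processed")  # hypothetical path

train_df, test_df = feature_df.randomSplit([0.8, 0.2], seed=42)

lgbm = LightGBMClassifier(
    objective="binary",
    featuresCol="features",
    labelCol="label",
    numIterations=100,
    learningRate=0.1,
)

model = lgbm.fit(train_df)
predictions = model.transform(test_df)  # adds probability and prediction columns for evaluation

The actual training and operationalization notebooks are linked in the deployment walkthrough later in this article.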
Components
This architecture makes use of the following components:
Azure Blob Storage is a storage service optimized for storing massive amounts of unstructured data. In this case,
the input data is stored here.
Azure Databricks is a managed Apache Spark cluster for model training and evaluation. We also use MMLSpark,
a Spark-based framework designed for large-scale machine learning.
Azure Container Registry is used to package the scoring script as a container image, which is used to serve the
model in production.
Azure Kubernetes Service is used to deploy the trained model to web or app services.
Azure Machine Learning is used in this scenario to register the machine learning model and to deploy AKS.
Microsoft Recommenders is an open-source repository that contains utility code and samples. With this
repository, users can start to build, evaluate, and operationalize a recommender system.

Considerations
Scalability
For training, you can scale Azure Databricks up or down based on the size of the data used and the compute
necessary for model training. To scale, you can adjust the total number of cores or amount of memory available to
the cluster. Just edit the number or type of Virtual Machines (VMs) used. The Criteo dataset contains 45.8 million
rows in this example; it was trained in a few minutes on a cluster with 10 standard L8s VMs.
For deployment, you can scale the compute resources based on the expected load for the scoring service and
latency requirements. The scoring service uses MML Spark Serving running separately on each node in the
Kubernetes cluster. With this practice, you can seamlessly transfer the feature transformation and model prediction
pipeline developed on Azure Databricks to the production side. It also removes the need to precompute scores for
all possible user and item combinations, which might be difficult if you're using dynamic user features, such as time
of day.
Availability
Machine learning tasks are split into two resource components: resources for training, and resources for production
deployment. Resources required for training generally don't need high availability, as live production requests don't
directly hit these resources. Resources required for serving need to have high availability to serve customer
requests.
Training on Azure Databricks can happen on any one of the regions with a service level agreement (SLA) to support
your needs. For production deployment, Azure Kubernetes Service is used to provide broad geographic availability
with this SLA.
Security
This scenario can use Azure Active Directory (Azure AD) to authenticate users to the Azure Databricks workspace
and the Azure Kubernetes cluster. Permissions can be managed via Azure AD authentication or role-based access
control.

Deploy this scenario


Prerequisites
You must have an existing Azure account.
Walkthrough
All the code for this scenario is available in the Microsoft Recommenders repository.
To run the notebooks for training and deploying the recommendation model on Azure Databricks:
1. Create an Azure Databricks workspace from the Azure portal.
2. Follow the setup instructions to install utilities from the Microsoft Recommenders repository on a cluster within
your workspace.
a. Include the --mmlspark option in the install script to have MMLSpark installed.
b. Also, MMLSpark requires autoscaling to be disabled in the Cluster setup.
3. Import the training notebook into your workspace. After logging into your Azure Databricks Workspace:
a. Select Home on the left side of the workspace.
b. Right-click whitespace in your home directory.
c. Select Import.
d. Select URL, and paste the following string into the text field:
https://aka.ms/recommenders/lgbm-criteo-training.
e. Select Import.
4. Repeat step 3 for the operationalization notebook here: https://aka.ms/recommenders/lgbm-criteo-o16n.
5. Select the notebook to open it, attach the configured cluster, and execute the notebook.

Pricing
To better understand the cost of running this scenario on Azure, we provide a pricing estimator based on the
following assumptions:
Training data is of the same scale as the example dataset used (45.8 million rows).
Training needs to happen daily to update the serving model.
Training will occur on Azure Databricks using a cluster provisioned with 12 VMs using L8s instances.
Training will take an hour, including feature processing and model training plus validation.
Azure Machine Learning will be used to deploy the model to AKS with a small three-node cluster using D3
instances.
AKS cluster will autoscale as needed, resulting in two nodes per month being active on average.
To see how pricing differs for your use case, change the variables to match your expected data size and serving load
requirements. For larger or smaller training data sizes, the size of the Databricks cluster can be increased or
reduced, respectively. To handle larger numbers of concurrent users during model serving, the AKS cluster should
be increased. For more information on scaling AKS to support latency and load requirements, review the
operationalization notebook.

Next steps
For an in-depth guide to building and scaling a recommender service, see Build a real-time recommendation
API on Azure.
To see more examples, tutorials, and tools to help you build your own recommendation system visit the
Microsoft Recommenders GitHub repository.
Speech Services
11/2/2020 • 2 minutes to read

Solution Idea

With Speech Services, it's easy to transcribe every call. Index the transcription for full-text search, or apply Text
Analytics to detect sentiment, language, and key phrases for insights. If your call center recordings involve
specialized terminology, such as product names or IT jargon, create a custom language model to teach Speech
Services the vocabulary. A custom acoustic model helps Speech Services understand speakers even with
background noise or poor phone connections.
For more information, read how batch transcription works with Speech Services.

Architecture

Download an SVG of this architecture.

Data Flow
1. Adapt a model for your domain and deploy that model
2. Upload your recordings to a blob container
3. Create a POST request to batch transcription
4. Speech Services schedules the transcription job
5. Stereo files are split into two channels
6. Mono files undergo diarization to distinguish between speakers
7. Download the transcription using the transcription ID
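The sketch below shows what step 3 might look like with the Python requests library, assuming the v3.0 batch transcription REST API. The region, key, recording URL, and property names are placeholders rather than values from a specific deployment.

import requests

SPEECH_KEY = "<speech-resource-key>"  # placeholder
REGION = "<region>"                   # placeholder, for example westus2
ENDPOINT = f"https://{REGION}.api.cognitive.microsoft.com/speechtotext/v3.0/transcriptions"

body = {
    "displayName": "call-center-batch",
    "locale": "en-US",
    "contentUrls": ["https://<storage-account>.blob.core.windows.net/recordings/call1.wav?<sas>"],
    "properties": {"diarizationEnabled": True, "wordLevelTimestampsEnabled": True},
}

response = requests.post(
    ENDPOINT,
    headers={"Ocp-Apim-Subscription-Key": SPEECH_KEY, "Content-Type": "application/json"},
    json=body,
)
response.raise_for_status()
print("Transcription job:", response.json()["self"])  # URL to poll for status and result files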
Speech transcription with Azure Cognitive Services
11/2/2020 • 8 minutes to read

Customer care centers are an integral part of the success of many businesses. You can improve the efficiency of
your call centers by using speech AI. Speech recognition and the analysis of high volumes of recorded customer
calls can provide your business with valuable information about current trends, product shortcomings, and
successes. Enterprise solutions that use the Speech APIs of Azure Cognitive Services can be implemented to
consume and process such high volumes of discrete data.
The reference architecture described in this article shows how to build an audio ingestion and speech-to-text
transcription pipeline for customer care centers. This pipeline processes batches of recorded audio files and stores
the transcribed text files in Azure Blob storage. This architecture doesn't implement real-time speech processing.
This pipeline can later feed into the next phase of your speech AI implementation. In that phase, you can process
transcribed text to recognize and remove sensitive information, perform sentiment analysis, and so on.
The reference implementation for this architecture is available on GitHub.

Architecture

You can implement this architecture by using your Azure account and allow client applications access to the
pipeline through REST APIs. The application goes through a three-step process to upload an audio file:
1. It authenticates by using Azure Active Directory (Azure AD). This step is required for the first file upload.
2. It calls the REST API to get the shared access signature (SAS) token required to access Azure Blob storage.
3. It uploads the audio files to a blob container.
The reference client application uses JavaScript to upload the files, as shown in this example. After the file is
uploaded, an Azure Event Grid trigger is generated that invokes an Azure function. The function processes the file
by using the Azure Cognitive Services Speech APIs. The transcribed text is stored in a separate blob container, ready
for consumption into the next phase of the pipeline: speech analysis and storage in a database.
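The following Python sketch outlines the same three-step client flow. The API Management route, response shape, and file names are hypothetical; only the Blob storage upload call comes from the azure-storage-blob SDK.

import requests
from azure.storage.blob import BlobClient

APIM_SAS_ENDPOINT = "https://<apim-name>.azure-api.net/upload/sas"  # hypothetical API Management route
aad_token = "<azure-ad-access-token>"  # acquired with MSAL or similar (step 1, not shown)

# Step 2: call the REST API behind API Management to get a SAS URL for the upload.
sas_response = requests.post(
    APIM_SAS_ENDPOINT,
    headers={"Authorization": f"Bearer {aad_token}"},
    json={"fileName": "call-recording-001.wav"},
)
sas_response.raise_for_status()
blob_sas_url = sas_response.json()["sasUrl"]  # response shape is an assumption

# Step 3: upload the audio file directly to the blob container using the SAS URL.
with open("call-recording-001.wav", "rb") as audio:
    BlobClient.from_blob_url(blob_sas_url).upload_blob(audio, overwrite=True)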
The architecture uses these Azure services:
Azure Blob storage stores objects in the cloud. Blob storage is optimized for storing massive amounts of
unstructured data, like text or binary data. Because sensitive information might be saved in the blob, you need to
secure its access by using authentication methods like SAS keys.
Azure Event Grid provides built-in support for efficient event-driven architectures on Azure. When the audio file
upload is complete, Event Grid triggers a BlobCreated event for the transcription function.
Azure Functions provides event-driven compute capabilities without requiring you to build the infrastructure. The
function in this reference architecture transcribes the speech audio files to text. The model is a serverless model,
which means the Consumption plan is used to host the function.
Azure Cognitive Services is a collection of APIs that can help developers build intelligent applications without
extensive AI or data science skills. The transcription function calls the Cognitive Services speech-to-text APIs. The
output for a sample audio file transcription might look similar to the following metadata:
ResultId: 19e70bee8b5348a6afb67817825a9586
Reason: RecognizedSpeech
Recognized text: <Text for sample audio.>
Json: {"DisplayText":"Text for sample audio.","Duration":53700000,"Id":"28526a6304da4af1922fedd4edcdddbb","Offset":3900000,"RecognitionStatus":"Success"}
Azure API Management provides secure access to REST APIs. Because only clients authenticated via API
Management can request a SAS token, this service provides an additional layer of security in the architecture.
Azure Active Directory provides identity management and secured access to resources on the Azure cloud
platform. The client in this architecture first needs to authenticate by using Azure AD to be able to access the REST
API. The REST API creates the access token for the blob storage by using the Azure AD credentials of the business
owner. Role-based access control gives the client the minimum access privileges needed to upload audio files.
Azure Key Vault provides secure storage of secrets and keys. This reference architecture stores the account
credentials and other secrets needed to generate the SAS tokens in the key vault. The REST APIs and the speech
transcription function access this vault to retrieve the secrets.

Scalability considerations
Azure Blob storage
Scalability during upload
To create a high-performing and cost-effective scalable solution, this reference architecture uses the Valet Key
design pattern. The client application is responsible for the actual data upload. The SAS token restricts access to
blob storage. The client needs to first acquire this token by using the REST API. The API in the reference
implementation generates a user delegate SAS token that's created by using the Azure Active Directory credentials
of the business owner. For most scenarios, this method is more secure and is preferred over SAS tokens created by
using an account key. For more information on SAS tokens, see Types of shared access signatures.
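As an illustration of this pattern, the sketch below generates a user delegation SAS with the azure-storage-blob and azure-identity SDKs, assuming the API's identity has the required data-plane role; the account, container, and blob names are placeholders.

from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.storage.blob import (BlobSasPermissions, BlobServiceClient,
                                generate_blob_sas)

ACCOUNT_NAME = "<storage-account>"  # placeholder
account_url = f"https://{ACCOUNT_NAME}.blob.core.windows.net"

service = BlobServiceClient(account_url, credential=DefaultAzureCredential())

start = datetime.now(timezone.utc)
expiry = start + timedelta(hours=1)
delegation_key = service.get_user_delegation_key(start, expiry)

sas_token = generate_blob_sas(
    account_name=ACCOUNT_NAME,
    container_name="audio-uploads",                          # placeholder container
    blob_name="call-recording-001.wav",                      # placeholder blob
    user_delegation_key=delegation_key,
    permission=BlobSasPermissions(create=True, write=True),  # write-only access for upload clients
    expiry=expiry,
)
sas_url = f"{account_url}/audio-uploads/call-recording-001.wav?{sas_token}"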
Scalability for file size
The reference architecture allows large audio files to be uploaded to the cloud by dividing them into 4-KB chunks.
Chunking is a common technique used to upload large blobs, as discussed in this article. The maximum file size
that can be uploaded is dictated by the maximum size limit of the blob, which can be as much as 4.77 TB.
Scalability for storage
Azure Blob storage can throttle service requests per blob or per storage account. The blob-level throttling limits
might not be a concern in this scenario because every uploaded file corresponds to a single blob. However, multiple
clients uploading multiple files to a single storage account might exceed the account's limits. If that's a possibility,
consider using multiple storage accounts and partitioning the data objects across them. For a detailed list of
scalability considerations for the blob, see Performance and scalability checklist for Blob storage.
Azure Event Grid
The function that transcribes the audio files is triggered when the upload finishes. This reference architecture uses
an Event Grid trigger instead of the Blob storage trigger. This is because the blob trigger events can be missed as
the number of blobs in a container increases significantly. Missing triggers negatively affect application throughput
and reliability. For more information, see Blob trigger alternatives.
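A minimal Event Grid-triggered function in Python might look like the sketch below. The binding assumes an eventGridTrigger entry in function.json, and the downstream transcription call is only indicated in a comment.

# function.json (not shown) is assumed to contain an "eventGridTrigger" binding named "event".
import logging

import azure.functions as func


def main(event: func.EventGridEvent):
    data = event.get_json()
    blob_url = data.get("url")  # BlobCreated events include the URL of the uploaded blob
    logging.info("Transcription requested for %s", blob_url)
    # The function would then submit the blob to the Speech batch transcription API,
    # as outlined earlier in this article.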
Azure Cognitive Services
Cognitive Services APIs might have request limits based on the subscription tier. Consider containerizing these APIs
to avoid throttling large volume processing. Containers give you flexibility of deployment, whether in the cloud or
on-premises. You can also mitigate side effects of new version rollouts by using containers. See Container support
in Azure Cognitive Services for more information.

Security considerations
Many of the security considerations for a serverless web application apply to this reference architecture. The
following sections discuss considerations specific to this architecture.
Azure Active Directory
The audio files stored in the blob might contain sensitive customer data. If multiple clients are using this solution,
it's important to restrict access to these files. This reference architecture uses SAS tokens to protect these files from
outside attacks. These tokens, called user delegation SAS tokens, are created by using the service owner's Azure AD
credentials.
A SAS token allows you to control:
Which resources clients can access, because it's created per resource.
Which permissions clients have while accessing the resources, via role-based access control. We recommend
that you grant the minimum required permissions. The clients in this architecture have write-only access to the
blobs. This access level prevents them from reading other clients' audio files, either accidentally or maliciously.
When individual tokens expire. This control limits the window of exposure to the token, which limits the
possibility of unauthorized access to the resource. For larger files, the SAS token might expire before the upload
finishes. The client can request multiple tokens for the same file. Because only authenticated clients can do so,
multiple requests of these tokens don't affect overall security.
See Grant limited access to Azure Storage resources using shared access signatures (SAS) for an in-depth
discussion of SAS tokens. See Create a user delegation SAS to learn more about user delegation SAS tokens.
API Management
In addition to restricting access to resources by using SAS tokens, this reference architecture provides another layer
of security by using API Management. Clients need to authenticate by using API Management before they request
SAS tokens. API Management has built-in access controls for the REST API. We recommend this additional layer of
security because the uploaded data might contain sensitive information.
When several clients upload files in parallel, API Management serves multiple purposes:
It enforces usage quotas and rate limits.
It validates OAuth 2.0 tokens for authentication.
It enables cross-origin resource sharing (CORS).
It caches responses.
It monitors and logs requests.

Resiliency considerations
In the case of an extremely large number of events, Event Grid might fail to trigger the function. Such missed
events are typically added to a dead letter container. Consider making the architecture more resilient by adding an
additional supervisor function. This function can periodically wake up on a timer trigger. It can then find and
process missed events, either from the dead letter container or by comparing the blobs in the upload and
transcribe containers.
This pattern is similar to the Scheduler Agent Supervisor pattern. This pattern isn't implemented in this reference
architecture for the sake of simplicity. For more information on how Event Grid handles failures, see the Event Grid
message delivery and retry policies.
Another way to improve resiliency is to use Azure Service Bus instead of Event Grid. This model sequentially
processes file uploads. The client signals Service Bus when an upload finishes. Service Bus then invokes the
function to transcribe the uploaded file. This model is more reliable. However, it will have less throughput than an
event-based architecture. Carefully consider which architecture applies to your scenario and application.

Solution deployment
For information about how to deploy the reference implementation for this architecture, see the GitHub readme.

Next steps
You can process transcribed speech by using the built-in speech analysis features provided by Azure Cognitive
Services. For more information, see the documentation for Speech APIs.
Training of Python scikit-learn and deep learning
models on Azure
11/2/2020 • 6 minutes to read

This reference architecture shows recommended practices for tuning the hyperparameters (training parameters)
of Python models. Two scenarios are covered: hyperparameter tuning of scikit-learn models and deep learning
models with GPUs. Two reference implementations for this architecture are available on GitHub, one for scikit-learn
models and one for deep learning models.

Scenarios
Scenario 1: FAQ matching. The problem addressed here is Frequently Asked Question (FAQ) matching. This
scenario uses a subset of Stack Overflow question data that includes original questions tagged as JavaScript, their
duplicate questions, and their answers. It tunes a scikit-learn pipeline to predict the probability that a duplicate
question matches one of the original questions.
Processing in this Azure Machine Learning pipeline scenario involves the following steps:
1. The training Python script is submitted to Azure Machine Learning.
2. The script runs in Docker containers that are created on each node.
3. That script retrieves training and testing data from Azure Storage.
4. The script trains a model from the training data using its combination of training parameters.
5. The model's performance is evaluated on the testing data, and is written to Azure Storage.
6. The best performing model is registered with Azure Machine Learning.
See also considerations for training deep learning models with GPUs.
Scenario 2: Out-of-stock detection. This scenario shows how to tune an object detection Mask RCNN model
that can be deployed as a web service to provide predictions for empty spaces on store shelves. Images similar to
retailer store shelves filled with products are used to predict empty spaces to help detect out of stock products by
potentially marrying these predictions with other sources of information such as planograms and databases. In
this scenario, only empty space prediction is covered. The dataset used is distributed under the CC-BY 4.0 license.
Processing in this scenario involves the following steps:
1. The training Python script is submitted to Azure Machine Learning.
2. The script runs in Docker containers that are created on each node by pulling a custom image that is stored
on Azure Container Registry.
3. The script trains a model from the training data saved in Azure Storage, using its combination of training
parameters.
4. The model's performance is evaluated on the testing data, and is written to Azure Storage.
5. The best performing model is registered with Azure Machine Learning.
See also considerations for distributed training of deep learning models with GPUs.
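In both scenarios, the tuning job itself is typically submitted as a HyperDrive run with the Azure Machine Learning SDK. The following hedged sketch shows the general shape of that submission; the compute target, script name, parameter names, and primary metric are assumptions and must match what your training script actually logs.

from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace
from azureml.train.hyperdrive import (HyperDriveConfig, PrimaryMetricGoal,
                                      RandomParameterSampling, choice, uniform)

ws = Workspace.from_config()
compute_target = ws.compute_targets["hyperdrive-cluster"]  # assumed Machine Learning Compute cluster name

src = ScriptRunConfig(
    source_directory="scripts",
    script="train.py",                                     # hypothetical training script
    compute_target=compute_target,
    environment=Environment.from_conda_specification("train-env", "environment.yml"),
)

sampling = RandomParameterSampling({
    "--estimators": choice(100, 200, 400),   # hypothetical training parameters
    "--min_child_samples": choice(10, 20, 40),
    "--learning_rate": uniform(0.01, 0.3),
})

hd_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=sampling,
    primary_metric_name="AUC",               # must match a metric logged by train.py
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,
    max_concurrent_runs=4,                   # bounded by the maximum node count of the cluster
)

run = Experiment(ws, "hyperparameter-tuning").submit(hd_config)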

Architecture
This architecture consists of several Azure cloud services that scale resources according to need. The services and
their roles in this solution are described below.
Microsoft Data Science Virtual Machine (DSVM) is a VM image configured with tools used for data analytics and
machine learning. Both Windows Server and Linux versions are available. This deployment uses the Linux editions
of the DSVM on Ubuntu 16.04 LTS.
Azure Machine Learning is used to train, deploy, automate, and manage machine learning models at cloud scale.
It's used in this architecture to manage the allocation and use of the Azure resources described below.
Azure Machine Learning Compute is the resource used to train and test machine learning and AI models at scale in
Azure. The compute target in this scenario is a cluster of nodes that are allocated on demand based on an
automatic scaling option. Each node is a VM that runs a training job for a particular hyperparameter set.
Azure Container Registry stores images for all types of Docker container deployments. These containers are
created on each node and used to run the training Python script. The image used in the Machine Learning
Compute cluster is created by Machine Learning in the local run and hyperparameter tuning notebooks, and then
is pushed to Container Registry.
Azure Blob storage receives the training and test data sets from Machine Learning that are used by the training
Python script. Storage is mounted as a virtual drive onto each node of a Machine Learning Compute cluster.

Performance considerations
Each set of hyperparameters runs on one node of the Machine Learning compute target. For scenario 1, each node
is a Standard D4 v2 VM, which has four cores. This scenario uses a LightGBM classifier for machine learning, a
gradient boosting framework. This software can run on all four cores at the same time, speeding up each run by a
factor of up to four. That way, the whole hyperparameter tuning run takes up to one-quarter of the time it would
take had it been run on a Machine Learning Compute target based on Standard D1 v2 VMs, which have only one
core each. For scenario 2, each node is a Standard NC6 with 1 GPU and each hyperparameter tuning run will use
the single GPU on each node.
The maximum number of Machine Learning Compute nodes affects the total run time. The recommended
minimum number of nodes is zero. With this setting, the time it takes for a job to start up includes some minutes
for auto-scaling at least one node into the cluster. If the hyperparameter tuning runs for a short time, however,
scaling up the job adds to the overhead. For example, a job can run in under five minutes, but scaling up to one
node might take another five minutes. In this case, setting the minimum to one node saves time but adds to the
cost.

Monitoring and logging considerations


Submit a HyperDrive run configuration to return a Run object for use in monitoring the run's progress.
RunDetails Jupyter Widget
Use the Run object with the RunDetails Jupyter widget to conveniently monitor its progress while it's queued and
while its child jobs run. The widget also shows the values of the primary metric that the child runs log in real time.
Azure portal
Print a Run object to display a link to the run's page in Azure portal like this:

Use this page to monitor the status of the run and its child runs. Each child run has a driver log containing the
output of the training script it has run.
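For example, assuming run is the object returned by submitting the HyperDrive configuration, the two monitoring options above reduce to a few lines:

from azureml.widgets import RunDetails

RunDetails(run).show()       # interactive progress widget inside a Jupyter notebook
print(run.get_portal_url())  # prints the link to the run's page in the Azure portal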

Cost considerations
The cost of a hyperparameter tuning run depends linearly on the choice of Machine Learning Compute VM size,
whether low-priority nodes are used, and the maximum number of nodes allowed in the cluster.
Ongoing costs when the cluster is not in use depend on the minimum number of nodes required by the cluster.
With cluster autoscaling, the system automatically adds nodes up to the allowed maximum to match the number
of jobs, and then removes nodes down to the requested minimum as they are no longer needed. If the cluster can
autoscale down to zero nodes, it does not cost anything when it is not in use.

Security considerations
Restrict access to Azure Blob Storage
This architecture uses storage account keys to access the Blob storage. For further control and protection, consider
using a shared access signature (SAS) instead. This grants limited access to objects in storage, without needing to
hard-code the account keys or save them in plaintext. Using a SAS also helps to ensure that the storage account
has proper governance, and that access is granted only to the people intended to have it.
For scenarios with more sensitive data, make sure that all of your storage keys are protected, because these keys
grant full access to all input and output data from the workload.
Encrypt data at rest and in motion
In scenarios that use sensitive data, encrypt the data at rest, that is, the data in storage. Each time data moves from
one location to the next, use SSL to secure the data transfer. For more information, see the Azure Storage security
guide.
Secure data in a virtual network
For production deployments, consider deploying the cluster into a subnet of a virtual network that you specify.
This allows the compute nodes in the cluster to communicate securely with other virtual machines or with an on-
premises network. You can also use service endpoints with blob storage to grant access from a virtual network or
use a single-node NFS inside the virtual network with Azure Machine Learning.

Deployment
To deploy the reference implementation for this architecture, follow the steps described in the GitHub repos:
Regular Python models
Deep learning models

Next steps
To operationalize the model, see Machine learning operationalization (MLOps) for Python models using Azure
Machine Learning.
Vision classifier model with Azure Custom Vision
Cognitive Service
11/2/2020 • 2 minutes to read

Azure Cognitive Services offers many possibilities for Artificial Intelligence (AI) solutions. One of them is Azure
Custom Vision, which allows you to build, deploy, and improve your image classifiers. This architecture uses
Custom Vision to classify images taken by a simulated drone. It provides a way to combine AI and the Internet of
Things (IoT).

Potential use case


Microsoft Search and Rescue Lab suggests a hypothetical use case for Custom Vision. In the lab, you fly a Microsoft
AirSim simulated drone around in a 3D-rendered environment. You use the simulated drone to capture synthetic
images of the animals in that environment. After creating a dataset of images, you use the dataset to train a Custom
Vision classifier model. To train the model, you tag the images with the names of the animals. When you fly the
drone again and capture new images of the animals, this solution identifies the name of the animal in each new image.
In a practical application of the lab, an actual drone replaces the Microsoft AirSim simulated drone. If a pet is lost,
the owner provides images of the pet to the Custom Vision model trainer. Just like in the simulation, the images are
used to train the model to recognize the pet. Then, the drone pilot searches an area where the lost pet might be. As
it finds animals along the way, the drone's camera can capture images and determine if the animal is the lost pet.

Architecture

1. Use AirSim's 3D-rendered environment to capture images with the drone. Use the images as the training
dataset.
2. Import and tag the dataset in a Custom Vision project. The cognitive service trains and tests the model.
3. Export the model into TensorFlow format so you can use it locally.

Components
Microsoft AirSim Drone simulator
Microsoft AirSim Drone simulator is built on the Unreal Engine. It's open-source, cross-platform, and developed to
help AI research. In this architecture, it creates the dataset of images used to train the model.
Azure Custom Vision
Azure Custom Vision is part of Azure Cognitive Services. In this architecture, it creates an image classifier model.
TensorFlow
TensorFlow is an open-source platform for machine learning (ML). It's a tool that helps you develop and train ML
models. When you export your model to TensorFlow format, you'll have a protocol buffer file with the Custom
Vision model that you can use locally in your script.
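As a rough sketch of local scoring, the following Python code loads an exported model.pb with TensorFlow and runs one image through it. The input and output tensor names ("Placeholder:0" and "model_outputs:0") follow the Custom Vision export samples and should be treated as assumptions, as should the image size.

import numpy as np
import tensorflow as tf
from PIL import Image

graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile("model.pb", "rb") as f:   # the exported Custom Vision model file
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name="")

image = np.array(
    Image.open("drone_capture.jpg").resize((224, 224)), dtype=np.float32)[np.newaxis, ...]

with tf.compat.v1.Session(graph=graph) as sess:
    # Tensor names follow the Custom Vision export samples; verify them against your export.
    predictions = sess.run("model_outputs:0", {"Placeholder:0": image})

print(predictions)  # per-tag scores for the classifier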

Deploy the solution


To deploy this reference architecture, follow the steps described in the GitHub repo of the Search and Rescue Lab.

Next steps
Learn more about Microsoft AirSim
Learn more about Azure Custom Vision Cognitive Service
Learn more about Azure Cognitive Services
Visual Assistant
11/2/2020 • 2 minutes to read

Solution Idea

Visual Assistant provides rich information based on the content of an image, with capabilities such as reading business
cards, identifying barcodes, and recognizing popular people, places, objects, artworks, and monuments.

Architecture

Download an SVG of this architecture.

Data Flow
1. Users interact with bot
2. Bot understands context from LUIS
3. Bot passes visual context to the Bing Visual Search API
4. Bot gets additional information from Bing Entity Search for rich context on people, places, artworks, monuments,
and objects.
5. Bot gets additional information for barcodes.
6. Optionally, the bot gets more information on barcodes or queries exclusively from your domain through the Bing
Custom Search API.
7. Assistant renders similar products or destinations from your domain, or provides more information about
celebrities, places, monuments, and artworks.
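To illustrate step 3, the hedged sketch below posts an image to the Bing Visual Search endpoint with the Python requests library; the subscription key and file name are placeholders, and the response handling only prints the returned action types.

import requests

ENDPOINT = "https://api.bing.microsoft.com/v7.0/images/visualsearch"
SUBSCRIPTION_KEY = "<bing-visual-search-key>"  # placeholder

with open("business_card.jpg", "rb") as image_file:  # placeholder image
    response = requests.post(
        ENDPOINT,
        headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
        files={"image": ("image", image_file)},
    )
response.raise_for_status()

for tag in response.json().get("tags", []):
    for action in tag.get("actions", []):
        print(action.get("actionType"))  # for example, VisualSearch, Entity, or TextResults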
Deploy AI and machine learning at the edge by using
Azure Stack Edge
11/2/2020 • 5 minutes to read

This reference architecture illustrates how to use Microsoft Azure Stack Edge to extend the ability to perform rapid
machine learning inferencing from the cloud to on-premises or edge scenarios.

Download a Visio file of this architecture.


Typical uses for this architecture include:
Performing local, rapid machine learning inference against data as it’s ingested by organizations with a
significant on-premises hardware footprint.
Creating long-term research solutions where existing on-premises data is cleaned and used to generate a model.
The model is then used both on-premises and in the cloud; it's retrained regularly as new data arrives.
Building software applications that need to make inferences about users, both at a physical location and online.

Architecture
The architecture consists of the following components:
Azure Machine Learning. Machine Learning is a technology that lets you build, train, deploy, and manage
machine learning models in a cloud-based environment. These models can then deploy to various Azure
services, including (but not limited to) Azure Container Instances, Azure Kubernetes Service (AKS), and Azure
Functions.
Azure Container Registry. Container Registry is a service that creates and manages the Docker Registry.
Container Registry builds, stores, and manages Docker container images and can store containerized machine
learning models.
Azure Stack Edge. Azure Stack Edge is an edge computing device that's designed to perform machine learning
inferencing at the edge and preprocess data before transferring it to Azure. Azure Stack Edge includes compute
acceleration hardware that's designed to improve performance of any AI inferencing that occurs at the edge.
Local data. Local data references any data that's used in the training of the machine learning model. The data
can reside in any local storage solution, including Azure Arc deployments.

Recommendations
Ingesting, transforming, and transferring data stored locally
Azure Stack Edge can transform data sourced from local storage before transferring that data to Azure. This
transformation is performed by using an Azure IoT Edge device that's deployed on the Azure Stack Edge device.
These IoT Edge devices are associated with an Azure IoT Hub resource on the Azure cloud platform.
Each IoT Edge module is a Docker container that performs a specific task in an ingest, transform, and transfer
workflow. For example, an IoT Edge module can collect data from an Azure Stack Edge local share, transform the
data into a format that's ready for machine learning, and then transfer the transformed data to an Azure Stack Edge
cloud share. You can add custom or built-in modules to your IoT Edge device; you can also develop custom IoT Edge
modules to use in Azure Stack Edge.

NOTE
IoT Edge modules are registered as Docker container images in Container Registry.

In the Azure Stack Edge resource on the Azure cloud platform, the cloud share is backed by an Azure Blob storage
account resource. All data in the cloud share will automatically upload to the associated storage account. You can
verify the data transformation and transfer by either mounting the local or cloud share, or by traversing the Azure
Storage account.
Training and deploying a model
After preparing and storing data in Blob storage, you can create a Machine Learning dataset that connects to Azure
Storage. Creating a dataset ensures that you only keep a single copy of your data in storage that's directly
referenced by Machine Learning.
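A minimal sketch of that step with the Azure Machine Learning Python SDK is shown below; the datastore name, file path, and dataset name are assumptions for illustration.

from azureml.core import Dataset, Datastore, Workspace

ws = Workspace.from_config()
datastore = Datastore.get(ws, "edge_training_data")  # assumed datastore registered against the Blob account

dataset = Dataset.Tabular.from_delimited_files(
    path=(datastore, "prepared/sensor-readings/*.csv"))  # hypothetical path to the prepared data

dataset = dataset.register(ws, name="edge-training-data", create_new_version=True)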
You can use the Machine Learning command-line interface (CLI), the R SDK, the Python SDK, designer, or Visual
Studio Code to build the scripts that are required to train your model.
After training and readying the model to deploy, you can deploy it to a variety of Azure services, including but not
limited to:
Azure Container Registry. You can deploy the models to a private Docker registry such as Azure Container
Registry, since they are Docker container images.
Azure Container Instances. You can deploy the model's Docker container image directly to a container group.
Azure Kubernetes Service. You can use Azure Kubernetes Service to automatically scale the model's Docker
container image for high-scale production deployments.
Azure Functions. You can package a model to run directly on a Functions instance.
Azure Machine Learning. You can use compute instances (managed, cloud-based development workstations)
for both training and inferencing of models. You can also similarly deploy the model to on-premises IoT Edge
and Azure Stack Edge devices.

NOTE
For this reference architecture, the model deploys to Azure Stack Edge to make the model available for inferencing on-
premises. The model also deploys to Container Registry to ensure that the model is available for inferencing across the widest
variety of Azure services.

Inferencing the newly deployed model


Azure Stack Edge can run machine learning models locally against data in your on-premises location quickly by
using its built-in compute acceleration hardware. This computation is performed entirely at the edge and can be a
way to get rapid insights from data by using hardware that's closer to the data source than a public cloud region.
Additionally, data can continue to transfer from Azure Stack Edge to Machine Learning for continuous retraining
and improvement by using a machine learning pipeline that's associated with the model that's already running
against data stored locally.

Availability considerations
Consider placing your Azure Stack Edge resource in the same Azure region where you will want to access it from
other Azure services. To optimize upload performance, consider placing your Azure Blob storage account in the
region where your appliance has the best network connection.
Consider Azure ExpressRoute for a stable, redundant connection between your device and Azure.

Manageability considerations
As an administrator, you can verify that the data source from local storage has transferred to the Azure Stack
Edge resource correctly by mounting the Server Message Block (SMB)/Network File System (NFS) file share or
connecting to the associated Blob storage account by using Azure Storage Explorer.
Use Machine Learning datasets to reference your data in Blob storage while training your model. This eliminates
the need to embed secrets, data paths, or connection strings in your training scripts.
In your Azure Machine Learning workspace, register version models so that you can easily keep track of the
differences between your models at different points in time. You can similarly mirror the versioning and tracking
metadata in the tags that you use for the Docker container images that deploy to Container Registry.

DevOps considerations
Review the MLOps lifecycle management approach for Machine Learning. For example, you can use GitHub or
Azure Pipelines to create a continuous integration process to automatically train and retrain a model either when
new data populates the dataset or a change is made to the training scripts.
The Azure Machine Learning workspace will automatically register and manage Docker container images for
machine learning models and IoT Edge modules.

Cost considerations
Use the Azure pricing calculator to estimate costs.
Azure Stack Edge pricing is calculated as a flat-rate monthly subscription with a one-time shipping fee.
Azure Machine Learning also deploys Container Registry, Blob storage, and Azure Key Vault services, which incur
additional costs.
Azure Machine Learning pricing includes charges for the virtual machines that are used for training the model in
the public cloud.
Advanced Analytics Architecture
11/2/2020 • 2 minutes to read

Solution Idea

Transform your data into actionable insights using the best-in-class machine learning tools. This architecture allows
you to combine any data at any scale, and to build and deploy custom machine learning models at scale.

Architecture

Download an SVG of this architecture.

Data Flow
1. Bring together all your structured, unstructured, and semi-structured data (logs, files, and media) by using Azure
Data Factory to move it into Azure Data Lake Storage.
2. Use Azure Databricks to clean and transform the structureless datasets and combine them with structured data
from operational databases or data warehouses.
3. Use scalable machine learning and deep learning techniques to derive deeper insights from this data by using
Python, R, or Scala, with built-in notebook experiences in Azure Databricks.
4. Leverage native connectors between Azure Databricks and Azure Synapse Analytics to access and move data at
scale.
5. Power users take advantage of the inbuilt capabilities of Azure Databricks to perform root cause determination
and raw data analysis.
6. Query and report on data in Power BI.
7. Take the insights from Azure Databricks to Cosmos DB to make them accessible through web and mobile apps.
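As a hedged example of step 4, the following Databricks (PySpark) sketch writes a curated DataFrame to Azure Synapse Analytics through the built-in connector; the storage URLs, JDBC connection string, staging directory, and table name are placeholders.

# Runs in an Azure Databricks notebook, where `spark` is predefined.
df = spark.read.parquet("abfss://curated@<datalake>.dfs.core.windows.net/sales/")  # placeholder path

(df.write
   .format("com.databricks.spark.sqldw")
   .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<dw>;user=<user>;password=<password>")
   .option("tempDir", "abfss://staging@<datalake>.dfs.core.windows.net/tmp")  # staging area used during the load
   .option("forwardSparkAzureStorageCredentials", "true")
   .option("dbTable", "dbo.FactSales")                                        # placeholder target table
   .mode("overwrite")
   .save())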
Components
Azure Synapse Analytics is a fast, flexible, and trusted cloud data warehouse that lets you scale compute and
storage elastically and independently, with a massively parallel processing architecture.
Azure Data Factory is a hybrid data integration service that allows you to create, schedule and orchestrate your
ETL/ELT workflows.
Azure Blob storage is massively scalable object storage for any type of unstructured data (images, videos,
audio, documents, and more), stored easily and cost-effectively.
Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform.
Azure Cosmos DB is a globally distributed, multi-model database service. Learn how to replicate your data
across any number of Azure regions and scale your throughput independent from your storage.
Azure Analysis Services is an enterprise grade analytics as a service that lets you govern, deploy, test, and
deliver your BI solution with confidence.
Power BI is a suite of business analytics tools that deliver insights throughout your organization. Connect to
hundreds of data sources, simplify data prep, and drive ad hoc analysis. Produce beautiful reports, then publish
them for your organization to consume on the web and across mobile devices.

Next steps
Synapse Analytics Documentation
Azure Data Factory V2 Preview Documentation
Introduction to object storage in Azure
Azure Databricks Documentation
Azure Cosmos DB Documentation
Analysis Services Documentation
Power BI Documentation

Pricing Calculator
Customize and get pricing estimates
Anomaly Detector Process
11/2/2020 • 2 minutes to read

Solution Idea

Architecture

Download an SVG of this architecture.

Data Flow
1. Ingests data from the various stores that contain raw data to be monitored by Anomaly Detector.
2. Aggregates, samples, and computes the raw data to generate the time series, or calls the Anomaly Detector API
directly if the time series are already prepared and gets a response with the detection results.
3. Queues the anomaly-related metadata.
4. Based on the anomaly-related metadata, calls the customized alerting service.
5. Stores the anomaly detection metadata.
6. Visualizes the results of the time series anomaly detection.
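The sketch below illustrates step 2 for the case where the time series is already prepared, calling the Anomaly Detector "entire series" REST endpoint with the Python requests library; the endpoint, key, granularity, and generated sample values are assumptions.

import random
from datetime import datetime, timedelta, timezone

import requests

ENDPOINT = "https://<anomaly-detector-resource>.cognitiveservices.azure.com"  # placeholder
API_KEY = "<anomaly-detector-key>"                                            # placeholder

start = datetime(2020, 1, 1, tzinfo=timezone.utc)
series = [
    {"timestamp": (start + timedelta(hours=i)).isoformat(), "value": 30 + random.random()}
    for i in range(48)  # the service requires a minimum number of points per request
]

response = requests.post(
    f"{ENDPOINT}/anomalydetector/v1.0/timeseries/entire/detect",
    headers={"Ocp-Apim-Subscription-Key": API_KEY},
    json={"granularity": "hourly", "series": series},
)
response.raise_for_status()
print(response.json()["isAnomaly"])  # one boolean per point in the series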

Components
Service Bus: Reliable cloud messaging as a service (MaaS) and simple hybrid integration
Azure Databricks: Fast, easy, and collaborative Apache Spark–based analytics service
Power BI: Interactive data visualization BI tools
Storage Accounts: Durable, highly available, and massively scalable cloud storage

Next steps
Service Bus Documentation
Azure Databricks Documentation
Power BI Documentation
Storage Documentation
Application integration using Event Grid
11/2/2020 • 2 minutes to read

Solution Idea

Event Grid connects your app with other services. For example, create an application topic to send your app's event
data to Event Grid and take advantage of its reliable delivery, advanced routing, and direct integration with Azure.
Alternatively, you can use Event Grid with Logic Apps to process data anywhere, without writing code.
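For example, publishing an application event to a custom topic with the azure-eventgrid Python SDK looks roughly like the sketch below; the topic endpoint, access key, and event type are placeholders.

from azure.core.credentials import AzureKeyCredential
from azure.eventgrid import EventGridEvent, EventGridPublisherClient

client = EventGridPublisherClient(
    "https://<topic-name>.<region>-1.eventgrid.azure.net/api/events",  # placeholder topic endpoint
    AzureKeyCredential("<topic-access-key>"),                          # placeholder access key
)

event = EventGridEvent(
    subject="orders/12345",
    event_type="Contoso.Orders.OrderPlaced",  # hypothetical event type
    data={"orderId": "12345", "total": 42.50},
    data_version="1.0",
)

client.send(event)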

Architecture

Download an SVG of this architecture.


Automated enterprise BI with Azure Synapse
Analytics and Azure Data Factory
11/2/2020 • 14 minutes to read

This reference architecture shows how to perform incremental loading in an extract, load, and transform (ELT)
pipeline. It uses Azure Data Factory to automate the ELT pipeline. The pipeline incrementally moves the latest OLTP
data from an on-premises SQL Server database into Azure Synapse. Transactional data is transformed into a
tabular model for analysis.

A reference implementation for this architecture is available on GitHub.

This architecture builds on the one shown in Enterprise BI with Azure Synapse, but adds some features that are
important for enterprise data warehousing scenarios.
Automation of the pipeline using Data Factory.
Incremental loading.
Integrating multiple data sources.
Loading binary data such as geospatial data and images.

Architecture
The architecture consists of the following components.
Data sources
On-premises SQL Server. The source data is located in a SQL Server database on premises. To simulate the on-
premises environment, the deployment scripts for this architecture provision a virtual machine in Azure with SQL
Server installed. The Wide World Importers OLTP sample database is used as the source database.
External data. A common scenario for data warehouses is to integrate multiple data sources. This reference
architecture loads an external data set that contains city populations by year, and integrates it with the data from
the OLTP database. You can use this data for insights such as: "Does sales growth in each region match or exceed
population growth?"
Ingestion and data storage
Blob Storage. Blob storage is used as a staging area for the source data before loading it into Azure Synapse.
Azure Synapse. Azure Synapse is a distributed system designed to perform analytics on large data. It supports
massive parallel processing (MPP), which makes it suitable for running high-performance analytics.
Azure Data Factory. Data Factory is a managed service that orchestrates and automates data movement and
data transformation. In this architecture, it coordinates the various stages of the ELT process.
Analysis and reporting
Azure Analysis Services. Analysis Services is a fully managed service that provides data modeling capabilities.
The semantic model is loaded into Analysis Services.
Power BI. Power BI is a suite of business analytics tools to analyze data for business insights. In this architecture,
it queries the semantic model stored in Analysis Services.
Authentication
Azure Active Directory (Azure AD) authenticates users who connect to the Analysis Services server through
Power BI.
Data Factory can also use Azure AD to authenticate to Azure Synapse, by using a service principal or Managed
Service Identity (MSI). For simplicity, the example deployment uses SQL Server authentication.

Data pipeline
In Azure Data Factory, a pipeline is a logical grouping of activities used to coordinate a task — in this case, loading
and transforming data into Azure Synapse.
This reference architecture defines a parent pipeline that runs a sequence of child pipelines. Each child pipeline
loads data into one or more data warehouse tables.

Incremental loading
When you run an automated ETL or ELT process, it's most efficient to load only the data that changed since the
previous run. This is called an incremental load, as opposed to a full load that loads all the data. To perform an
incremental load, you need a way to identify which data has changed. The most common approach is to use a
high water mark value, which means tracking the latest value of some column in the source table, either a
datetime column or a unique integer column.
Starting with SQL Server 2016, you can use temporal tables. These are system-versioned tables that keep a full
history of data changes. The database engine automatically records the history of every change in a separate
history table. You can query the historical data by adding a FOR SYSTEM_TIME clause to a query. Internally, the
database engine queries the history table, but this is transparent to the application.

NOTE
For earlier versions of SQL Server, you can use Change Data Capture (CDC). This approach is less convenient than temporal
tables, because you have to query a separate change table, and changes are tracked by a log sequence number, rather than
a timestamp.

Temporal tables are useful for dimension data, which can change over time. Fact tables usually represent an
immutable transaction such as a sale, in which case keeping the system version history doesn't make sense.
Instead, transactions usually have a column that represents the transaction date, which can be used as the
watermark value. For example, in the Wide World Importers OLTP database, the Sales.Invoices and
Sales.InvoiceLines tables have a LastEditedWhen field that defaults to sysdatetime().
Here is the general flow for the ELT pipeline:
1. For each table in the source database, track the cutoff time when the last ELT job ran. Store this information
in the data warehouse. (On initial setup, all times are set to '1-1-1900'.)
2. During the data export step, the cutoff time is passed as a parameter to a set of stored procedures in the
source database. These stored procedures query for any records that were changed or created after the
cutoff time. For the Sales fact table, the LastEditedWhen column is used. For the dimension data, system-
versioned temporal tables are used.
3. When the data migration is complete, update the table that stores the cutoff times.
It's also useful to record a lineage for each ELT run. For a given record, the lineage associates that record with the
ELT run that produced the data. For each ETL run, a new lineage record is created for every table, showing the
starting and ending load times. The lineage keys for each record are stored in the dimension and fact tables.

After a new batch of data is loaded into the warehouse, refresh the Analysis Services tabular model. See
Asynchronous refresh with the REST API.

Data cleansing
Data cleansing should be part of the ELT process. In this reference architecture, one source of bad data is the city
population table, where some cities have zero population, perhaps because no data was available. During
processing, the ELT pipeline removes those cities from the city population table. Perform data cleansing on staging
tables, rather than external tables.
Here is the stored procedure that removes the cities with zero population from the City Population table. (You can
find the source file here.)
DELETE FROM [Integration].[CityPopulation_Staging]
WHERE RowNumber IN (
    SELECT DISTINCT RowNumber
    FROM [Integration].[CityPopulation_Staging]
    WHERE POPULATION = 0
    GROUP BY RowNumber
    HAVING COUNT(RowNumber) = 4)

External data sources


Data warehouses often consolidate data from multiple sources. This reference architecture loads an external data
source that contains demographics data. This dataset is available in Azure blob storage as part of the
WorldWideImportersDW sample.
Azure Data Factory can copy directly from blob storage, using the blob storage connector. However, the connector
requires a connection string or a shared access signature, so it can't be used to copy a blob with public read
access. As a workaround, you can use PolyBase to create an external table over Blob storage and then copy the
external tables into Azure Synapse.

Handling large binary data


In the source database, the Cities table has a Location column that holds a geography spatial data type. Azure
Synapse doesn't support the geography type natively, so this field is converted to a varbinary type during
loading. (See Workarounds for unsupported data types.)
However, PolyBase supports a maximum column size of varbinary(8000) , which means some data could be
truncated. A workaround for this problem is to break the data up into chunks during export, and then reassemble
the chunks, as follows:
1. Create a temporary staging table for the Location column.
2. For each city, split the location data into 8000-byte chunks, resulting in 1 – N rows for each city.
3. To reassemble the chunks, use the T-SQL PIVOT operator to convert rows into columns and then
concatenate the column values for each city.
The challenge is that each city will be split into a different number of rows, depending on the size of geography
data. For the PIVOT operator to work, every city must have the same number of rows. To make this work, the T-
SQL query (which you can view here) does some tricks to pad out the rows with blank values, so that every city
has the same number of columns after the pivot. The resulting query turns out to be much faster than looping
through the rows one at a time.
The same approach is used for image data.

Slowly changing dimensions


Dimension data is relatively static, but it can change. For example, a product might get reassigned to a different
product category. There are several approaches to handling slowly changing dimensions. A common technique,
called Type 2, is to add a new record whenever a dimension changes.
In order to implement the Type 2 approach, dimension tables need additional columns that specify the effective
date range for a given record. Also, primary keys from the source database will be duplicated, so the dimension
table must have an artificial primary key.
The following image shows the Dimension.City table. The WWI City ID column is the primary key from the source
database. The City Key column is an artificial key generated during the ETL pipeline. Also notice that the table
has Valid From and Valid To columns, which define the range when each row was valid. Current values have a
Valid To equal to '9999-12-31'.

The advantage of this approach is that it preserves historical data, which can be valuable for analysis. However, it
also means there will be multiple rows for the same entity. For example, here are the records that match
WWI City ID = 28561:

For each Sales fact, you want to associate that fact with a single row in the City dimension table, corresponding to the
invoice date.
The following T-SQL query creates a temporary table that associates each invoice with the correct City Key from
the City dimension table.

CREATE TABLE CityHolder
WITH (HEAP, DISTRIBUTION = HASH([WWI Invoice ID]))
AS
SELECT DISTINCT s1.[WWI Invoice ID] AS [WWI Invoice ID],
       c.[City Key] AS [City Key]
FROM [Integration].[Sale_Staging] s1
CROSS APPLY (
    SELECT TOP 1 [City Key]
    FROM [Dimension].[City]
    WHERE [WWI City ID] = s1.[WWI City ID]
      AND s1.[Last Modified When] > [Valid From]
      AND s1.[Last Modified When] <= [Valid To]
    ORDER BY [Valid From], [City Key] DESC
) c

This table is used to populate a column in the Sales fact table:

UPDATE [Integration].[Sale_Staging]
SET [City Key] = CityHolder.[City Key]   -- target column name assumed; adjust to match the staging schema
FROM CityHolder
WHERE CityHolder.[WWI Invoice ID] = [Integration].[Sale_Staging].[WWI Invoice ID]

This column enables a Power BI query to find the correct City record for a given sales invoice.

Security considerations
For additional security, you can use Virtual Network service endpoints to secure Azure service resources to only
your virtual network. This fully removes public Internet access to those resources, allowing traffic only from your
virtual network.
With this approach, you create a VNet in Azure and then create private service endpoints for Azure services. Those
services are then restricted to traffic from that virtual network. You can also reach them from your on-premises
network through a gateway.
Be aware of the following limitations:
If service endpoints are enabled for Azure Storage, PolyBase cannot copy data from Storage into Azure
Synapse. There is a mitigation for this issue. For more information, see Impact of using VNet Service
Endpoints with Azure storage.
To move data from on-premises into Azure Storage, you will need to allow public IP addresses from your
on-premises environment or from ExpressRoute. For details, see Securing Azure services to virtual networks.
To enable Analysis Services to read data from Azure Synapse, deploy a Windows VM to the virtual network
that contains the Azure Synapse service endpoint. Install the on-premises data gateway on this VM.
Then connect your Azure Analysis Services instance to the data gateway.

DevOps considerations
Create separate resource groups for production, development, and test environments. Separate resource
groups make it easier to manage deployments, delete test deployments, and assign access rights.
Use the Azure Building Blocks templates provided with this architecture, or create Azure Resource Manager
templates, to deploy the Azure resources following the infrastructure as code (IaC) process. Templates make
it easier to automate deployments using Azure DevOps Services or other CI/CD solutions.
Put each workload in a separate deployment template and store the resources in source control systems.
You can deploy the templates together or individually as part of a CI/CD process, making the automation
process easier.
In this architecture, there are three main workloads:
The data warehouse server, Analysis Services, and related resources.
Azure Data Factory.
An on-premises to cloud simulated scenario.
Each workload has its own deployment template.
The data warehouse server is set up and configured by using Azure CLI commands, which follows the
imperative approach of the IaC practice. Consider using deployment scripts and integrating them into the
automation process.
Consider staging your workloads. Deploy to various stages and run validation checks at each stage before
moving to the next stage. That way you can push updates to your production environments in a highly
controlled way and minimize unanticipated deployment issues. Use blue-green deployment and canary
release strategies for updating live production environments.
Have a good rollback strategy for handling failed deployments. For example, you can automatically
redeploy an earlier, successful deployment from your deployment history. See the --rollback-on-error flag
in the Azure CLI.
Azure Monitor is the recommended option for analyzing the performance of your data warehouse and the
entire Azure analytics platform, providing an integrated monitoring experience. Azure Synapse Analytics
provides a monitoring experience within the Azure portal that surfaces insights into your data warehouse
workload. The Azure portal is the recommended tool for monitoring your data warehouse because it provides
configurable retention periods, alerts, recommendations, and customizable charts and dashboards for
metrics and logs.
For more information, see the DevOps section in Microsoft Azure Well-Architected Framework.

Cost considerations
Use the Azure pricing calculator to estimate costs. Here are some considerations for services used in this
reference architecture.
Azure Data Factory
In this architecture, Azure Data Factory automates the ELT pipeline. The pipeline moves the data from an on-
premises SQL Server database into Azure Synapse. The data is then transformed into a tabular model for analysis.
For this scenario, pricing starts at $0.001 per activity run, which includes activity, trigger, and debug runs.
That price is the base charge only for orchestration. You are also charged for execution activities, such as copying
data, lookups, and external activities. Each activity is individually priced. You are also charged for pipelines with no
associated triggers or runs within the month. All activities are prorated by the minute and rounded up.
Example cost analysis
Consider a use case where there are two lookup activities from two different sources. One takes 1 minute and 2
seconds (rounded up to 2 minutes), and the other takes 1 minute, for a total lookup time of 3 minutes. One
data copy activity takes 10 minutes. One stored procedure activity takes 2 minutes. In total, there are four
activity runs. Cost is calculated as follows:
Activity runs: 4 * $0.001 = $0.004
Lookups: 3 * ($0.005 / 60) = $0.00025
Stored procedure: 2 * ($0.00025 / 60) = $0.000008
Data copy: 10 * ($0.25 / 60) * 4 data integration units (DIUs) = $0.167
Total cost per pipeline run: $0.17.
Run once per day for 30 days: about $5.10 per month.
Run once per day per 100 tables for 30 days: about $510 per month.
Every activity has an associated cost. Understand the pricing model and use the ADF pricing calculator to get a
solution optimized not only for performance but also for cost. Manage your costs by starting, stopping, pausing,
and scaling your services.
Azure Synapse
Azure Synapse is ideal for intensive workloads with higher query performance and compute scalability needs. You
can choose the pay-as-you-go model or use reserved plans of one year (37% savings) or 3 years (65% savings).
Data storage is charged separately. Other services such as disaster recovery and threat detection are also charged
separately.
For more information, see Azure Synapse Pricing.
Analysis Services
Pricing for Azure Analysis Services depends on the tier. The reference implementation of this architecture uses the
Developer tier, which is recommended for evaluation, development, and test scenarios. Other tiers include the
Basic tier, which is recommended for small production environments, and the Standard tier for mission-critical
production applications. For more information, see The right tier when you need it.
No charges apply when you pause your instance.
For more information, see Azure Analysis Services pricing.
Blob Storage
Consider using the Azure Storage reserved capacity feature to lower storage costs. With this model, you get a
discount if you commit to a reservation for fixed storage capacity for one or three years. For more information,
see Optimize costs for Blob storage with reserved capacity.
For more information, see the Cost section in Microsoft Azure Well-Architected Framework.

Deploy the solution


To deploy and run the reference implementation, follow the steps in the GitHub readme. It deploys the
following:
A Windows VM to simulate an on-premises database server. It includes SQL Server 2017 and related tools,
along with Power BI Desktop.
An Azure storage account that provides Blob storage to hold data exported from the SQL Server database.
An Azure Synapse instance.
An Azure Analysis Services instance.
Azure Data Factory and the Data Factory pipeline for the ELT job.

Related resources
You may want to review the following Azure example scenarios that demonstrate specific solutions using some of
the same technologies:
Data warehousing and analytics for sales and marketing
Hybrid ETL with existing on-premises SSIS and Azure Data Factory
Enterprise BI in Azure with Azure Synapse.
Big data analytics with Azure Data Explorer

This architecture pattern demonstrates big data analytics over large volumes of high velocity data from various
sources. This pattern illustrates how Azure Data Explorer and Azure Synapse Analytics complement each other for
near real-time analytics and modern data warehousing use cases.
This pattern is already being used by Microsoft customers. For example, the Singapore-based ride-hailing company,
Grab, implemented real-time analytics over a huge amount of data collected from their taxi and food delivery
services as well as merchant partner apps. The team from Grab presented their solution at MS Ignite in this video
(20:30 onwards). Using this pattern, Grab processed more than a trillion events per day.

Data flow
1. Raw structured, semi-structured, and unstructured (free text) data such as any type of logs, business events, and
user activities can be ingested into Azure Data Explorer from various sources.
2. Ingest data into Azure Data Explorer with low-latency and high throughput using its connectors for Azure Data
Factory, Azure Event Hub, Azure IoT Hub, Kafka, and so on. Alternatively, ingest data through Azure Storage (Blob
or ADLS Gen2), which uses Azure Event Grid and triggers the ingestion pipeline to Azure Data Explorer. You can
also continuously export data to Azure Storage in compressed, partitioned parquet format and seamlessly query
that data as detailed in the Continuous data export overview.
3. Export pre-aggregated data from Azure Data Explorer to Azure Storage, and then ingest the data into Synapse
Analytics to build data models and reports.
4. Use Azure Data Explorer’s native capabilities to process, aggregate, and analyze data. To get insights at
lightning speed, build near real-time analytics dashboards using Azure Data Explorer dashboards, Power BI,
Grafana, or other tools. Use Azure Synapse Analytics to build a modern data warehouse and combine it with the
Azure Data Explorer data to generate BI reports on curated and aggregated data models.
5. Azure Data Explorer provides native advanced analytics capabilities for time series analysis, pattern recognition,
anomaly detection and forecasting, and machine learning. Azure Data Explorer is also well integrated with ML
services such as Databricks and Azure Machine Learning. This integration allows you to build models using
other tools and services and export ML models to Azure Data Explorer for scoring data.

Components
Azure Event Hub: Fully managed, real-time data ingestion service that’s simple, trusted, and scalable.
Azure IoT Hub: Managed service to enable bi-directional communication between IoT devices and Azure.
Kafka on HDInsight: Easy, cost-effective, enterprise-grade service for open source analytics with Apache Kafka.
Azure Data Explorer: Fast, fully managed and highly scalable data analytics service for real-time analysis on large
volumes of data streaming from applications, websites, IoT devices, and more.
Azure Data Explorer Dashboards: Natively export Kusto queries that were explored in the Web UI to optimized
dashboards.
Azure Synapse Analytics: Analytics service that brings together enterprise data warehousing and Big Data
analytics.

Next steps
For more information, see Azure Data Explorer documentation.
Content Delivery Network analytics with Azure Data
Explorer

This architecture pattern demonstrates low-latency high throughput ingestion for large volumes of Content
Delivery Network (CDN) logs for building near real-time analytics dashboards.

Data flow
1. Content Delivery Network providers such as Verizon and Fastly ingest huge amounts of CDN logs into Azure
Data Explorer to analyze latencies, health, and performance of CDN assets.
2. Most CDN scenarios ingest data through Azure Storage (Blob or ADLS Gen2), which uses Azure Event Grid and
triggers the ingestion pipeline to Azure Data Explorer. Alternatively you can bulk ingest the data using the
LightIngest tool. You can also continuously export data to Azure Storage in compressed, partitioned parquet
format and seamlessly query that data as detailed in Continuous data export overview.
3. Azure Data Explorer provides easy-to-use native operators and functions to process, aggregate, and analyze time
series and log data, and to supply insights at lightning speed. You can build near real-time analytics
dashboards using Azure Data Explorer dashboards, Power BI, or Grafana.
4. Create and schedule alerts and notifications using Azure Data Explorer connector for Azure Logic Apps.

Components
Azure Storage Azure Data Explorer connector: Continuous ingestion from Azure Storage (Blob storage and
ADLSv2) with Azure Event Grid subscription to stream these notifications to Azure Data Explorer.
Azure Data Explorer: Fast, fully managed and highly scalable data analytics service for real-time analysis on
large volumes of data streaming from applications, websites, IoT devices, and more.
Azure Data Explorer Dashboards: Natively export Kusto queries that were explored in the Web UI to optimized
dashboards.
Azure Logic Apps Azure Data Explorer connector: Run Kusto queries and commands automatically as part of a
scheduled or triggered task.
Next steps
For more information, see Azure Data Explorer documentation.
Data warehousing and analytics for sales and
marketing

This example scenario demonstrates a data pipeline that integrates large amounts of data from multiple sources
into a unified analytics platform in Azure. This specific scenario is based on a sales and marketing solution, but the
design patterns are relevant for many industries requiring advanced analytics of large datasets such as e-
commerce, retail, and healthcare.
This example demonstrates a sales and marketing company that creates incentive programs. These programs
reward customers, suppliers, salespeople, and employees. Data is fundamental to these programs, and the
company wants to improve the insights gained through data analytics using Azure.
The company needs a modern approach to analyzing data, so that decisions are made using the right data at the
right time. The company's goals include:
Combining different kinds of data sources into a cloud-scale platform.
Transforming source data into a common taxonomy and structure, to make the data consistent and easily
compared.
Loading data using a highly parallelized approach that can support thousands of incentive programs, without
the high costs of deploying and maintaining on-premises infrastructure.
Greatly reducing the time needed to gather and transform data, so you can focus on analyzing the data.

Relevant use cases


This approach can also be used to:
Establish a data warehouse to be a single source of truth for your data.
Integrate relational data sources with other unstructured datasets.
Use semantic modeling and powerful visualization tools for simpler data analysis.

Architecture

The data flows through the solution as follows:


1. For each data source, any updates are exported periodically into a staging area in Azure Blob storage.
2. Data Factory incrementally loads the data from Blob storage into staging tables in Azure Synapse Analytics. The
data is cleansed and transformed during this process. PolyBase can parallelize the process for large datasets.
3. After loading a new batch of data into the warehouse, a previously created Analysis Services tabular model is
refreshed. This semantic model simplifies the analysis of business data and relationships.
4. Business analysts use Microsoft Power BI to analyze warehoused data via the Analysis Services semantic model.
Components
The company has data sources on many different platforms:
SQL Server on-premises
Oracle on-premises
Azure SQL Database
Azure table storage
Cosmos DB
Data is loaded from these different data sources using several Azure components:
Blob storage is used to stage source data before it's loaded into Azure Synapse.
Data Factory orchestrates the transformation of staged data into a common structure in Azure Synapse. Data
Factory uses PolyBase when loading data into Azure Synapse to maximize throughput.
Azure Synapse is a distributed system for storing and analyzing large datasets. Its use of massive parallel
processing (MPP) makes it suitable for running high-performance analytics. Azure Synapse can use PolyBase to
rapidly load data from Blob storage.
Analysis Services provides a semantic model for your data. It can also increase system performance when
analyzing your data.
Power BI is a suite of business analytics tools to analyze data and share insights. Power BI can query a semantic
model stored in Analysis Services, or it can query Azure Synapse directly.
Azure Active Directory (Azure AD) authenticates users who connect to the Analysis Services server through
Power BI. Data Factory can also use Azure AD to authenticate to Azure Synapse via a service principal or
Managed identity for Azure resources.
Alternatives
The example pipeline includes several different kinds of data sources. This architecture can handle a wide
variety of relational and non-relational data sources.
Data Factory orchestrates the workflows for your data pipeline. If you want to load data only one time or on
demand, you could use tools like SQL Server bulk copy (bcp) and AzCopy to copy data into Blob storage.
You can then load the data directly into Azure Synapse using PolyBase.
If you have very large datasets, consider using Data Lake Storage, which provides limitless storage for
analytics data.
An on-premises SQL Server Parallel Data Warehouse appliance can also be used for big data processing.
However, operating costs are often much lower with a managed cloud-based solution like Azure Synapse.
Azure Synapse is not a good fit for OLTP workloads or data sets smaller than 250 GB. For those cases you
should use Azure SQL Database or SQL Server.
For comparisons of other alternatives, see:
Choosing a data pipeline orchestration technology in Azure
Choosing a batch processing technology in Azure
Choosing an analytical data store in Azure
Choosing a data analytics technology in Azure

Considerations
The technologies in this architecture were chosen because they met the company's requirements for scalability and
availability, while helping them control costs.
The massively parallel processing architecture of Azure Synapse provides scalability and high performance.
Azure Synapse has guaranteed SLAs and recommended practices for achieving high availability.
When analysis activity is low, the company can scale Azure Synapse on demand, reducing or even pausing
compute to lower costs.
Azure Analysis Services can be scaled out to reduce response times during high query workloads. You can also
separate processing from the query pool, so that client queries aren't slowed down by processing operations.
Azure Analysis Services also has guaranteed SLAs and recommended practices for achieving high availability.
The Azure Synapse security model provides connection security, authentication and authorization via Azure AD
or SQL Server authentication, and encryption. Azure Analysis Services uses Azure AD for identity management
and user authentication.

Pricing
Review a pricing sample for a data warehousing scenario via the Azure pricing calculator. Adjust the values to see
how your requirements affect your costs.
Azure Synapse allows you to scale your compute and storage levels independently. Compute resources are
charged per hour, and you can scale or pause these resources on demand. Storage resources are billed per
terabyte, so your costs will increase as you ingest more data.
Data Factory costs are based on the number of read/write operations, monitoring operations, and orchestration
activities performed in a workload. Your Data Factory costs will increase with each additional data stream and
the amount of data processed by each one.
Analysis Services is available in developer, basic, and standard tiers. Instances are priced based on query
processing units (QPUs) and available memory. To keep your costs lower, minimize the number of queries you
run, how much data they process, and how often they run.
Power BI has different product options for different requirements. Power BI Embedded provides an Azure-based
option for embedding Power BI functionality inside your applications. A Power BI Embedded instance is included
in the pricing sample above.

Next steps
Review the Azure reference architecture for automated enterprise BI, which includes instructions for deploying
an instance of this architecture in Azure.
Read the Maritz Motivation Solutions customer story. That story describes a similar approach to managing
customer data.
Find comprehensive architectural guidance on data pipelines, data warehousing, online analytical processing
(OLAP), and big data in the Azure Data Architecture Guide.
Demand Forecasting

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Accurately forecasting spikes in demand for products and services can give a company a competitive advantage.
This solution focuses on demand forecasting within the energy sector.

Architecture
Download an SVG of this architecture.

Overview
Accurately forecasting spikes in demand for products and services can give a company a competitive advantage.
The better the forecasting, the more they can scale as demand increases, and the less they risk holding onto
unneeded inventory. Use cases include predicting demand for a product in a retail/online store, forecasting hospital
visits, and anticipating power consumption.
This solution focuses on demand forecasting within the energy sector. Storing energy is not cost-effective, so
utilities and power generators need to forecast future power consumption so that they can efficiently balance the
supply with the demand. During peak hours, short supply can result in power outages. Conversely, too much
supply can result in waste of resources. Advanced demand forecasting techniques detail hourly demand and peak
hours for a particular day, allowing an energy provider to optimize the power generation process. This solution,
built on Cortana Intelligence, enables energy companies to quickly introduce powerful forecasting technology into
their business.

Details
The Cortana Intelligence Suite provides advanced analytics tools through Microsoft Azure - data ingestion, data
storage, data processing, and advanced analytics components - all of the essential elements for building a demand
forecasting solution for the energy sector.
This solution combines several Azure services to provide powerful advantages. Event Hubs collects real-time
consumption data. Stream Analytics aggregates the streaming data and makes it available for visualization. Azure
SQL stores and transforms the consumption data. Machine Learning implements and executes the forecasting
model. PowerBI visualizes the real-time energy consumption as well as the forecast results. Finally, Data Factory
orchestrates and schedules the entire data flow.
The 'Deploy' button will launch a workflow that will deploy an instance of the solution within a Resource Group in
the Azure subscription you specify. The solution includes multiple Azure services (described below) along with a
web job that simulates data so that immediately after deployment you have a working end-to-end solution. The
sample data of this solution is simulated from publicly available data from the NYISO.

Technical details and workflow


1. The sample data is streamed by newly deployed Azure Web Jobs.
2. This synthetic data feeds into Azure Event Hubs and the Azure SQL service as data points or events that will be
used in the rest of the solution flow.
3. Azure Stream Analytics analyzes the data to provide near real-time analytics on the input stream from the event
hub and directly publishes to Power BI for visualization.
4. Azure Machine Learning is used to forecast the energy demand of a particular region, given the inputs received.
5. Azure SQL Database is used to store the prediction results received from Azure Machine Learning. These results
are then consumed in the Power BI dashboard.
6. Azure Data Factory handles orchestration and scheduling of the hourly model retraining.
7. Finally, Power BI is used for results visualization, so that users can monitor the energy consumption from a
region in real time and use the forecast demand to optimize the power generation or distribution process.

Pricing Info
Your Azure subscription used for the deployment will incur consumption charges on the services used in this
solution. For pricing details, visit the Azure Pricing Page.
Demand Forecasting + Price Optimization

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Pricing is pivotal for many industries, but it can be one of the most challenging tasks. Companies often struggle to
accurately forecast the fiscal impact of potential tactics, fully consider core business constraints, and fairly validate
pricing decisions once they've been made. As product offerings expand and complicate the calculations behind
real-time pricing decisions, the process grows even more difficult.
This solution addresses those challenges by using historical transaction data to train a demand-forecasting model
in a retail context. It also incorporates the pricing of products in a competing group to predict cannibalization and
other cross-product impacts. A price-optimization algorithm then uses that model to forecast demand at various
price points and factors in business constraints to maximize potential profit.
By using this solution to ingest historical transaction data, predict future demand, and regularly optimize pricing,
you'll have the opportunity to save time and effort around the process and improve your company's profitability.

Architecture

Download an SVG of this architecture.

Components
Azure Data Lake Storage: Data Lake Store stores the weekly raw sales data, which is read by Spark on HDInsight.
Spark on HDInsight ingests the data and executes data preprocessing, forecasting modeling, and price-
optimization algorithms.
Data Factory handles orchestration and scheduling of the model retraining.
Power BI visualizes sales results, the predicted future demand, and the recommended optimal prices for a
variety of products sold in different stores.

Next steps
Learn more about Data Lake Store
Get started with HDInsight using a Spark cluster with R Server
Learn more about Data Factory
Learn more about Power BI
Demand Forecasting and Price Optimization

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Pricing is recognized as a pivotal determinant of success in many industries and can be one of the most challenging
tasks. Companies often struggle with several aspects of the pricing process, including accurately forecasting the
financial impact of potential tactics, taking reasonable consideration of core business constraints, and fairly
validating the executed pricing decisions. Expanding product offerings add further computational requirements to
make real-time pricing decisions, compounding the difficulty of this already overwhelming task.
This solution addresses the challenges raised above by utilizing historical transaction data to train a demand
forecasting model. Pricing of products in a competing group is also incorporated to predict cross-product impacts
such as cannibalization. A price optimization algorithm then employs the model to forecast demand at various
candidate price points and takes into account business constraints to maximize profit. The solution can be
customized to analyze various pricing scenarios as long as the general data science approach remains similar.
The process described above is operationalized and deployed in the Cortana Intelligence Suite. This solution will
enable companies to ingest historical transaction data, predict future demand, and obtain optimal pricing
recommendations on a regular basis. As a result, the solution drives opportunities for improved profitability and
reductions in time and effort allocated to pricing tasks.

Architecture
Download an SVG of this architecture.

Description
Save time and let one of these trained SI partners help you with a proof of concept, deployment & integration of
this solution.
The Cortana Intelligence Suite provides advanced analytics tools through Microsoft Azure - data ingestion, data
storage, data processing and advanced analytics components - all of the essential elements for building a demand
forecasting and price optimization solution.
This solution combines several Azure services to create powerful advantages. Azure Blob Storage stores the weekly
raw sales data. Apache Spark for Azure HDInsight ingests the data and executes data preprocessing, forecasting
modeling and price optimization algorithms. Finally, Data Factory orchestrates and schedules the entire data flow.
The 'Deploy' button will launch a workflow that will deploy an instance of the solution within a Resource Group in
the Azure subscription you specify. The solution includes multiple Azure services (described below) along with a
web job that simulates data so that immediately after deployment you can see data flowing through the end-to-end
pipeline.
For post deployment instructions and more details on the technical implementation, please see the instructions
here.
Technical details and workflow
1. The simulation data is generated hourly by newly deployed Azure Web Jobs.
2. This synthetic data is stored in Azure Blob Storage and will be used in the rest of the solution flow.
3. Spark on HDInsight is used to ingest and preprocess the raw data, build and retrain the demand forecasting
models, and execute price optimization algorithms.
4. Azure Data Factory orchestrates and schedules the entire data flow.
5. Finally, Power BI is used for results visualization, so that users can monitor sales results, predicted
future demand, and recommended optimal prices for a variety of products sold in different stores.
Demand Forecasting for Shipping and Distribution

The Demand Forecasting for Shipping and Distribution Solution uses historical demand data to forecast demand in
future periods across various customers, products and destinations. For instance, a shipping or delivery company
wants to predict the quantities of the different products its customers want delivered at different locations at future
times. A company can use these forecasts as input to an allocation tool that optimizes operations, such as delivery
vehicles routing, or to plan capacity in the longer term.
View on GitHub

Architecture
Download an SVG of this architecture.

Summary
This is an Azure Solution to reduce the uncertainty in forecasted shipments for organizations that need to plan
based on future quantities. This page explains what the Solution does, and how to install a copy that you can run
and modify in your Azure subscription.
Azure Solutions in the Cortana Intelligence Gallery are composed of advanced analytics tools for data ingestion,
data storage, scheduling and advanced analytics components - all of the essential elements for running a demand
forecasting solution that can be integrated with your current production systems. This Solution combines several
Azure services. Azure SQL Server is used for storing forecasts and historical distribution data, Azure Machine
Learning (AML) webservice for hosting the R forecasting code, Azure Data Factory to orchestrate the entire
workflow, and Power BI to visualize it.
Use the GitHub repository to deploy an instance of the Solution for the Azure subscription you specify. This will
bring you through the steps in your subscription needed to create and launch the resources that make up this
solution so that you can run it. The Solution includes multiple Azure services (described below) along with Azure
functions that, among other tasks, simulate the data and populate the database with it, so that immediately after
deployment you will have a working end-to-end solution.

Description
Estimated Daily Cost: $4.66
The Demand Forecasting for Shipping and Distribution Solution uses historical demand data to forecast demand in
future periods across various customers, products and destinations. For instance, a shipping or delivery company
wants to predict the quantities of the different products its customers want delivered at different locations at future
times. Similarly a vendor or insurer wants to know the number of products that will be returned due to failures
over the course of a year. A company can use these forecasts as input to an allocation tool that optimizes
operations, such as delivery vehicles routing, or to plan capacity in the longer term.
Characteristics of all of these forecasting cases are:
There are numerous kinds of items with differing volumes that roll up under one or more category levels.
There is a history available for the quantity of the item at each time in the past. The volumes of the items differ
widely, with possibly a substantial number that have zero volume at times.
The history of items shows both trend and seasonality, possibly at multiple time scales. The quantities
committed or returned are not strongly price sensitive. In other words, the delivery company cannot strongly
influence quantities by short-term changes in prices, although there may be other determinants that affect
volume, such as weather.
Under these conditions we can take advantage of the hierarchy formed among the time series of the different
items. By enforcing consistency so that the quantities lower in the hierarchy (e.g. individual product quantities) sum
to the quantities above (customer product totals) we improve the accuracy of the overall forecast. The same applies
if individual items are grouped into categories, even possibly categories that overlap. For example, one might be
interested in forecasting demand of all products in total, by location, by product category, by customer, etc.
This Solution computes forecasts at all aggregation levels in the hierarchy for each time period specified. For
simplicity, we will refer to both hierarchical and grouped time series as "hierarchical time series."

Shipping and distribution forecasting in use


We thank Kotahi for working with us to develop this Solution. Kotahi is a supply chain company that plans, sources,
and delivers shipping containers for New Zealand exports. Read their Customer Story on how they engaged with us
at Microsoft and a Microsoft Dynamics 2016 Inner Circle partner, DXC Eclipse, to put this into production. The
solution helped to increase forecast accuracy and so improve their ability to choose the right-size container ships, at
the right times, and dispatch them to the right ports.

What's under the hood


The Solution uses five types of resources hosted and managed in Azure:
Azure SQL Server instance (Azure SQL) for persistent storage
Azure Machine Learning (AML) webservice to host the R forecasting code
Azure Blob Storage for intermediate storage of generated forecasts
Azure Data Factory (ADF) that orchestrates regular runs of the AML model
Power BI dashboard to display and drill down on the forecasts
The Solution automates the running of periodic forecasts, at a pace configured in ADF (e.g. monthly), where it
learns a model with the current historical data, and predicts quantities for future periods for all products in the
product hierarchy. Each forecast cycle consists of a round-trip from the database, through the model, then back to
the database. Each cycle measures forecast accuracy by conventional data holdout techniques. You can configure
the number of periods, the product categories and the hierarchy among products. You need to load your current
data in the Azure SQL database, and extract forecasts after each run from the same database. The Solution exposes
the R code model to allow further customizations, and to allow you to simulate historical data, to test the Solution.

Using the forecasting solution: Getting started


See the Technical Solution Guide for a full set of instructions on how to use this Solution as an example of what is
possible with the Cortana Intelligence Suite. For technical problems or questions about deploying this Solution,
please post in the issues tab of the repository.

Solution Dashboard
Here is an example of a snapshot of the forecasts generated by the solution in the PowerBI dashboard that comes
with the Solution.

Pricing Info
Your Azure subscription used for the deployment will incur consumption charges on the services used in this
solution, approximately $4.66/day. For more information, please visit the Pricing Calculator.
Note: If you are no longer using the deployed solution, remember to delete it to stop incurring consumption
charges.
Discovery Hub with Cloud Scale Analytics

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Use Discovery Hub to define a data estate using a graphical user interface, with definitions stored in a metadata
repository. Code for building the data estate is generated automatically while remaining fully customizable. The
resulting modern data warehouse is ready to support cloud scale analytics and AI.

Architecture

Download an SVG of this architecture.

Data Flow
1. Combine all your structured and semi-structured data in Azure Data Lake Storage using Discovery Hub's data
engineering pipeline with hundreds of native data connectors.
2. Clean and transform data using the powerful analytics and computational ability of Azure Databricks.
3. Move cleansed and transformed data to Azure Synapse Analytics, creating one hub for all your data. Take
advantage of native connectors between Azure Databricks (PolyBase) and Azure Synapse Analytics to access and
move data at scale.
4. Build operational reports and analytical dashboards on top of SQL Database to derive insights from the data
and use Azure Analysis Services to serve the data.
5. Run ad-hoc queries directly on data within Azure Databricks.

Components
Azure Data Lake Storage: Massively scalable, secure data lake functionality built on Azure Blob Storage
Azure Databricks: Fast, easy, and collaborative Apache Spark-based analytics platform
Azure Synapse Analytics: Limitless analytics service with unmatched time to insight (formerly SQL Data
Warehouse)
Azure Analysis Services: Enterprise-grade analytics engine as a service
Power BI Embedded: Embed fully interactive, stunning data visualizations in your applications
Next steps
Azure Data Lake Storage documentation
Azure Databricks documentation
Azure Synapse Analytics documentation
Azure Analysis Services documentation
Power BI Embedded documentation
Extend your on-premises big data investments with
HDInsight

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Extend your on-premises big data investments to the cloud and transform your business using the advanced
analytics capabilities of HDInsight.

Architecture

Download an SVG of this architecture.


End-to-end monitoring solution with Azure Data
Explorer

This architecture pattern demonstrates a hybrid end-to-end monitoring solution integrated with Azure Sentinel and
Azure Monitor for ingesting streamed and batched logs from diverse sources, on-premises or any cloud, within an
enterprise ecosystem.

Key features of this architecture pattern


1. Combine features provided by Azure Sentinel and Azure Monitor with Azure Data Explorer to build a flexible
and cost-optimized end-to-end monitoring solution. Below are some examples:
Use Azure Sentinel as a SIEM and SOAR component in the overall monitoring solution where you can
ingest security logs from firewalls, security center, and so on. SIEM is short for security information and
event management, whereas SOAR is short for security orchestration, automation and response.
Use Azure Monitor’s native capabilities for IT asset monitoring, dashboarding, and alerting so you can
ingest logs from VMs, services, and so on.
Use Azure Data Explorer for full flexibility and control in all aspects for all types of logs in the following
scenarios:
Data types that aren't covered out of the box by the Azure Sentinel and Azure Monitor SaaS solutions, such as
application trace logs.
Greater flexibility for building quick and easy near-real-time analytics dashboards, granular role-
based access control, time series analysis, pattern recognition, anomaly detection and forecasting,
and machine learning. Azure Data Explorer is also well integrated with ML services such as
Databricks and Azure Machine Learning. This integration allows you to build models using other
tools and services and export ML models to Azure Data Explorer for scoring data.
Longer data retention is required in a cost-effective manner.
A centralized repository is required for different types of logs. Azure Data Explorer, as a unified big
data analytics platform, allows you to build advanced analytics scenarios.
2. Query across different products without moving data using the Azure Data Explorer proxy feature to analyze
data from Azure Sentinel, Azure Monitor, and Azure Data Explorer in a single query.
3. To ingest logs with low latency and high throughput from on-premises or any other cloud, use native Azure
Data Explorer connectors such as Logstash, Azure Event Hub, or Kafka.
4. Alternatively, ingest data through Azure Storage (Blob or ADLS Gen2) using Apache Nifi, Fluentd, or Fluentbit
connectors. Then use Azure Event Grid to trigger the ingestion pipeline to Azure Data Explorer.
5. You can also continuously export data to Azure Storage in compressed, partitioned parquet format and
seamlessly query that data as detailed in the Continuous data export overview.

NOTE
Azure Sentinel is built on Azure Monitor (Log Analytics), which in turn is built on Azure Data Explorer. Therefore, switching
between these services is seamless. This allows you to reuse Kusto query language queries and dashboards across these
services.

Components
Azure Event Hub: Fully managed, real-time data ingestion service that’s simple, trusted, and scalable.
Azure IoT Hub: Managed service to enable bi-directional communication between IoT devices and Azure.
Kafka on HDInsight: Easy, cost-effective, enterprise-grade service for open source analytics with Apache Kafka.
Azure Data Explorer: Fast, fully managed and highly scalable data analytics service for real-time analysis on
large volumes of data streaming from applications, websites, IoT devices, and more.
Azure Data Explorer Dashboards: Natively export Kusto queries that were explored in the Web UI to optimized
dashboards.
Azure Sentinel: Intelligent security analytics for your entire enterprise.
Azure Monitor: Full observability into your applications, infrastructure, and network

Next steps
For more information, see Azure Data Explorer documentation.
Extract, transform, and load (ETL) using HDInsight

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Extract, transform, and load your big data clusters on demand with Hadoop MapReduce and Apache Spark.

Architecture
Interactive analytics with Azure Data Explorer

This architecture pattern demonstrates interactive analytics with Azure Data Explorer to explore data with ad hoc,
interactive, and lightning fast queries over small to extremely large volumes of data. This data exploration can be
done using native Azure Data Explorer tools or alternative tools of your choice. This architecture focuses on the
integration of Azure Data Explorer with rest of the data platform ecosystem.
This pattern is already being used by Microsoft customers to track user activity, manage user profiles and user
segmentation scenarios. For example, global software company Episerver uses Azure Data Explorer to gain
enhanced consumer insight.

Data flow
1. Raw structured, semi-structured, and unstructured (free text) data such as any type of logs, business events, and
user activities can be ingested into Azure Data Explorer from various sources. Ingest the data in streaming or
batch mode using various methods.
2. Ingest data into Azure Data Explorer with low-latency and high-throughput using its connectors for Azure Data
Factory, Azure Event Hub, Azure IoT Hub, Kafka, and so on. Alternatively, ingest data through Azure Storage (Blob or
ADLS Gen2), which uses Azure Event Grid and triggers the ingestion pipeline to Azure Data Explorer. You can
also continuously export data to Azure Storage in compressed, partitioned parquet format and seamlessly
query that data as detailed in continuous data export overview.
3. Run interactive queries over small to extremely large volumes of data using native Azure Data Explorer tools or
alternative tools of your choice. Azure Data Explorer provides many plugins and integrations with the rest of the
data platform ecosystem. Use any of the following tools and integrations:
For interactive analytics, use the Azure Data Explorer Web UI (the web client for Azure Data Explorer) or
Kusto.Explorer (the rich Windows client for Azure Data Explorer).
To connect to your Azure Data Explorer cluster, use Jupyter notebooks, the Spark connector, any TDS-
compliant SQL client (a small T-SQL sketch follows this list), or JDBC and ODBC connections.
To build new apps or integrate with existing apps or frameworks, use Azure Data Explorer REST APIs and
SDKs available in different languages.
Build near real-time analytics dashboards using Azure Data Explorer dashboards, Power BI, or Grafana.
4. Enrich data by running federated queries that combine data from SQL Database and Azure Cosmos DB using
Azure Data Explorer plugins.
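As an example of the TDS option mentioned above, any TDS-compliant SQL client can run a basic T-SQL query against the cluster, as long as the query stays within the T-SQL subset that Azure Data Explorer supports. The table and column names here are hypothetical:
-- Count trace events per severity level in a hypothetical AppTraces table.
SELECT Level, COUNT(*) AS EventCount
FROM AppTraces
GROUP BY Level;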

Components
Azure Event Hub: Fully managed, real-time data ingestion service that’s simple, trusted, and scalable.
Azure IoT Hub: Managed service to enable bi-directional communication between IoT devices and Azure.
Kafka on HDInsight: Easy, cost-effective, enterprise-grade service for open-source analytics with Apache Kafka.
Azure Data Factory: Hybrid data integration service that simplifies ETL at scale.
Azure Data Explorer: Fast, fully managed and highly scalable data analytics service for real-time analysis on
large volumes of data streaming from applications, websites, IoT devices, and more.
Azure Data Explorer Dashboards: Natively export Kusto queries that were explored in the Web UI to optimized
dashboards.
Azure Cosmos DB: Fully managed fast NoSQL database service for modern app development with open APIs
for any scale.
Azure SQL DB: Build apps that scale with the pace of your business with managed and intelligent SQL in the
cloud.

Next steps
For more information, see Azure Data Explorer documentation.
IoT telemetry analytics with Azure Data Explorer

This architecture pattern demonstrates near real-time analytics over fast flowing, high volume streaming data from
IoT devices, sensors, connected buildings and vehicles, and so on. It focuses on integration of Azure Data Explorer
with other IoT services to cater to both operational and analytical workloads using Cosmos DB and Azure Data
Explorer.
This architecture is already being used by Microsoft customers for IoT device telemetry analytics. For example,
Bosch uses this to combine real-time road conditions with weather data for safer autonomous driving.

Data flow
1. Ingest a wide variety of fast-flowing streaming data such as logs, business events, and user activities from
various sources such as Azure Event Hub, Azure IoT Hub, or Kafka.
2. Process data in real-time using Azure Functions or Azure Stream Analytics.
3. Store streamed JSON messages in Cosmos DB to serve a real-time operational application.
4. Ingest data into Azure Data Explorer with low-latency and high throughput using its connectors for Azure Event
Hub, Azure IoT Hub, Kafka, and so on. Alternatively, ingest data through Azure Storage (Blob or ADLS Gen2),
which uses Azure Event Grid and triggers the ingestion pipeline to Azure Data Explorer. You can also
continuously export data to Azure Storage in compressed, partitioned parquet format and seamlessly query that
data as detailed in the continuous data export overview.
5. Azure Data Explorer is a big data analytics store for serving near real-time analytics applications and
dashboards. As this pattern serves both operational and analytical use cases, data can either be routed to Azure
Data Explorer and Cosmos DB in parallel, or from Cosmos DB to Azure Data Explorer.
Integrate Cosmos DB with Azure Data Explorer using change feed. Every transaction in Cosmos DB can
trigger an Azure Function via this change feed. Data can then be streamed to Event Hub and ingested into
Azure Data Explorer.
From the Azure Function, invoke Azure Digital Twins (ADT) APIs. ADT is used for storing digital models of
physical environments. This service streams data to Event Hub, which then gets ingested into Azure Data
Explorer.
6. Gain insights from data stored in Azure Data Explorer by any of the following methods:
Build a custom analytics app that invokes APIs exposed by ADT and Azure Data Explorer to blend the data
from both sources.
Build near real-time analytics dashboards using Azure Data Explorer dashboards, Power BI, or Grafana.
Create and schedule alerts and notifications using the Azure Data Explorer connector for Azure Logic
Apps.
Analyze data using Azure Data Explorer Web UI, Kusto.Explorer, or Jupyter notebooks.
7. Azure Data Explorer provides native advanced analytics capabilities for time series analysis, pattern recognition,
anomaly detection and forecasting, and machine learning. Azure Data Explorer is also well integrated with ML
services such as Databricks and Azure Machine Learning. This integration allows you to build models using
other tools and services and export ML models to Azure Data Explorer for scoring data.

Components
Azure Event Hub: Fully managed, real-time data ingestion service that’s simple, trusted, and scalable.
Azure IoT Hub: Managed service to enable bi-directional communication between IoT devices and Azure.
Kafka on HDInsight: Easy, cost-effective, enterprise-grade service for open source analytics with Apache Kafka.
Azure Cosmos DB: Fully managed fast NoSQL database service for modern app development with open APIs for
any scale.
Azure Data Explorer: Fast, fully managed and highly scalable data analytics service for real-time analysis on large
volumes of data streaming from applications, websites, IoT devices, and more.
Azure Data Explorer Dashboards: Natively export Kusto queries that were explored in the Web UI to optimized
dashboards.
Azure Digital Twins: Create next-generation IoT solutions that model the real world.

Next steps
For more information, see Azure Data Explorer documentation.
Interactive Price Analytics

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

The Price Analytics solution uses your transactional history data to show you how the demand for your products
responds to the prices you offer, to recommend pricing changes, and allow you to simulate how changes in price
would affect your demand, at a fine granularity.
The solution provides a dashboard, where you can see optimal pricing recommendations, item elasticities at an item-
site-channel-segment level, estimates of related-product effects such as "cannibalization", forecasts given current
prices, and model performance metrics.
Direct interaction with the pricing model in Excel lets you simply paste your sales data there and analyze your
prices without the need to integrate the data into the solution database first, simulate promotions and plot demand
curves (showing demand response to price), and access dashboard data in numerical form.
The rich functionality is not confined to Excel. It is driven by web services that you, or your implementation partner,
can call directly from your business applications, integrating price analysis into your business applications.

Architecture
Download an SVG of this architecture.

Description
At the core of a rigorous price analysis workflow is price elasticity modeling and optimal pricing recommendations.
The state-of-the-art modeling approach mitigates the two worst pitfalls of modeling price sensitivity from historical
data: confounding and data sparsity.
Confounding is the presence of factors other than price which affect demand. We use a "double-ML" approach that
subtracts out the predictable components of price and demand variation before estimating the elasticity,
immunizing the estimates to most forms of confounding. The solution can also be customized by an
implementation partner to use your data capturing potential external demand drivers other than price. Our blog
post gives additional detail on the data science of prices.
Data sparsity occurs because the optimal price varies at a fine grain: businesses can set prices by item, site, sales
channel and even customer segment, but pricing solutions often only give estimates at product category level
because the transaction history may only contain a few sales for each specific situation. Our pricing solution uses
"hierarchical regularization" to produce consistent estimates in such data-poor situations: in absence of evidence,
the model borrows information from other items in the same category, same items in other sites, and so on. As the
amount of historical data on a given item-site-channel combination increases, its elasticity estimate will be fine-
tuned more specifically.
This solution analyzes your historical prices and
shows you in one glance at the dashboard how elastic your product demand is
provides pricing recommendations for every product in your item catalog
discovers related products (substitutes and complements)
lets you simulate promotional scenarios in Excel.

Estimated cost
The estimated cost for the solution is approximately $10/day ($300/month)
$100 for S1 standard ML service plan
$75 for an S2 SQL database
$75 for app hosting plan
$50 in miscellaneous ADF data activities and storage costs
If you are just exploring the solution, you can delete it in a few days or hours. The costs are pro-rated and will cease
to be charged when you delete the Azure components.

Getting started
Deploy the solution with the button on the right. Instructions at the end of the deployment will have important
configuration information. Please leave them open.
The solution deploys with the same example data set of orange juice prices that you find behind the Try-It-Now
button on the right.
While the solution is deploying, you can get a head start and
See what is available in the Try-It-Now dashboard
Peruse the User Guide for usage instructions from the perspective of a pricing analyst (MSFT login required)
Review the Technical Deployment Guide for a technical implementation view (MSFT login required)
Download the interactive Excel worksheet
After the solution deploys, complete the first walkthrough (MSFT login required).

Solution Dashboard
The solution dashboard's most actionable part is the Pricing Suggestion tab. It tells you which of your items are underpriced or overpriced, suggests an optimal price for each item, and shows the predicted impact of adopting the suggestion. The suggestions are prioritized by the largest opportunity to earn incremental gross margin.

Other tabs provide supplemental information illuminating how the system arrived at the suggestions and are discussed in more detail in the User Guide. (You must be logged into GitHub with an MSFT Azure account while the solution is in private preview.)

Solution Architecture
The solution uses an Azure SQL server to store your transactional data and the generated model predictions. There
are a dozen elasticity modeling core services, which are authored in AzureML using Python core libraries. Azure
Data Factory schedules weekly model refreshes. The results display in a PowerBI dashboard. The provided Excel
spreadsheet consumes the predictive Web Services.
Please read the Technical Deployment Guide for a more detailed discussion of the architecture, connecting your
own data and customization (GitHub login required).
Mass ingestion and analysis of news feeds on Azure
11/2/2020 • 6 minutes to read • Edit Online

This example scenario describes a pipeline for mass ingestion and near real-time analysis of documents using
public RSS news feeds. It uses Azure Cognitive Services to offer useful insights including text translation, facial
recognition, and sentiment detection.
This scenario contains examples for English, Russian, and German news feeds, but you can easily extend it to other
RSS feeds. For ease of deployment, the data collection, processing, and analysis are based entirely on Azure
services.

Relevant use cases


While this scenario is based on processing of RSS feeds, it's relevant to any document, website, or article where
you would need to:
Translate any text to the language of choice.
Find key phrases, entities, and user sentiment in digital content.
Detect objects, text, and landmarks in images associated with a digital article.
Detect people by their gender and age in any image associated with digital content.

Architecture
The data flows through the solution as follows:
1. An RSS news feed acts as the generator that obtains data from a document or article. For example, with an
article, data typically includes a title, a summary of the original body of the news item, and sometimes
images.
2. A generator or ingestion process inserts the article and any associated images into an Azure Cosmos DB
Collection.
3. A notification triggers an ingest function in Azure Functions that stores the article text in Cosmos DB and the
article images (if any) in Azure Blob Storage. The article is then passed to the next queue.
4. A translate function is triggered by the queue event. It uses the Translate Text API of Azure Cognitive Services
to detect the language, translate if necessary, and collect the sentiment, key phrases, and entities from the
body and the title. Then it passes the article to the next queue.
5. A detect function is triggered from the queued article. It uses the Computer Vision service to detect objects,
landmarks, and written words in the associated image, then passes the article to the next queue.
6. A face function is triggered from the queued article. It uses the Azure Face API service to detect
faces for gender and age in the associated image, then passes the article to the next queue.
7. When all functions are complete, the notify function is triggered. It loads the processed records for the
article and scans them for any results you want. If found, the content is flagged and a notification is sent to
the system of your choice.
At each processing step, the function writes the results to Azure Cosmos DB. Ultimately, the data can be used as
desired. For example, you can use it to enhance business processes, locate new customers, or identify customer
satisfaction issues.
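To make the flow above concrete, here is a minimal sketch of one queue-triggered step (the translate stage), assuming the in-process Azure Functions programming model with Service Bus bindings. The queue names, the connection setting, and the TranslateAndScoreAsync helper are hypothetical placeholders, not part of the reference implementation.

// Minimal sketch of one queue-triggered pipeline step (the translate stage).
// Queue names, the connection setting, and TranslateAndScoreAsync are placeholders.
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class TranslateFunction
{
    [FunctionName("Translate")]
    [return: ServiceBus("detect-queue", Connection = "ServiceBusConnection")]
    public static async Task<string> Run(
        [ServiceBusTrigger("translate-queue", Connection = "ServiceBusConnection")] string articleJson,
        ILogger log)
    {
        log.LogInformation("Translating article of length {Length}", articleJson.Length);

        // Call Cognitive Services (Translator, Text Analytics), write results to Cosmos DB,
        // then forward the article to the next queue via the return binding.
        string enriched = await TranslateAndScoreAsync(articleJson);
        return enriched;
    }

    // Placeholder for the real Cognitive Services calls.
    private static Task<string> TranslateAndScoreAsync(string articleJson) =>
        Task.FromResult(articleJson);
}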
Components
The following list of Azure components is used in this example.
Azure Storage is used to hold raw image and video files associated with an article. A secondary storage
account is created with Azure App Service and is used to host the Azure Function code and logs.
Azure Cosmos DB holds article text, image, and video tracking information. The results of the Cognitive
Services steps are also stored here.
Azure Functions executes the function code used to respond to queue messages and transform the
incoming content. Azure App Service hosts the function code and processes the records serially. This
scenario includes five functions: Ingest, Transform, Detect Object, Face, and Notify.
Azure Service Bus hosts the queues used by the functions.
Azure Cognitive Services provides the AI for the pipeline based on implementations of the Computer Vision
service, Face API, and Translate Text machine translation service.
Azure Application Insights provides analytics to help you diagnose issues and to understand functionality of
your application.
Alternatives
Instead of using a pattern based on queue notification and Azure Functions, use another pattern for this data flow. For example, Azure Service Bus topics can be used to process the various parts of the article in parallel, as opposed to the serial processing done in this example (see the sketch after this list). For more information, compare queues and topics.
Use Azure Logic Apps to implement the function code and implement record-level locking such as Redlock
(needed for parallel processing until Azure Cosmos DB supports partial document updates). For more
information, compare Functions and Logic Apps.
Implement this architecture using customized AI components rather than existing Azure services. For
example, extend the pipeline using a customized model that detects certain people in an image as opposed
to the generic people count, gender, and age data collected in this example. To use customized machine
learning or AI models with this architecture, build the models as RESTful endpoints so they can be called
from Azure Functions.
Use a different input mechanism instead of RSS feeds. Use multiple generators or ingestion processes to
feed Azure Cosmos DB and Azure Storage.
Azure Cognitive Search is an AI feature in Azure Search that can also be used to extract text from images, blobs, and other unstructured data sources.
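As a sketch of the topic-based alternative mentioned in the first item above, the following publishes an article to a Service Bus topic with the Azure.Messaging.ServiceBus client so that several subscriptions (for example, translate, detect, and face) can each receive a copy and work in parallel. The names and connection string are placeholders.

// Minimal sketch: publishing to a Service Bus topic so multiple subscriptions
// can process the same article in parallel. Names are placeholders.
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

class TopicPublisher
{
    static async Task Main()
    {
        const string connectionString = "<service-bus-connection-string>"; // placeholder
        const string topicName = "articles";                               // placeholder

        await using var client = new ServiceBusClient(connectionString);
        ServiceBusSender sender = client.CreateSender(topicName);

        // Each subscription on the topic receives its own copy of this message,
        // so the translate, detect, and face stages can run independently.
        await sender.SendMessageAsync(new ServiceBusMessage("{ \"id\": \"article-001\" }"));
    }
}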

Considerations
For simplicity, this example scenario uses only a few of the available APIs and services from Azure Cognitive
Services. For example, text in images can be analyzed using the Text Analytics API. The target language in this
scenario is assumed to be English, but you can change the input to any supported language of your choice.
Scalability
Azure Functions scaling depends on the hosting plan you use. This solution assumes a Consumption plan, in which
compute power is automatically allocated to the functions when required. You pay only when your functions are
running. Another option is to use an Azure App Service plan, which allows you to scale between tiers to allocate a
different amount of resources.
With Azure Cosmos DB, the key is to distribute your workload roughly evenly among a sufficiently large number of
partition keys. There's no limit to the total amount of data that a container can store or to the total amount of
throughput that a container can support.
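For illustration, the following minimal sketch (using the Microsoft.Azure.Cosmos v3 SDK) creates a container with an explicit partition key path so that documents spread evenly across logical partitions. The database name, container name, and key path are hypothetical and not taken from the reference implementation.

// Minimal sketch: creating a Cosmos DB container with an explicit partition key path.
// The connection string, database, container, and key path are placeholders.
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

class ContainerSetup
{
    static async Task Main()
    {
        using var client = new CosmosClient("<cosmos-connection-string>"); // placeholder
        Database database = await client.CreateDatabaseIfNotExistsAsync("newsdb");

        // Choose a partition key with many distinct values (for example, a feed or
        // article identifier) so the workload is distributed across partition keys.
        Container container = await database.CreateContainerIfNotExistsAsync(
            id: "articles",
            partitionKeyPath: "/feedId",
            throughput: 400);
    }
}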
Management and logging
This solution uses Application Insights to collect performance and logging information. An instance of Application
Insights is created with the deployment in the same resource group as the other services needed for this
deployment.
To view the logs generated by the solution:
1. Go to Azure portal and navigate to the resource group created for the deployment.
2. Click the Application Insights instance.
3. From the Application Insights section, navigate to Investigate\Search and search the data.
Security
Azure Cosmos DB uses a secured connection and shared access signature through the C# SDK provided by
Microsoft. There are no other externally facing surface areas. Learn more about security best practices for Azure
Cosmos DB.

Pricing
The estimated daily cost to keep this deployment available is approximately $20 U.S. with no data moving through
the system.
Azure Cosmos DB is powerful but incurs the greatest cost in this deployment. You can use another storage solution
by refactoring the Azure Functions code provided.
Pricing for Azure Functions varies depending on the plan it runs in.

Deploy the scenario


NOTE
You must have an existing Azure account. If you don't have an Azure subscription, create a free account before you begin.

All the code for this scenario is available in the GitHub repository. This repository contains the source code used to
build the generator application that feeds the pipeline for this demo.
Oil and Gas Tank Level Forecasting
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Today, most facilities operate reactively to problems in tank levels. This often leads to spills, emergency shutdowns,
expensive remediation costs, regulatory issues, costly repairs and fines. Tank level forecasting helps manage and
abate these and other problems.
Forecasts are created by harnessing the power of real-time and historical data readily available from sensors,
meters and records, which helps to:
Prevent tank spillage and emergency shutdowns
Discover hardware malfunction or failure
Schedule maintenance, shutdowns, and logistics
Optimize operations and facility efficiency
Detect pipeline leaks and slugging
Reduce costs, fines, and downtime
The tank level forecasting process starts at the well input. Oil is measured as it comes into the facility via meters
and is sent to tanks. Levels are monitored and recorded in tanks during the refining process and then oil, gas, and
water output are recorded via sensors, meters, and records. Forecasts are then made using data from the facility;
for example, forecasts can be made every 15 minutes.
The Cortana Intelligence Suite is adaptable and can be customized to meet different requirements that facilities and
corporations have.

Architecture
Download an SVG of this architecture.

Description
For more details on how this solution is built, visit the solution guide in GitHub.
The Cortana Intelligence Suite provides advanced analytics tools through Microsoft Azure - data ingestion, data
storage, data processing and advanced analytics components - all of the essential elements for building a tank level
forecasting solution.
This solution combines several Azure services to provide powerful advantages. Event Hubs collects real-time tank
level data. Stream Analytics aggregates the streaming data and makes it available for visualization. Azure Synapse
Analytics stores and transforms the tank level data. Machine Learning implements and executes the forecasting
model. Power BI visualizes the real-time tank level as well as the forecast results. Finally, Data Factory orchestrates
and schedules the entire data flow.
The 'Deploy' button will launch a workflow that will deploy an instance of the solution within a Resource Group in
the Azure subscription you specify. The solution includes multiple Azure services (described below) along with a
web job that simulates data so that immediately after deployment you have a working end-to-end solution.
After deployment, see the post deployment instructions.

Technical details and workflow


1. The data feeds into Azure Event Hubs and Azure Synapse Analytics as data points or events that are used in the rest of the solution flow.
2. Azure Stream Analytics analyzes the data to provide near real-time analytics on the input stream from the event hub and publishes directly to Power BI for visualization.
3. Azure Machine Learning is used to forecast the tank level of a particular region, given the inputs received.
4. Azure Synapse Analytics is used to store the prediction results received from Azure Machine Learning. These
results are then consumed in the Power BI dashboard.
5. Azure Data Factory handles orchestration and scheduling of the hourly model retraining.
6. Finally, Power BI is used for results visualization, so that users can monitor the tank level from a facility in real
time and use the forecast level to prevent spillage.
Partitioning in Event Hubs and Kafka
11/13/2020 • 14 minutes to read • Edit Online

This reference architecture provides strategies for the partitioning model that event ingestion services use.
Because event ingestion services provide solutions for high-scale event streaming, they need to process events in
parallel and be able to maintain event order. They also need to balance loads and offer scalability. Partitioning
models meet all of these requirements.
Specifically, this document discusses the following strategies:
How to assign events to partitions.
How many partitions to use.
How to assign partitions to subscribers when rebalancing.
Many event ingestion technologies exist, including:
Azure Event Hubs: A fully managed big data streaming platform.
Apache Kafka: An open-source stream-processing platform.
Event Hubs with Kafka: An alternative to running your own Kafka cluster. This Event Hubs feature provides an
endpoint that is compatible with Kafka APIs.
Besides offering partitioning strategies, this document also points out differences between partitioning in Event
Hubs and Kafka.

Architecture

At the center of the diagram is a box labeled Kafka Cluster or Event Hub Namespace. Three smaller boxes sit inside
that box. Each is labeled Topic or Event Hub, and each contains multiple rectangles labeled Partition. Above the
main box are rectangles labeled Producer. Arrows point from the producers to the main box. Below the main box
are rectangles labeled Consumer. Arrows point from the main box to the consumers and are labeled with various
offset values. A single blue frame labeled Consumer Group surrounds two of the consumers, grouping them
together.
Producers publish data to the ingestion service, or pipeline. Event Hubs pipelines consist of namespaces.
The Kafka equivalents are clusters.
The pipeline distributes incoming events among partitions. Within each partition, events remain in
production order. Events don't remain in sequence across partitions, however. The number of partitions can
affect throughput, or the amount of data that passes through the system in a set period of time. Pipelines
usually measure throughput in bits per second (bps), and sometimes in data packets per second (pps).
Partitions reside within named streams of events. Event Hubs calls these streams event hubs. In Kafka,
they're topics.
Consumers are processes or applications that subscribe to topics. Each consumer reads a specific subset of
the event stream. That subset can include more than one partition. However, the pipeline can assign each
partition to only one consumer at a time.
Multiple consumers can make up consumer groups. When a group subscribes to a topic, each consumer in
the group has a separate view of the event stream. The applications work independently from each other, at
their own pace. The pipeline can also use consumer groups for load sharing.
Consumers process the feed of published events that they subscribe to. Consumers also engage in
checkpointing. Through this process, subscribers use offsets to mark their position within a partition event
sequence. An offset is a placeholder that works like a bookmark to identify the last event that the consumer
read.
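As a minimal illustration of checkpointing with offsets, the following sketch uses the Azure.Messaging.EventHubs.Processor library to read events from an event hub and record its position in Blob Storage. The connection strings, container name, and event hub name are placeholders.

// Minimal sketch: an Event Hubs consumer that checkpoints its offset to Blob Storage.
// Connection strings and names are placeholders.
using System;
using System.Threading.Tasks;
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Consumer;
using Azure.Messaging.EventHubs.Processor;
using Azure.Storage.Blobs;

class CheckpointingConsumer
{
    static async Task Main()
    {
        var checkpointStore = new BlobContainerClient(
            "<storage-connection-string>", "checkpoints");   // placeholders
        var processor = new EventProcessorClient(
            checkpointStore,
            EventHubConsumerClient.DefaultConsumerGroupName,
            "<event-hubs-connection-string>",                // placeholder
            "<event-hub-name>");                             // placeholder

        processor.ProcessEventAsync += async args =>
        {
            Console.WriteLine(
                $"Partition {args.Partition.PartitionId}, offset {args.Data.Offset}");

            // The checkpoint marks the last event this consumer group has read in this
            // partition, so processing resumes from here after a restart.
            await args.UpdateCheckpointAsync(args.CancellationToken);
        };
        processor.ProcessErrorAsync += args =>
        {
            Console.WriteLine($"Error in partition {args.PartitionId}: {args.Exception}");
            return Task.CompletedTask;
        };

        await processor.StartProcessingAsync();
        await Task.Delay(TimeSpan.FromSeconds(30));
        await processor.StopProcessingAsync();
    }
}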

Recommendations
Keep the following recommendations in mind when developing a partitioning strategy.
Distribute events to partitions
One aspect of the partitioning strategy is the assignment policy. An event that arrives at an ingestion service goes
to a partition. The assignment policy determines that partition.
Each event stores its content in its value. Besides the value, each event also contains a key, as the following
diagram shows:

At the center of the diagram are multiple pairs of boxes. A label below the boxes indicates that each pair represents
a message. Each message contains a blue box labeled Key and a black box labeled Value. The messages are
arranged horizontally. Arrows between messages that point from left to right indicate that the messages form a
sequence. Above the messages is the label Stream. Brackets indicate that the sequence forms a stream.
The key contains data about the event and can also play a role in the assignment policy.
Multiple approaches exist for assigning events to partitions:
By default, services distribute events among partitions in a round-robin fashion.
Producers can specify a partition ID with an event. The event then goes to the partition with that ID.
Producers can provide a value for the event key. When they do, a hashing-based partitioner determines a hash
value from the key. The event then goes to the partition associated with that hash value.
Keep these recommendations in mind when choosing an assignment policy:
Use partition IDs when consumers are only interested in certain events. When those events flow to a single
partition, the consumer can easily receive them by subscribing to that partition.
Use keys when consumers need to receive events in production order. Since all events with the same key go to
the same partition, events with key values can maintain their order during processing. Consumers then receive
them in that order.
With Kafka, if event grouping or ordering isn't required, avoid keys. The producer doesn't know the status of the
destination partition in Kafka. If a key routes an event to a partition that's down, delays or lost events can result.
In Event Hubs, events with keys first pass through a gateway before proceeding to a partition. This approach
prevents events from going to unavailable partitions.
The shape of the data can influence the partitioning approach. Consider how the downstream architecture will
distribute the data when deciding on assignments.
If consumers aggregate data on a certain attribute, you should partition on that attribute, too.
When storage efficiency is a concern, partition on an attribute that concentrates the data to help speed up
storage operations.
Ingestion pipelines sometimes shard data to get around problems with resource bottlenecks. In these
environments, align the partitioning with how the shards are split in the database.
Determine the number of partitions
Use these guidelines to decide how many partitions to use:
Use more partitions to achieve more throughput. Each consumer reads from its assigned partition. So with
more partitions, more consumers can receive events from a topic at the same time.
Use at least as many partitions as the value of your target throughput in megabytes per second (MBps).
To avoid starving consumers, use at least as many partitions as consumers. For instance, suppose eight
partitions are assigned to eight consumers. Any additional consumers that subscribe will have to wait.
Alternatively, you can keep one or two consumers ready to receive events when an existing consumer fails.
Use more keys than partitions. Otherwise, some partitions won't receive any events, leading to unbalanced
partition loads.
In Kafka, and in Event Hubs at the Dedicated tier, you can change the number of partitions of an operating event hub or topic. However, avoid making that change if you use keys to preserve event ordering. The reasons involve the following facts (a small illustration follows this list):
Customers rely on certain partitions and the order of the events they contain.
When the number of partitions changes, the mapping of events to partitions can change. For instance,
when the partition count changes, this formula can produce a different assignment:
partition assignment = hash key % number of partitions
Kafka and Event Hubs don't attempt to redistribute events that arrived at partitions before the shuffle. As
a result, the guarantee no longer holds that events arrive at a certain partition in publication order.
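The following small C# illustration applies the formula above to show how the same keys can map to different partitions when the partition count changes. The hash function here is a simplified stand-in, not the partitioner that Kafka or Event Hubs actually uses.

// Small illustration of the assignment formula above: the partition for a key is
// hash(key) % partitionCount, so changing the partition count remaps existing keys.
using System;

class PartitionRemapDemo
{
    // Simplified stand-in for the hashing-based partitioner; real pipelines use their
    // own hash functions.
    static int AssignPartition(string key, int partitionCount) =>
        Math.Abs(key.GetHashCode()) % partitionCount;

    static void Main()
    {
        string[] keys = { "customer-17", "customer-42", "customer-99" };
        foreach (var key in keys)
        {
            Console.WriteLine(
                $"{key}: 4 partitions -> {AssignPartition(key, 4)}, " +
                $"8 partitions -> {AssignPartition(key, 8)}");
        }
        // When the counts differ, events for the same key can land in a different
        // partition after the change, breaking the per-key ordering guarantee.
    }
}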
Besides these guidelines, you can also use this rough formula to determine the number of partitions:
max(t/p, t/c)

It uses the following values:


t: The target throughput.
p: The production throughput on a single partition.
c: The consumption throughput on a single partition.

For example, consider this situation:


The ideal throughput is 2 MBps. For the formula, t is 2 MBps.
A producer sends events at a rate of 1,000 events per second, each about 1 KB in size, making p 1 MBps.
A consumer receives events at a rate of 500 events per second, setting c to 0.5 MBps.

With these values, the number of partitions is 4:


max(t/p, t/c) = max(2/1, 2/0.5) = max(2, 4) = 4
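If it helps, the calculation can be captured in a small helper. The figures below are the example values from the text; in practice they would come from your own throughput measurements.

// A minimal helper for the rough sizing formula above, max(t/p, t/c), rounded up to a
// whole number of partitions.
using System;

class PartitionSizing
{
    static int EstimatePartitions(double targetMBps, double producerMBps, double consumerMBps) =>
        (int)Math.Ceiling(Math.Max(targetMBps / producerMBps, targetMBps / consumerMBps));

    static void Main()
    {
        // t = 2 MBps, p = 1 MBps, c = 0.5 MBps  =>  max(2, 4) = 4 partitions
        Console.WriteLine(EstimatePartitions(targetMBps: 2, producerMBps: 1, consumerMBps: 0.5));
    }
}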

When measuring throughput, keep these points in mind:


The slowest consumer determines the consumption throughput. However, sometimes no information is
available about downstream consumer applications. In this case, estimate the throughput by starting with
one partition as a baseline. (Use this setup only in testing environments, not in production systems). Event
Hubs with Standard tier pricing and one partition should produce throughput between 1 MBps and 20
MBps.
Consumers can consume events from an ingestion pipeline at a high rate only if producers send events at a
comparable rate. To determine the total required capacity of the ingestion pipeline, measure the producer's
throughput, not just the consumer's.
Assign partitions to consumers when rebalancing
When consumers subscribe or unsubscribe, the pipeline rebalances the assignment of partitions to consumers. By
default, Event Hubs and Kafka use a round robin approach for rebalancing. This method distributes partitions
evenly across members.
With Kafka, if you don't want the pipeline to automatically rebalance assignments, you can statically assign
partitions to consumers. But you need to make sure that all partitions have subscribers and that the loads are
balanced.
Besides the default round-robin strategy, Kafka offers two other strategies for automatic rebalancing (a configuration sketch follows this list):
Range assignor: Use this approach to bring together partitions from different topics. This assignment identifies
topics that use the same number of partitions and the same key-partitioning logic. Then it joins partitions from
those topics when making assignments to consumers.
Sticky assignor: Use this assignment to minimize partition movement. Like round robin, this strategy ensures a
uniform distribution. However, it also preserves existing assignments during rebalancing.
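As a sketch of how a strategy is selected in practice, the Confluent .NET client (the same client used in the examples later in this article) exposes the assignment strategy as consumer configuration. The broker and group names are placeholders.

// Minimal sketch: selecting a rebalancing strategy with the Confluent .NET Kafka client.
// Broker and consumer group names are placeholders.
using Confluent.Kafka;

class RebalanceStrategyExample
{
    static IConsumer<string, string> BuildConsumer()
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "<broker>",          // placeholder
            GroupId = "<consumer-group>",           // placeholder

            // CooperativeSticky preserves existing assignments during a rebalance,
            // minimizing partition movement; Range and RoundRobin are also available.
            PartitionAssignmentStrategy = PartitionAssignmentStrategy.CooperativeSticky
        };

        return new ConsumerBuilder<string, string>(config).Build();
    }
}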

Considerations
Keep these points in mind when using a partitioning model.
Scalability considerations
Using a large number of partitions can limit scalability:
In Kafka, brokers store event data and offsets in files. The more partitions you use, the more open file
handles you'll have. If the operating system limits the number of open files, you may need to reconfigure
that setting.
In Event Hubs, users don't face file system limitations. However, each partition manages its own Azure blob
files and optimizes them in the background. A large number of partitions makes it expensive to maintain
checkpoint data. The reason is that I/O operations can be time-consuming, and the storage API calls are
proportional to the number of partitions.
Each producer for Kafka and Event Hubs stores events in a buffer until a sizeable batch is available or until a
specific amount of time passes. Then the producer sends the events to the ingestion pipeline. The producer
maintains a buffer for each partition. When the number of partitions increases, the memory requirement of
the client also expands. If consumers receive events in batches, they may also face the same issue. When
consumers subscribe to a large number of partitions but have limited memory available for buffering,
problems can arise.
Availability considerations
A significant number of partitions can also adversely affect availability:
Kafka generally positions partitions on different brokers. When a broker fails, Kafka rebalances the
partitions to avoid losing events. The more partitions there are to rebalance, the longer the failover takes,
increasing unavailability. Limit the number of partitions to the low thousands to avoid this issue.
The more partitions you use, the more physical resources you put in operation. Depending on the client
response, more failures can then occur.
With more partitions, the load-balancing process has to work with more moving parts and more stress.
Transient exceptions can result. These errors can occur when there are temporary disturbances, such as
network issues or intermittent internet service. They can appear during an upgrade or load balancing, when
Event Hubs sometimes moves partitions to different nodes. Handle transient behavior by incorporating
retries to minimize failures. Use the EventProcessorClient in the .NET and Java SDKs or the
EventHubConsumerClient in the Python and JavaScript SDKs to simplify this process.
Performance considerations
In Kafka, events are committed after the pipeline has replicated them across all in-sync replicas. This approach
ensures high availability of events. Since consumers only receive committed events, the replication process adds to the latency, which in ingestion pipelines refers to the time between when a producer publishes an event and
a consumer reads it. According to experiments that Confluent ran, replicating 1,000 partitions from one broker to
another can take about 20 milliseconds. The end-to-end latency is then at least 20 milliseconds. When the number
of partitions increases further, the latency also grows. This drawback doesn't apply to Event Hubs.
Security considerations
In Event Hubs, publishers use a Shared Access Signature (SAS) token to identify themselves. Consumers connect
via an AMQP 1.0 session. This state-aware bidirectional communication channel provides a secure way to transfer
messages. Kafka also offers encryption, authorization, and authentication features, but you have to implement
them yourself.

Examples
The following code examples demonstrate how to maintain throughput, distribute to a specific partition, and
preserve event order.
Maintain throughput
This example involves log aggregation. The goal isn't to process events in order, but rather, to maintain a specific
throughput.
A Kafka client implements the producer and consumer methods. Since order isn't important, the code doesn't send
messages to specific partitions. Instead, it uses the default partitioning assignment:

public static void RunProducer(string broker, string connectionString, string topic)


{
// Set the configuration values of the producer.
var producerConfig = new ProducerConfig
{
BootstrapServers = broker,
SecurityProtocol = SecurityProtocol.SaslSsl,
SaslMechanism = SaslMechanism.Plain,
SaslUsername = "$ConnectionString",
SaslPassword = connectionString,
};

// Set the message key to Null since the code does not use it.
using (var p = new ProducerBuilder<Null, string>(producerConfig).Build())
{
try
{
// Send a fixed number of messages. Use the Produce method to generate
// many messages in rapid succession instead of the ProduceAsync method.
for (int i=0; i < NumOfMessages; i++)
{
string value = "message-" + i;
Console.WriteLine($"Sending message with key: not-specified," +
$"value: {value}, partition-id: not-specified");
p.Produce(topic, new Message<Null, string> { Value = value });
}

// Wait up to 10 seconds for any in-flight messages to be sent.


p.Flush(TimeSpan.FromSeconds(10));
}
catch (ProduceException<Null, string> e)
{
Console.WriteLine($"Delivery failed with error: {e.Error.Reason}");
}
}
}

public static void RunConsumer(string broker, string connectionString, string consumerGroup, string topic)
{
var consumerConfig = new ConsumerConfig
{
BootstrapServers = broker,
SecurityProtocol = SecurityProtocol.SaslSsl,
SocketTimeoutMs = 60000,
SessionTimeoutMs = 30000,
SaslMechanism = SaslMechanism.Plain,
SaslUsername = "$ConnectionString",
SaslPassword = connectionString,
GroupId = consumerGroup,
AutoOffsetReset = AutoOffsetReset.Earliest
};

using (var c = new ConsumerBuilder<string, string>(consumerConfig).Build())


{
c.Subscribe(topic);

CancellationTokenSource cts = new CancellationTokenSource();


Console.CancelKeyPress += (_, e) =>
{
e.Cancel = true;
cts.Cancel();
};

try
{
while (true)
{
try
{
var message = c.Consume(cts.Token);
Console.WriteLine($"Consumed - key: {message.Message.Key}, "+
$"value: {message.Message.Value}, " +
$"partition-id: {message.Partition}," +
$"offset: {message.Offset}");
}
catch (ConsumeException e)
{
Console.WriteLine($"Error occurred: {e.Error.Reason}");
}
}
}
catch(OperationCanceledException)
{
// Close the consumer to ensure that it leaves the group cleanly
// and that final offsets are committed.
c.Close();
}
}
}

This code example produces the following results:


In this case, the topic has four partitions. The following events took place:
The producer sent 10 messages, each without a partition key.
The messages arrived at partitions in a random order.
A single consumer listened to all four partitions and received the messages out of order.
If the code had used two instances of the consumer, each instance would have subscribed to two of the four
partitions.
Distribute to a specific partition
This example involves error messages. Suppose certain applications need to process error messages, but all other
messages can go to a common consumer. In this case, the producer sends error messages to a specific partition.
Consumers who want to receive error messages listen to that partition. The following code shows how to
implement this scenario:

// Producer code.
var topicPartition = new TopicPartition(topic, partition);

...

p.Produce(topicPartition, new Message<Null, string> { Value = value });

// Consumer code.
// Subscribe to one partition.
c.Assign(new TopicPartition(topic, partition));

// Use this code to subscribe to a list of partitions.


c.Assign(new List<TopicPartition> {
new TopicPartition(topic, partition1),
new TopicPartition(topic, partition2)
});

As these results show, the producer sent all messages to partition 2, and the consumer only read messages from
partition 2:
In this scenario, if you add another consumer instance to listen to this topic, the pipeline won't assign any
partitions to it. The new consumer will starve until the existing consumer shuts down. The pipeline will then assign
a different, active consumer to read from the partition. But the pipeline will only make that assignment if the new
consumer isn't dedicated to another partition.
Preserve event order
This example involves bank transactions that a consumer needs to process in order. In this scenario, you can use
the customer ID of each event as the key. For the event value, use the details of the transaction. The following code
shows a simplified implementation of this case:

Producer code
// This code assigns the key an integer value. You can also assign it any other valid key value.
using (var p = new ProducerBuilder<int, string>(producerConfig).Build())
...
p.Produce(topic, new Message<int, string> { Key = i % 2, Value = value });

This code produces the following results:


As these results show, the producer only used two unique keys. The messages then went to only two partitions
instead of all four. The pipeline guarantees that messages with the same key go to the same partition.

Next steps
Use Azure Event Hubs from Apache Kafka applications
Apache Kafka developer guide for Azure Event Hubs
Quickstart: Data streaming with Event Hubs using the Kafka protocol
Send events to and receive events from Azure Event Hubs - .NET (Azure.Messaging.EventHubs)

Related resources
Balance partition load across multiple instances of your application
Dynamically add partitions to an event hub (Apache Kafka topic) in Azure Event Hubs
Availability and consistency in Event Hubs
Azure Event Hubs Event Processor client library for .NET
Effective strategies for Kafka topic partitioning
Confluent blog post: How to choose the number of topics/partitions in a Kafka cluster?
Predicting Length of Stay in Hospitals
11/2/2020 • 3 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

This solution enables a predictive model for Length of Stay for in-hospital admissions. Length of Stay (LOS) is
defined in number of days from the initial admit date to the date that the patient is discharged from any given
hospital facility.

Architecture

Download an SVG of this architecture.

Description
Required preliminary agreement: You need to accept the Terms of Use for the Data Science Virtual Machine on your
Azure Subscription before you deploy this VM the first time. Click here to agree to these terms.

Overview
This solution enables a predictive model for Length of Stay for in-hospital admissions. Length of Stay (LOS) is
defined in number of days from the initial admit date to the date that the patient is discharged from any given
hospital facility. There can be significant variation of LOS across various facilities and across disease conditions and
specialties even within the same healthcare system. Advanced LOS prediction at the time of admission can greatly
enhance the quality of care as well as operational workload efficiency and help with accurate planning for
discharges resulting in lowering of various other quality measures such as readmissions.

Business Perspective
There are two different business users in hospital management who can expect to benefit from more reliable
predictions of the Length of Stay. These are:
The Chief Medical Information Officer (CMIO), who straddles the divide between informatics/technology and
healthcare professionals in a healthcare organization. Their duties typically include using analytics to determine
if resources are being allocated appropriately in a hospital network. As part of this, the CMIO needs to be able to
determine which facilities are being overtaxed and, specifically, what resources at those facilities may need to be
bolstered to realign such resources with demand.
The Care Line Manager, who is directly involved with the care of patients. This role requires monitoring the
status of individual patients as well as ensuring that staff is available to meet the specific care requirements of
their patients. A Care Line Manager also needs to manage the discharge of their patients. The ability to predict
LOS of a patient enables Care Line Managers to determine if staff resources will be adequate to handle the
release of a patient.
Data Scientist Perspective
SQL Server R Services brings the computing resources to the data by running R on the computer that hosts the
database. It includes a database service that runs outside the SQL Server process and that communicates securely
with the R runtime.
This solution walks through the steps needed to create and refine data, train the R models, and perform scoring on
the SQL Server machine. The final scored database table in SQL Server gives the predicted LOS for each patient.
This data is then visualized in PowerBI. (Simulated data is used in this template to illustrate the feature.)
Data scientists who are testing and developing solutions can work conveniently from their preferred R IDE on their
client machine, while pushing the compute to the SQL Server machine. The completed solutions are deployed to
SQL Server 2016 by embedding calls to R in stored procedures. These solutions can then be further automated
with SQL Server Integration Services and SQL Server agent.
This solution includes the R code needed by a data scientist in the R folder, along with the stored procedures (.sql files) that can be deployed from the SQLR folder. A PowerShell script (.ps1 file) is also provided that automates the running
of the SQL code. Click on the Deploy button to test the automation and the entire solution will be made available in
your Azure subscription.

Pricing
Your Azure subscription used for the deployment will incur consumption charges on the services used in this
solution, approximately $1.15/hour for the default VM.
Please ensure that you stop your VM instance when not actively using the solution. Running the VM will incur
higher costs.
Please delete the solution if you are not using it.
Predictive Aircraft Engine Monitoring
11/13/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Microsoft Azure's Predictive Maintenance solution demonstrates how to combine real-time aircraft data with
analytics to monitor aircraft health.
This solution is built on the Azure managed services: Azure Stream Analytics, Event Hubs, Azure Machine Learning,
HDInsight, Azure SQL Database, Data Factory and Power BI. These services run in a high-availability environment,
patched and supported, allowing you to focus on your solution instead of the environment they run in.

Architecture

Download an SVG of this architecture.

Components
Azure Stream Analytics: Stream Analytics provides near real-time analytics on the input stream from the Azure
Event Hub. Input data is filtered and passed to a Machine Learning endpoint, finally sending the results to the
Power BI dashboard.
Event Hubs ingests raw assembly-line data and passes it on to Stream Analytics.
Azure Machine Learning: Machine Learning predicts potential failures based on real-time assembly-line data
from Stream Analytics.
HDInsight runs Hive scripts to provide aggregations on the raw events that were archived by Stream Analytics.
Azure SQL Database: SQL Database stores prediction results received from Machine Learning and publishes
data to Power BI.
Data Factory handles orchestration, scheduling, and monitoring of the batch processing pipeline.
Power BI visualizes real-time assembly-line data from Stream Analytics and the predicted failures and alerts from SQL Database.

Next steps
Learn more about Stream Analytics
Learn more about Event Hubs
Learn more about Azure Machine Learning
Learn more about HDInsight
Learn more about SQL Database
Learn more about Azure Data Factory
Learn more about Power BI
Real Time Analytics on Big Data Architecture
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Get insights from live streaming data with ease. Capture data continuously from any IoT device, or logs from
website clickstreams, and process it in near-real time.

Architecture

Download an SVG of this architecture.

Data Flow
1. Easily ingest live streaming data for an application using an Apache Kafka cluster in Azure HDInsight.
2. Bring together all your structured data in Azure Blob Storage using Azure Data Factory.
3. Take advantage of Azure Databricks to clean, transform, and analyze the streaming data, and combine it with
structured data from operational databases or data warehouses.
4. Use scalable machine learning/deep learning techniques, to derive deeper insights from this data using Python,
R or Scala, with inbuilt notebook experiences in Azure Databricks.
5. Leverage native connectors between Azure Databricks and Azure Synapse Analytics to access and move data at
scale.
6. Build analytical dashboards and embedded reports on top of Azure Synapse Analytics to share insights within your organization, and use Azure Analysis Services to serve this data to thousands of users.
7. Power users take advantage of the inbuilt capabilities of Azure Databricks and Azure HDInsight to perform root
cause determination and raw data analysis.
8. Take the insights from Azure Databricks to Cosmos DB to make them accessible through real time apps.
Components
Azure Synapse Analytics is the fast, flexible, and trusted cloud data warehouse that lets you scale compute and storage elastically and independently, with a massively parallel processing architecture.
Azure Data Factory is a hybrid data integration service that allows you to create, schedule and orchestrate your
ETL/ELT workflows.
Azure Data Lake Storage: Massively scalable, secure data lake functionality built on Azure Blob Storage
Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform.
Azure HDInsight is a fully managed, full spectrum open-source analytics service for popular open-source
frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, R & more.
Azure Cosmos DB is a globally distributed, multi-model database service that lets you replicate your data across any number of Azure regions and scale your throughput independently from your storage.
Azure Analysis Services is an enterprise grade analytics as a service that lets you govern, deploy, test, and
deliver your BI solution with confidence.
Power BI is a suite of business analytics tools that deliver insights throughout your organization. Connect to
hundreds of data sources, simplify data prep, and drive ad hoc analysis. Produce beautiful reports, then publish
them for your organization to consume on the web and across mobile devices.

Next steps
Synapse Analytics Documentation
Azure Data Factory V2 Preview Documentation
Azure Data Lake Storage documentation
Azure Data Explorer
Azure Databricks Documentation
Azure HDInsight Documentation
Azure Cosmos DB Documentation
Analysis Services Documentation
Power BI Documentation

Pricing Calculator
Customize and get pricing estimates
Stream processing pipeline with Azure Databricks
11/2/2020 • 13 minutes to read • Edit Online

This reference architecture shows an end-to-end stream processing pipeline. This type of pipeline has four stages:
ingest, process, store, and analysis and reporting. For this reference architecture, the pipeline ingests data from two
sources, performs a join on related records from each stream, enriches the result, and calculates an average in real
time. The results are stored for further analysis.

A reference implementation for this architecture is available on GitHub.

Scenario: A taxi company collects data about each taxi trip. For this scenario, we assume there are two separate
devices sending data. The taxi has a meter that sends information about each ride — the duration, distance, and
pickup and drop-off locations. A separate device accepts payments from customers and sends data about fares. To
spot ridership trends, the taxi company wants to calculate the average tip per mile driven, in real time, for each
neighborhood.

Architecture
The architecture consists of the following components.
Data sources. In this architecture, there are two data sources that generate data streams in real time. The first stream contains ride information, and the second contains fare information. The reference architecture includes a simulated data generator that reads from a set of static files and pushes the data to Event Hubs. The data sources in a real application would be devices installed in the taxi cabs.
Azure Event Hubs. Event Hubs is an event ingestion service. This architecture uses two event hub instances, one for each data source. Each data source sends a stream of data to the associated event hub.
Azure Databricks. Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Databricks is used to correlate the taxi ride and fare data, and also to enrich the correlated data with neighborhood data stored in the Databricks file system.
Cosmos DB. The output from the Azure Databricks job is a series of records, which are written to Cosmos DB using the Cassandra API. The Cassandra API is used because it supports time series data modeling.
Azure Log Analytics. Application log data collected by Azure Monitor is stored in a Log Analytics workspace. Log Analytics queries can be used to analyze and visualize metrics and inspect log messages to identify issues within the application.

Data ingestion
To simulate a data source, this reference architecture uses the New York City Taxi Data dataset[1]. This dataset
contains data about taxi trips in New York City over a four-year period (2010 – 2013). It contains two types of
record: Ride data and fare data. Ride data includes trip duration, trip distance, and pickup and drop-off location.
Fare data includes fare, tax, and tip amounts. Common fields in both record types include medallion number, hack
license, and vendor ID. Together these three fields uniquely identify a taxi plus a driver. The data is stored in CSV
format.
[1] Donovan, Brian; Work, Dan (2016): New York City Taxi Trip Data (2010-2013). University of Illinois at Urbana-
Champaign. https://doi.org/10.13012/J8PN93H8
The data generator is a .NET Core application that reads the records and sends them to Azure Event Hubs. The
generator sends ride data in JSON format and fare data in CSV format.
Event Hubs uses partitions to segment the data. Partitions allow a consumer to read each partition in parallel.
When you send data to Event Hubs, you can specify the partition key explicitly. Otherwise, records are assigned to
partitions in round-robin fashion.
In this scenario, ride data and fare data should end up with the same partition ID for a given taxi cab. This enables
Databricks to apply a degree of parallelism when it correlates the two streams. A record in partition n of the ride
data will match a record in partition n of the fare data.

In the data generator, the common data model for both record types has a PartitionKey property that is the
concatenation of Medallion , HackLicense , and VendorId .
public abstract class TaxiData
{
public TaxiData()
{
}

[JsonProperty]
public long Medallion { get; set; }

[JsonProperty]
public long HackLicense { get; set; }

[JsonProperty]
public string VendorId { get; set; }

[JsonProperty]
public DateTimeOffset PickupTime { get; set; }

[JsonIgnore]
public string PartitionKey
{
get => $"{Medallion}_{HackLicense}_{VendorId}";
}

This property is used to provide an explicit partition key when sending to Event Hubs:

using (var client = pool.GetObject())


{
return client.Value.SendAsync(new EventData(Encoding.UTF8.GetBytes(
t.GetData(dataFormat))), t.PartitionKey);
}

Event Hubs
The throughput capacity of Event Hubs is measured in throughput units. You can autoscale an event hub by
enabling auto-inflate, which automatically scales the throughput units based on traffic, up to a configured
maximum.

Stream processing
In Azure Databricks, data processing is performed by a job. The job is assigned to and runs on a cluster. The job can
either be custom code written in Java, or a Spark notebook.
In this reference architecture, the job is a Java archive with classes written in both Java and Scala. When you specify the Java archive for a Databricks job, you also specify the class that the Databricks cluster executes. Here, the main
method of the com.microsoft.pnp.TaxiCabReader class contains the data processing logic.
Reading the stream from the two event hub instances
The data processing logic uses Spark structured streaming to read from the two Azure event hub instances:
val rideEventHubOptions = EventHubsConf(rideEventHubConnectionString)
.setConsumerGroup(conf.taxiRideConsumerGroup())
.setStartingPosition(EventPosition.fromStartOfStream)
val rideEvents = spark.readStream
.format("eventhubs")
.options(rideEventHubOptions.toMap)
.load

val fareEventHubOptions = EventHubsConf(fareEventHubConnectionString)


.setConsumerGroup(conf.taxiFareConsumerGroup())
.setStartingPosition(EventPosition.fromStartOfStream)
val fareEvents = spark.readStream
.format("eventhubs")
.options(fareEventHubOptions.toMap)
.load

Enriching the data with the neighborhood information


The ride data includes the latitude and longitude coordinates of the pick up and drop off locations. While these
coordinates are useful, they are not easily consumed for analysis. Therefore, this data is enriched with
neighborhood data that is read from a shapefile.
The shapefile format is binary and not easily parsed, but the GeoTools library provides tools for geospatial data
that use the shapefile format. This library is used in the com.microsoft.pnp.GeoFinder class to determine the
neighborhood name based on the pick up and drop off coordinates.

val neighborhoodFinder = (lon: Double, lat: Double) => {


NeighborhoodFinder.getNeighborhood(lon, lat).get()
}

Joining the ride and fare data


First the ride and fare data is transformed:
val rides = transformedRides
.filter(r => {
if (r.isNullAt(r.fieldIndex("errorMessage"))) {
true
}
else {
malformedRides.add(1)
false
}
})
.select(
$"ride.*",
to_neighborhood($"ride.pickupLon", $"ride.pickupLat")
.as("pickupNeighborhood"),
to_neighborhood($"ride.dropoffLon", $"ride.dropoffLat")
.as("dropoffNeighborhood")
)
.withWatermark("pickupTime", conf.taxiRideWatermarkInterval())

val fares = transformedFares


.filter(r => {
if (r.isNullAt(r.fieldIndex("errorMessage"))) {
true
}
else {
malformedFares.add(1)
false
}
})
.select(
$"fare.*",
$"pickupTime"
)
.withWatermark("pickupTime", conf.taxiFareWatermarkInterval())

And then the ride data is joined with the fare data:

val mergedTaxiTrip = rides.join(fares, Seq("medallion", "hackLicense", "vendorId", "pickupTime"))

Processing the data and inserting into Cosmos DB


The average fare amount for each neighborhood is calculated for a given time interval:

val maxAvgFarePerNeighborhood = mergedTaxiTrip.selectExpr("medallion", "hackLicense", "vendorId",


"pickupTime", "rateCode", "storeAndForwardFlag", "dropoffTime", "passengerCount", "tripTimeInSeconds",
"tripDistanceInMiles", "pickupLon", "pickupLat", "dropoffLon", "dropoffLat", "paymentType", "fareAmount",
"surcharge", "mtaTax", "tipAmount", "tollsAmount", "totalAmount", "pickupNeighborhood", "dropoffNeighborhood")
.groupBy(window($"pickupTime", conf.windowInterval()), $"pickupNeighborhood")
.agg(
count("*").as("rideCount"),
sum($"fareAmount").as("totalFareAmount"),
sum($"tipAmount").as("totalTipAmount")
)
.select($"window.start", $"window.end", $"pickupNeighborhood", $"rideCount", $"totalFareAmount",
$"totalTipAmount")

The result is then inserted into Cosmos DB:


maxAvgFarePerNeighborhood
.writeStream
.queryName("maxAvgFarePerNeighborhood_cassandra_insert")
.outputMode(OutputMode.Append())
.foreach(new CassandraSinkForeach(connector))
.start()
.awaitTermination()

Security considerations
Access to the Azure Databricks workspace is controlled using the administrator console. The administrator console
includes functionality to add users, manage user permissions, and set up single sign-on. Access control for
workspaces, clusters, jobs, and tables can also be set through the administrator console.
Managing secrets
Azure Databricks includes a secret store that is used to store secrets, including connection strings, access keys, user
names, and passwords. Secrets within the Azure Databricks secret store are partitioned by scopes:

databricks secrets create-scope --scope "azure-databricks-job"

Secrets are added at the scope level:

databricks secrets put --scope "azure-databricks-job" --key "taxi-ride"

NOTE
An Azure Key Vault-backed scope can be used instead of the native Azure Databricks scope. To learn more, see Azure Key
Vault-backed scopes.

In code, secrets are accessed via the Azure Databricks secrets utilities.

Monitoring considerations
Azure Databricks is based on Apache Spark, and both use log4j as the standard library for logging. In addition to
the default logging provided by Apache Spark, this reference architecture sends logs and metrics to Azure Log
Analytics.
The com.microsoft.pnp.TaxiCabReader class configures the Apache Spark logging system to send its logs to
Azure Log Analytics using the values in the log4j.properties file. While the Apache Spark logger messages are
strings, Azure Log Analytics requires log messages to be formatted as JSON. The
com.microsoft.pnp.log4j.LogAnalyticsAppender class transforms these messages to JSON:
@Override
protected void append(LoggingEvent loggingEvent) {
if (this.layout == null) {
this.setLayout(new JSONLayout());
}

String json = this.getLayout().format(loggingEvent);


try {
this.client.send(json, this.logType);
} catch(IOException ioe) {
LogLog.warn("Error sending LoggingEvent to Log Analytics", ioe);
}
}

As the com.microsoft.pnp.TaxiCabReader class processes ride and fare messages, it's possible that either one
may be malformed and therefore not valid. In a production environment, it's important to analyze these
malformed messages to identify a problem with the data sources so it can be fixed quickly to prevent data loss. The
com.microsoft.pnp.TaxiCabReader class registers an Apache Spark Accumulator that keeps track of the number
of malformed fare and ride records:

@transient val appMetrics = new AppMetrics(spark.sparkContext)


appMetrics.registerGauge("metrics.malformedrides", AppAccumulators.getRideInstance(spark.sparkContext))
appMetrics.registerGauge("metrics.malformedfares", AppAccumulators.getFareInstance(spark.sparkContext))
SparkEnv.get.metricsSystem.registerSource(appMetrics)

Apache Spark uses the Dropwizard library to send metrics, and some of the native Dropwizard metrics fields are
incompatible with Azure Log Analytics. Therefore, this reference architecture includes a custom Dropwizard sink
and reporter. It formats the metrics in the format expected by Azure Log Analytics. When Apache Spark reports
metrics, the custom metrics for the malformed ride and fare data are also sent.
The last metric to be logged to the Azure Log Analytics workspace is the cumulative progress of the Spark
Structured Streaming job progress. This is done using a custom StreamingQuery listener implemented in the
com.microsoft.pnp.StreamingMetricsListener class. This class is registered to the Apache Spark Session when
the job runs:

spark.streams.addListener(new StreamingMetricsListener())

The methods in the StreamingMetricsListener are called by the Apache Spark runtime whenever a structured streaming event occurs, sending log messages and metrics to the Azure Log Analytics workspace. You can use the
following queries in your workspace to monitor the application:
Latency and throughput for streaming queries

taxijob_CL
| where TimeGenerated > startofday(datetime(<date>)) and TimeGenerated < endofday(datetime(<date>))
| project mdc_inputRowsPerSecond_d, mdc_durationms_triggerExecution_d
| render timechart

Exceptions logged during stream query execution

taxijob_CL
| where TimeGenerated > startofday(datetime(<date>)) and TimeGenerated < endofday(datetime(<date>))
| where Level contains "Error"
Accumulation of malformed fare and ride data

SparkMetric_CL
| where TimeGenerated > startofday(datetime(<date>)) and TimeGenerated < endofday(datetime(<date>))
| where name_s contains "metrics.malformedrides"
| render timechart

SparkMetric_CL
| where TimeGenerated > startofday(datetime(<date>)) and TimeGenerated < endofday(datetime(<date>))
| where name_s contains "metrics.malformedfares"
| render timechart

Job execution to trace resiliency

SparkMetric_CL
| where TimeGenerated > startofday(datetime(<date>)) and TimeGenerated < endofday(datetime(<date>))
| where name_s contains "driver.DAGScheduler.job.allJobs"
| render timechart

For more information, see Monitoring Azure Databricks.

DevOps considerations
Create separate resource groups for production, development, and test environments. Separate resource
groups make it easier to manage deployments, delete test deployments, and assign access rights.
Use Azure Resource Manager template to deploy the Azure resources following the infrastructure as Code
(IaC) Process. With templates, automating deployments using Azure DevOps Services, or other CI/CD
solutions is easier.
Put each workload in a separate deployment template and store the resources in source control systems.
You can deploy the templates together or individually as part of a CI/CD process, making the automation
process easier.
In this architecture, Azure Event Hubs, Log Analytics, and Cosmos DB are identified as a single workload.
These resources are included in a single ARM template.
Consider staging your workloads. Deploy to various stages and run validation checks at each stage before
moving to the next stage. That way you can push updates to your production environments in a highly
controlled way and minimize unanticipated deployment issues.
In this architecture there are multiple deployment stages. Consider creating an Azure DevOps Pipeline and
adding those stages. Here are some examples of stages that you can automate:
Start a Databricks Cluster
Configure Databricks CLI
Install Scala Tools
Add the Databricks secrets
Also, consider writing automated integration tests to improve the quality and the reliability of the Databricks
code and its life cycle.
Consider using Azure Monitor to analyze the performance of your stream processing pipeline. For more
information, see Monitoring Azure Databricks.
For more information, see the DevOps section in Microsoft Azure Well-Architected Framework.

Cost considerations
Use the Azure pricing calculator to estimate costs. Here are some considerations for services used in this reference
architecture.
Event Hubs
This reference architecture deploys Event Hubs in the Standard tier. The pricing model is based on throughput units, ingress events, and capture events. An ingress event is a unit of data of 64 KB or less; larger messages are billed in multiples of 64 KB. You specify throughput units either through the Azure portal or the Event Hubs management APIs.
If you need more retention days, consider the Dedicated tier. This tier offers single-tenant deployments for the most demanding requirements and builds a cluster based on capacity units (CUs) that is not bound by throughput units, whereas the Standard tier is billed per ingress event and throughput unit.
For more information, see Event Hubs pricing.
Azure Databricks
Azure Databricks offers two tiers, Standard and Premium, each of which supports three workloads. This reference architecture deploys the Azure Databricks workspace in the Premium tier.
The Data Engineering and Data Engineering Light workloads are for data engineers to build and execute jobs. The Data Analytics workload is intended for data scientists to explore, visualize, manipulate, and share data and insights interactively.
Azure Databricks offers several pricing plans.
Pay-as-you-go plan
You are billed for the virtual machines (VMs) provisioned in clusters and for Databricks Units (DBUs), based on the VM instance selected. A DBU is a unit of processing capability, billed on per-second usage. DBU consumption depends on the size and type of instance running Azure Databricks. Pricing also depends on the selected workload and tier.
Pre-purchase plan
You commit to Azure Databricks Units (DBU) as Databricks Commit Units (DBCU) for either one or three
years. When compared to the pay-as-you-go model, you can save up to 37%.
For more information, see Azure Databricks Pricing.
Azure Cosmos DB
In this architecture, a series of records is written to Cosmos DB by the Azure Databricks job. You are charged for the capacity that you reserve, expressed in Request Units per second (RU/s), which is used to perform the insert operations. The unit for billing is 100 RU/s per hour. For example, writing a single 100-KB item consumes approximately 50 RUs.
For write operations, provision enough capacity to support the number of writes needed per second. You can increase the provisioned throughput by using the portal or the Azure CLI before performing write operations, and then reduce the throughput after those operations are complete. Your throughput for the write period should be the minimum throughput needed for the given data plus the throughput required for the insert operations, assuming no other workload is running.
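
As a rough sketch, the same throughput adjustment can also be made programmatically. The following example assumes the azure-cosmos Java SDK called from Scala, with hypothetical endpoint, key, database, and container names; it is illustrative only and not part of the reference implementation.

import com.azure.cosmos.CosmosClientBuilder
import com.azure.cosmos.models.ThroughputProperties

// Hypothetical configuration values supplied via environment variables.
val client = new CosmosClientBuilder()
  .endpoint(sys.env("COSMOS_ENDPOINT"))
  .key(sys.env("COSMOS_KEY"))
  .buildClient()

val container = client.getDatabase("taxi").getContainer("rides") // hypothetical names

// Raise provisioned throughput before the write-heavy period...
container.replaceThroughput(ThroughputProperties.createManualThroughput(10000))

// ...run the insert workload, then scale back down to the baseline afterward.
container.replaceThroughput(ThroughputProperties.createManualThroughput(1000))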
Example cost analysis
Suppose you configure a throughput value of 1,000 RU/sec on a container. It's deployed for 24 hours for 30 days, a
total of 720 hours.
The container is billed at 10 units of 100 RU/s for each hour. At $0.008 per 100 RU/s per hour, those 10 units cost $0.08 per hour.
For 720 hours, or 7,200 units (of 100 RU/s), you are billed $57.60 for the month.
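
Written out as simple arithmetic (using the example rate of $0.008 per 100 RU/s per hour from above; actual rates vary by region), the calculation is:

val ratePer100RuPerHour = 0.008        // example rate used above; check current pricing
val provisionedRuPerSec = 1000
val hours = 24 * 30                    // 720 hours in the month

val billableUnits = provisionedRuPerSec / 100            // 10 units of 100 RU/s
val hourlyCost = billableUnits * ratePer100RuPerHour     // 10 * $0.008 = $0.08 per hour
val monthlyCost = hourlyCost * hours                     // $0.08 * 720 = $57.60 per month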
Storage is also billed for each GB used by your stored data and index. For more information, see the Cosmos DB pricing model.
Use the Cosmos DB capacity calculator to get a quick estimate of the workload cost.
For more information, see the cost section in Microsoft Azure Well-Architected Framework.

Deploy the solution


To deploy and run the reference implementation, follow the steps in the GitHub readme.
Stream processing pipeline with Azure Stream
Analytics
11/2/2020 • 10 minutes to read • Edit Online

This reference architecture shows an end-to-end stream processing pipeline. The pipeline ingests data from two
sources, correlates records in the two streams, and calculates a rolling average across a time window. The results
are stored for further analysis.

A reference implementation for this architecture is available on GitHub.

Scenario: A taxi company collects data about each taxi trip. For this scenario, we assume there are two separate
devices sending data. The taxi has a meter that sends information about each ride — the duration, distance, and
pickup and dropoff locations. A separate device accepts payments from customers and sends data about fares. The
taxi company wants to calculate the average tip per mile driven, in real time, in order to spot trends.

Architecture
The architecture consists of the following components.
Data sources. In this architecture, there are two data sources that generate data streams in real time. The first
stream contains ride information, and the second contains fare information. The reference architecture includes a
simulated data generator that reads from a set of static files and pushes the data to Event Hubs. In a real
application, the data sources would be devices installed in the taxi cabs.
Azure Event Hubs. Event Hubs is an event ingestion service. This architecture uses two event hub instances, one
for each data source. Each data source sends a stream of data to the associated event hub.
Azure Stream Analytics. Stream Analytics is an event-processing engine. A Stream Analytics job reads the data
streams from the two event hubs and performs stream processing.
Cosmos DB. The output from the Stream Analytics job is a series of records, which are written as JSON
documents to a Cosmos DB document database.
Microsoft Power BI. Power BI is a suite of business analytics tools to analyze data for business insights. In this
architecture, it loads the data from Cosmos DB. This allows users to analyze the complete set of historical data
that's been collected. You could also stream the results directly from Stream Analytics to Power BI for a real-time
view of the data. For more information, see Real-time streaming in Power BI.
Azure Monitor. Azure Monitor collects performance metrics about the Azure services deployed in the solution. By
visualizing these in a dashboard, you can get insights into the health of the solution.

Data ingestion
To simulate a data source, this reference architecture uses the New York City Taxi Data dataset[1]. This dataset
contains data about taxi trips in New York City over a four-year period (2010–2013). It contains two types of record:
ride data and fare data. Ride data includes trip duration, trip distance, and pickup and dropoff location. Fare data
includes fare, tax, and tip amounts. Common fields in both record types include medallion number, hack license,
and vendor ID. Together these three fields uniquely identify a taxi plus a driver. The data is stored in CSV format.
[1] Donovan, Brian; Work, Dan (2016): New York City Taxi Trip Data (2010-2013). University of Illinois at Urbana-
Champaign. https://doi.org/10.13012/J8PN93H8
The data generator is a .NET Core application that reads the records and sends them to Azure Event Hubs. The
generator sends ride data in JSON format and fare data in CSV format.
Event Hubs uses partitions to segment the data. Partitions allow a consumer to read each partition in parallel.
When you send data to Event Hubs, you can specify the partition key explicitly. Otherwise, records are assigned to
partitions in round-robin fashion.
In this particular scenario, ride data and fare data should end up with the same partition ID for a given taxi cab. This
enables Stream Analytics to apply a degree of parallelism when it correlates the two streams. A record in partition
n of the ride data will match a record in partition n of the fare data.

In the data generator, the common data model for both record types has a PartitionKey property, which is the concatenation of Medallion, HackLicense, and VendorId.
public abstract class TaxiData
{
public TaxiData()
{
}

[JsonProperty]
public long Medallion { get; set; }

[JsonProperty]
public long HackLicense { get; set; }

[JsonProperty]
public string VendorId { get; set; }

[JsonProperty]
public DateTimeOffset PickupTime { get; set; }

[JsonIgnore]
public string PartitionKey
{
get => $"{Medallion}_{HackLicense}_{VendorId}";
}
}

This property is used to provide an explicit partition key when sending to Event Hubs:

using (var client = pool.GetObject())
{
return client.Value.SendAsync(new EventData(Encoding.UTF8.GetBytes(
t.GetData(dataFormat))), t.PartitionKey);
}

Stream processing
The stream processing job is defined using a SQL query with several distinct steps. The first two steps simply select
records from the two input streams.

WITH
Step1 AS (
SELECT PartitionId,
TRY_CAST(Medallion AS nvarchar(max)) AS Medallion,
TRY_CAST(HackLicense AS nvarchar(max)) AS HackLicense,
VendorId,
TRY_CAST(PickupTime AS datetime) AS PickupTime,
TripDistanceInMiles
FROM [TaxiRide] PARTITION BY PartitionId
),
Step2 AS (
SELECT PartitionId,
medallion AS Medallion,
hack_license AS HackLicense,
vendor_id AS VendorId,
TRY_CAST(pickup_datetime AS datetime) AS PickupTime,
tip_amount AS TipAmount
FROM [TaxiFare] PARTITION BY PartitionId
),

The next step joins the two input streams to select matching records from each stream.
Step3 AS (
SELECT tr.TripDistanceInMiles,
tf.TipAmount
FROM [Step1] tr
PARTITION BY PartitionId
JOIN [Step2] tf PARTITION BY PartitionId
ON tr.PartitionId = tf.PartitionId
AND tr.PickupTime = tf.PickupTime
AND DATEDIFF(minute, tr, tf) BETWEEN 0 AND 15
)

This query joins records on a set of fields that uniquely identify matching records (PartitionId and PickupTime).

NOTE
We want the TaxiRide and TaxiFare streams to be joined by the unique combination of Medallion, HackLicense, VendorId, and PickupTime. In this case, the PartitionId covers the Medallion, HackLicense, and VendorId fields, but this should not be assumed to be true in general.

In Stream Analytics, joins are temporal, meaning records are joined within a particular window of time. Otherwise,
the job might need to wait indefinitely for a match. The DATEDIFF function specifies how far two matching records
can be separated in time for a match.
The last step in the job computes the average tip per mile, grouped by a hopping window of 5 minutes.

SELECT System.Timestamp AS WindowTime,
SUM(tr.TipAmount) / SUM(tr.TripDistanceInMiles) AS AverageTipPerMile
INTO [TaxiDrain]
FROM [Step3] tr
GROUP BY HoppingWindow(Duration(minute, 5), Hop(minute, 1))

Stream Analytics provides several windowing functions. A hopping window moves forward in time by a fixed
period, in this case 1 minute per hop. The result is to calculate a moving average over the past 5 minutes.
In the architecture shown here, only the results of the Stream Analytics job are saved to Cosmos DB. For a big data
scenario, consider also using Event Hubs Capture to save the raw event data into Azure Blob storage. Keeping the
raw data will allow you to run batch queries over your historical data at a later time, in order to derive new insights
from the data.

Scalability considerations
Event Hubs
The throughput capacity of Event Hubs is measured in throughput units. You can autoscale an event hub by
enabling auto-inflate, which automatically scales the throughput units based on traffic, up to a configured
maximum.
Stream Analytics
For Stream Analytics, the computing resources allocated to a job are measured in Streaming Units. Stream
Analytics jobs scale best if the job can be parallelized. That way, Stream Analytics can distribute the job across
multiple compute nodes.
For Event Hubs input, use the PARTITION BY keyword to partition the Stream Analytics job. The data will be divided
into subsets based on the Event Hubs partitions.
Windowing functions and temporal joins require additional SU. When possible, use PARTITION BY so that each
partition is processed separately. For more information, see Understand and adjust Streaming Units.
If it's not possible to parallelize the entire Stream Analytics job, try to break the job into multiple steps, starting with
one or more parallel steps. That way, the first steps can run in parallel. For example, in this reference architecture:
Steps 1 and 2 are simple SELECT statements that select records within a single partition.
Step 3 performs a partitioned join across two input streams. This step takes advantage of the fact that matching
records share the same partition key, and so are guaranteed to have the same partition ID in each input stream.
Step 4 aggregates across all of the partitions. This step cannot be parallelized.
Use the Stream Analytics job diagram to see how many partitions are assigned to each step in the job. The
following diagram shows the job diagram for this reference architecture:

Cosmos DB
Throughput capacity for Cosmos DB is measured in Request Units (RU). In order to scale a Cosmos DB container
past 10,000 RU, you must specify a partition key when you create the container, and include the partition key in
every document.
In this reference architecture, new documents are created only once per minute (the hopping window interval), so
the throughput requirements are quite low. For that reason, there's no need to assign a partition key in this
scenario.
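
If your throughput requirements were higher, you would instead create the container with a partition key up front. As a sketch only, assuming the azure-cosmos Java SDK called from Scala and hypothetical account, database, and container names:

import com.azure.cosmos.CosmosClientBuilder
import com.azure.cosmos.models.{CosmosContainerProperties, ThroughputProperties}

// Hypothetical configuration values supplied via environment variables.
val client = new CosmosClientBuilder()
  .endpoint(sys.env("COSMOS_ENDPOINT"))
  .key(sys.env("COSMOS_KEY"))
  .buildClient()

// The chosen partition key path ("/partitionKey" here) must be present in every document.
val containerProperties = new CosmosContainerProperties("taxirecords", "/partitionKey")
client.getDatabase("taxi").createContainerIfNotExists(
  containerProperties,
  ThroughputProperties.createManualThroughput(12000))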

Monitoring considerations
With any stream processing solution, it's important to monitor the performance and health of the system. Azure
Monitor collects metrics and diagnostics logs for the Azure services used in the architecture. Azure Monitor is built
into the Azure platform and does not require any additional code in your application.
Any of the following warning signals indicate that you should scale out the relevant Azure resource:
Event Hubs throttles requests or is close to the daily message quota.
The Stream Analytics job consistently uses more than 80% of allocated Streaming Units (SU).
Cosmos DB begins to throttle requests.
The reference architecture includes a custom dashboard, which is deployed to the Azure portal. After you deploy
the architecture, you can view the dashboard by opening the Azure portal and selecting TaxiRidesDashboard from the list of dashboards. For more information about creating and deploying custom dashboards in the Azure portal, see
Programmatically create Azure Dashboards.
The following image shows the dashboard after the Stream Analytics job ran for about an hour.
The panel on the lower left shows that the SU consumption for the Stream Analytics job climbs during the first 15
minutes and then levels off. This is a typical pattern as the job reaches a steady state.
Notice that Event Hubs is throttling requests, shown in the upper right panel. An occasional throttled request is not
a problem, because the Event Hubs client SDK automatically retries when it receives a throttling error. However, if
you see consistent throttling errors, it means the event hub needs more throughput units. The following graph
shows a test run using the Event Hubs auto-inflate feature, which automatically scales out the throughput units as
needed.

Auto-inflate was enabled at about the 06:35 mark. You can see the drop in throttled requests, as Event Hubs
automatically scaled up to 3 throughput units.
Interestingly, this had the side effect of increasing the SU utilization in the Stream Analytics job. By throttling, Event
Hubs was artificially reducing the ingestion rate for the Stream Analytics job. It's actually common that resolving
one performance bottleneck reveals another. In this case, allocating additional SU for the Stream Analytics job
resolved the issue.

Cost considerations
Use the Azure pricing calculator to estimate costs. Here are some considerations for services used in this reference
architecture.
Azure Stream Analytics
Azure Stream Analytics is priced by the number of streaming units ($0.11/hour) required to process the data.
Stream Analytics can be expensive if you are not processing data in real time or are processing only small amounts of data. For those use cases, consider using Azure Functions or Logic Apps to move data from Azure Event Hubs to a data store.
Azure Event Hubs and Azure Cosmos DB
For cost considerations about Azure Event Hubs and Cosmos DB, see the cost considerations in the Stream processing with Azure Databricks reference architecture.

Deploy the solution


To deploy and run the reference implementation, follow the steps in the GitHub readme.

DevOps considerations
Create separate resource groups for production, development, and test environments. Separate resource
groups make it easier to manage deployments, delete test deployments, and assign access rights.
Use Azure Resource Manager templates to deploy the Azure resources following the infrastructure as code (IaC) process. With templates, it is easier to automate deployments using Azure DevOps Services or other CI/CD solutions.
Put each workload in a separate deployment template and store the resources in source control systems.
You can deploy the templates together or individually as part of a CI/CD process, making the automation
process easier.
In this architecture, Azure Event Hubs, Log Analytics, and Cosmos DB are identified as a single workload.
These resources are included in a single ARM template.
Consider staging your workloads. Deploy to various stages and run validation checks at each stage before
moving to the next stage. That way you can push updates to your production environments in a highly
controlled way and minimize unanticipated deployment issues.
Consider using Azure Monitor to analyze the performance of your stream processing pipeline. For more
information, see Monitoring Azure Databricks.
For more information, see the DevOps section in Microsoft Azure Well-Architected Framework.

Related resources
You may wish to review the following Azure example scenarios that demonstrate specific solutions using some of
the same technologies:
IoT and data analytics in the construction industry
Real-time fraud detection
Tier Applications & Data for Analytics
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Easily tier data and applications on-premises and in Azure with an architecture that supports greater efficiency in applications. Filter unnecessary data early in the process, easily bring cloud applications close to the data on-premises, and analyze large-scale aggregate data from multiple locations in Azure for fleet-level insights.

Architecture

Download an SVG of this architecture.

Data Flow
1. Data flows into a storage account.
2. Function on Azure Stack analyzes the data for anomalies or compliance.
3. Locally-relevant insights are displayed on the Azure Stack app.
4. Insights and anomalies are placed into a queue.
5. The bulk of the data is placed into an archive storage account.
6. Function sends data from queue to Azure Storage.
7. Globally-relevant and compliant insights are available in the global app.

Components
Storage: Durable, highly available, and massively scalable cloud storage
Azure Functions: Process events with serverless code
Azure Stack: Build and run innovative hybrid applications across cloud boundaries

Next steps
Storage documentation
Azure Functions documentation
Azure Stack documentation
Blockchain Workflow Application
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Businesses use blockchain to digitize workflows they share with other organizations, such as moving physical
assets across supply chains. The anatomy of blockchain apps is similar across use cases. Here, we use Azure
Blockchain Service as the foundational managed blockchain network and build a consortium application that can
ingest signals from relevant user interfaces and communicate ledger data to consuming apps across the
consortium.

Architecture

Download an SVG of this architecture.

Data Flow
1. Relevant apps, devices, and data sources send events or data to a message broker (Azure Service Bus).
2. The distributed ledger technology (DLT) consumer Logic App fetches the data from the Service Bus and sends it to the transaction builder, which builds and signs the transaction.
3. The signed transaction gets routed to Azure Blockchain Service (fully managed Ethereum consortium network)
via a ledger-specific Logic App connector.
4. The blockchain data manager captures block and transaction data from configured transaction nodes, decodes
events and properties and then sends the data to configured destinations.
5. Message broker sends ledger data to consuming business applications and off-chain database.
6. Information is analyzed and visualized using tools such as Power BI by connecting to off-chain database.
Components
Azure Blockchain Service: Build, govern, and expand consortium blockchain networks
Service Bus: Connect across private and public cloud environments
Azure IoT Central: Accelerate the creation of IoT solutions
Application Insights: Detect, triage, and diagnose issues in your web apps and services
Event Grid: Get reliable event delivery at massive scale
Logic Apps: Automate the access and use of data across clouds without writing code
Azure SQL Database: Managed, intelligent SQL in the cloud
Azure Active Directory: Synchronize on-premises directories and enable single sign-on
Key Vault: Safeguard and maintain control of keys and other secrets
App Service: Quickly create powerful cloud apps for web and mobile
Virtual Network: Provision private networks, optionally connect to on-premises datacenters
Power BI Embedded: Embed fully interactive, stunning data visualizations in your applications

Next steps
Azure Blockchain documentation
Service Bus documentation
Azure IoT Central documentation
Application Insights documentation
Event Grid Documentation
Logic Apps Documentation
Azure SQL Database Documentation
Azure Active Directory Documentation
Key Vault documentation
App Service Documentation
Virtual Network Documentation
Power BI Documentation
Decentralized trust between banks on Azure
11/2/2020 • 5 minutes to read • Edit Online

This example scenario is useful for banks or any other institutions that want to establish a trusted environment for
information sharing without resorting to a centralized database. For the purpose of this example, we will describe
the scenario in the context of maintaining credit score information between banks, but the architecture can be applied to any scenario where a consortium of organizations wants to share validated information with one another without resorting to the use of a central system run by a single party.
Traditionally, banks within a financial system rely on centralized sources such as credit bureaus for information on
an individual's credit score and history. A centralized approach presents a concentration of operational risk and
sometimes an unnecessary third party.
With DLTs (distributed ledger technology), a consortium of banks can establish a decentralized system that can be
more efficient, less susceptible to attack, and serve as a new platform where innovative structures can be
implemented to solve traditional challenges with privacy, speed, and cost.
This example will show you how Azure services such as virtual machine scale sets, Virtual Network, Key Vault,
Storage, Load Balancer, and Monitor can be quickly provisioned for the deployment of an efficient private Ethereum
PoA blockchain where member banks can establish their own nodes.

Relevant use cases


Other relevant use cases include:
Movement of allocated budgets between different business units of a multinational corporation
Cross-border payments
Trade finance scenarios
Loyalty systems involving different companies
Supply chain ecosystems

Architecture
This scenario covers the back-end components that are necessary to create a scalable, secure, and monitored
private, enterprise blockchain network within a consortium of two or more members. Details of how these
components are provisioned (that is, within different subscriptions and resource groups) as well as the connectivity
requirements (that is, VPN or ExpressRoute) are left for your consideration based on your organization's policy
requirements. Here's how data flows:
1. Bank A creates/updates an individual's credit record by sending a transaction to the blockchain network via
JSON-RPC.
2. Data flows from Bank A's private application server to the Azure load balancer and subsequently to a validating
node VM on the virtual machine scale set.
3. The Ethereum PoA network creates a block at a preset time (2 seconds for this scenario).
4. The transaction is bundled into the created block and validated across the blockchain network.
5. Bank B can read the credit record created by bank A by communicating with its own node similarly via JSON-
RPC.
Components
Virtual machines within virtual machine scale sets provide the on-demand compute facility to host the validator processes for the blockchain
Key Vault is used as the secure storage facility for the private keys of each validator
Load Balancer spreads the RPC, peering, and Governance DApp requests
Storage hosts persistent network information and coordinates leasing
Operations Management Suite (a bundling of a few Azure services) provides insight into available nodes,
transactions per minute and consortium members
Alternatives
The Ethereum PoA approach is chosen for this example because it is a good entry point for a consortium of
organizations that want to create an environment where information can be exchanged and shared with one
another easily in a trusted, decentralized, and easy to understand way. The available Azure solution templates also
provide a fast and convenient way not just for a consortium leader to start an Ethereum PoA blockchain, but also
for member organizations in the consortium to spin up their own Azure resources within their own resource group
and subscription to join an existing network.
For other extended or different scenarios, concerns such as transaction privacy may arise. For example, in a
securities transfer scenario, members in a consortium may not want their transactions to be visible even to other
members. Other alternatives to Ethereum PoA exist that addresses these concerns in their own way:
Corda
Quorum
Hyperledger

Considerations
Availability
Azure Monitor is used to continuously monitor the blockchain network for issues to ensure availability. A link to a
custom monitoring dashboard based on Azure Monitor will be sent to you on successful deployment of the
blockchain solution template used in this scenario. The dashboard shows nodes that are reporting heartbeats in the
past 30 minutes as well as other useful statistics.
Scalability
A popular concern for blockchain is the number of transactions that a blockchain can include within a preset
amount of time. This scenario uses Proof-of-Authority, where such scalability can be managed better than with Proof-of-Work. In Proof-of-Authority–based networks, consensus participants are known and managed, making it more suitable for a private blockchain for a consortium of organizations that know one another. Parameters such as average block time, transactions per minute, and compute resource consumption can be easily monitored via the custom dashboard. Resources can then be adjusted accordingly based on scale requirements.
For general guidance on designing scalable solutions, see the performance efficiency checklist in the Azure
Architecture Center.
Security
Azure Key Vault is used to easily store and manage the private keys of validators. The default deployment in this
example creates a blockchain network that is accessible via the internet. For production scenarios where a private
network is desired, members can be connected to each other via VNet-to-VNet VPN gateway connections. The
steps for configuring a VPN are included in the related resources section below.
For general guidance on designing secure solutions, see the Azure Security Documentation.
Resiliency
The Ethereum PoA blockchain can itself provide some degree of resilience as the validator nodes can be deployed
in different regions. Azure has options for deployments in over 54 regions worldwide. A blockchain such as the one
in this scenario provides unique and refreshing possibilities of cooperation to increase resilience. The resilience of the network is provided not just by a single centralized party but by all members of the consortium. A proof-of-authority–based blockchain allows network resilience to be even more planned and deliberate.
For general guidance on designing resilient solutions, see Designing reliable Azure applications.

Pricing
To explore the cost of running this scenario, all of the services are pre-configured in the cost calculator. To see how
the pricing would change for your particular use case, change the appropriate variables to match your expected
performance and availability requirements.
We have provided three sample cost profiles based on the number of scale set VM instances that run your
applications (the instances can reside in different regions).
Small: this pricing example correlates to 2 VMs per month with monitoring turned off
Medium: this pricing example correlates to 7 VMs per month with monitoring turned on
Large: this pricing example correlates to 15 VMs per month with monitoring turned on
The above pricing is for one consortium member to start or join a blockchain network. Typically in a consortium
where there are multiple companies or organizations involved, each member will get their own Azure subscription.

Next steps
To see an example of this scenario, deploy the Ethereum PoA blockchain demo application on Azure. Then review
the README of the scenario source code.

Related resources
For more information on using the Ethereum Proof-of-Authority solution template for Azure, review this usage
guide.
Supply Chain Track and Trace
11/2/2020 • 3 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

A common blockchain pattern is IoT-enabled monitoring of an asset as it moves along a multi-party supply chain. A
great example of this pattern is the refrigerated transportation of perishable goods like food or pharmaceuticals
where certain compliance rules must be met throughout the duration of the transportation process. In this
scenario, an initiating counterparty (such as a retailer) specifies contractual conditions, such as a required humidity
and temperature range, that the custodians on the supply chain must adhere to. At any point, if the device takes a
temperature or humidity measurement that is out of range, the smart contract state will be updated to indicate that
it's out of compliance, recording a transaction on the blockchain and triggering remediating events downstream.

Architecture

Download an SVG of this architecture.



Data Flow
1. IoT devices communicate with IoT Hub. IoT Hub has a route configured that sends specific messages to a Service Bus associated with that route. The message is still in the native format for the device and needs to be translated to the format used by Azure Blockchain Workbench. An Azure Logic App performs that transformation. It is triggered when a new message is added to the Service Bus associated with the IoT hub; it then transforms the message and delivers it to the Service Bus used to deliver messages to Azure Blockchain Workbench. The first Service Bus effectively serves as an "Outbox" for IoT Hub and the second one serves as an "Inbox" for Azure Blockchain Workbench.
2. DLT Consumer fetches the data from the message broker (Service Bus) and sends data to Transaction Builder -
Signer.
3. Transaction Builder builds and signs the transaction.
4. The signed transaction gets routed to the Blockchain (Private Ethereum Consortium Network).
5. DLT Watcher gets confirmation of the transaction commitment to the Blockchain and sends the confirmation to
the message broker (Service Bus).
6. DB consumers send confirmed blockchain transactions to off-chain databases (Azure SQL Database).
7. Information is analyzed and visualized using tools such as Power BI by connecting to the off-chain database (Azure SQL Database).
8. Events from the ledger are delivered to Event Grid and Service Bus for use by downstream consumers. Examples
of "downstream consumers" include logic apps, functions or other code that is designed to take action on the
events. For example, an Azure Function could receive an event and then place that in a datastore such as SQL
Server.

Components
Application Insights: Detect issues, diagnose crashes, and track usage in your web app with Application Insights.
Make informed decisions throughout the development lifecycle.
Web Apps: Quickly create and deploy mission critical web apps at scale
Storage Accounts: Durable, highly available, and massively scalable cloud storage
Virtual Machines: Provision virtual machines for Ubuntu, Red Hat, and more
Azure Active Directory: Synchronize on-premises directories and enable single sign-on
Azure SQL Database is a relational database service that lets you rapidly create, extend, and scale relational
applications into the cloud.
Azure Monitor: Highly granular and real-time monitoring data for any Azure resource.
Service Bus: Connect across private and public cloud environments
Event Grid: Get reliable event delivery at massive scale

Next steps
Find run-time exceptions with Application Insights
Create a blockchain app with Azure Blockchain Workbench
Azure Storage on Blockchain Workbench
Azure and Linux Virtual Machines
Blockchain Workbench API app registration
Blockchain Workbench Database
Log Analytics Tutorial
Service Bus on Blockchain Workbench
Event Notifications on Blockchain Workbench

Deploy to Azure
Deploy to Azure
High Performance Computing (HPC) on Azure
11/13/2020 • 7 minutes to read • Edit Online

Introduction to HPC
High Performance Computing (HPC), also called "Big Compute", uses a large number of CPU- or GPU-based computers to solve complex mathematical tasks.
Many industries use HPC to solve some of their most difficult problems. These include workloads such as:
Genomics
Oil and gas simulations
Finance
Semiconductor design
Engineering
Weather modeling
How is HPC different on the cloud?
One of the primary differences between an on-premises HPC system and one in the cloud is the ability to dynamically add and remove resources as they're needed. Dynamic scaling removes compute capacity as a bottleneck and instead allows customers to right-size their infrastructure for the requirements of their jobs.
The following articles provide more detail about this dynamic scaling capability.
Big Compute Architecture Style
Autoscaling best practices

Implementation checklist
As you're looking to implement your own HPC solution on Azure, ensure you've reviewed the following topics:
Choose the appropriate architecture based on your requirements
Know which compute option is right for your workload
Identify the right storage solution that meets your needs
Decide how you're going to manage all your resources
Optimize your application for the cloud
Secure your Infrastructure

Infrastructure
There are a number of infrastructure components necessary to build an HPC system. Compute, Storage, and
Networking provide the underlying components, no matter how you choose to manage your HPC workloads.
Example HPC architectures
There are a number of different ways to design and implement your HPC architecture on Azure. HPC applications
can scale to thousands of compute cores, extend on-premises clusters, or run as a 100% cloud-native solution.
The following scenarios outline a few of the common ways HPC solutions are built.
Computer-aided engineering services on Azure
Provide a software-as-a-service (SaaS) platform for computer-aided engineering (CAE) on Azure.

Computational fluid dynamics (CFD) simulations on Azure


Execute computational fluid dynamics (CFD) simulations on Azure.
3D video rendering on Azure
Run native HPC workloads in Azure using the Azure Batch service
Compute
Azure offers a range of sizes that are optimized for both CPU & GPU intensive workloads.
CPU-based virtual machines
Linux VMs
Windows VMs
GPU-enabled virtual machines
N-series VMs feature NVIDIA GPUs designed for compute-intensive or graphics-intensive applications including
artificial intelligence (AI) learning and visualization.
Linux VMs
Windows VMs
Storage
Large-scale Batch and HPC workloads have demands for data storage and access that exceed the capabilities of
traditional cloud file systems. There are a number of solutions to manage both the speed and capacity needs of
HPC applications on Azure:
Avere vFXT for faster, more accessible data storage for high-performance computing at the edge
Azure NetApp Files
BeeGFS
Storage Optimized Virtual Machines
Blob, table, and queue storage
Azure SMB File storage
For more information comparing Lustre, GlusterFS, and BeeGFS on Azure, review the Parallel File Systems on
Azure e-book and the Lustre on Azure blog.
Networking
H16r, H16mr, A8, and A9 VMs can connect to a high throughput back-end RDMA network. This network can
improve the performance of tightly coupled parallel applications running under Microsoft MPI or Intel MPI.
RDMA Capable Instances
Virtual Network
ExpressRoute

Management
Do-it-yourself
Building an HPC system from scratch on Azure offers a significant amount of flexibility, but is often very
maintenance intensive.
1. Set up your own cluster environment in Azure virtual machines or virtual machine scale sets.
2. Use Azure Resource Manager templates to deploy leading workload managers, infrastructure, and applications.
3. Choose HPC and GPU VM sizes that include specialized hardware and network connections for MPI or GPU
workloads.
4. Add high performance storage for I/O-intensive workloads.
Hybrid and cloud Bursting
If you have an existing on-premises HPC system that you'd like to connect to Azure, there are a number of
resources to help get you started.
First, review the Options for connecting an on-premises network to Azure article in the documentation. From there,
you may want information on these connectivity options:

Connect an on-premises network to Azure using a VPN gateway


This reference architecture shows how to extend an on-premises network to Azure, using a site-to-site
virtual private network (VPN).

Connect an on-premises network to Azure using ExpressRoute


ExpressRoute connections use a private, dedicated connection through a third-party connectivity provider.
The private connection extends your on-premises network into Azure.

Connect an on-premises network to Azure using ExpressRoute with VPN failover


Implement a highly available and secure site-to-site network architecture that spans an Azure virtual
network and an on-premises network connected using ExpressRoute with VPN gateway failover.
Once network connectivity is securely established, you can start using cloud compute resources on-demand with
the bursting capabilities of your existing workload manager.
Marketplace solutions
There are a number of workload managers offered in the Azure Marketplace.
RogueWave CentOS-based HPC
SUSE Linux Enterprise Server for HPC
TIBCO Grid Server Engine
Azure Data Science VM for Windows and Linux
D3View
UberCloud
Azure Batch
Azure Batch is a platform service for running large-scale parallel and high-performance computing (HPC)
applications efficiently in the cloud. Azure Batch schedules compute-intensive work to run on a managed pool of
virtual machines, and can automatically scale compute resources to meet the needs of your jobs.
SaaS providers or developers can use the Batch SDKs and tools to integrate HPC applications or container
workloads with Azure, stage data to Azure, and build job execution pipelines.
Azure CycleCloud
Azure CycleCloud provides the simplest way to manage HPC workloads using any scheduler (like Slurm, Grid Engine, HPC Pack, HTCondor, LSF, PBS Pro, or Symphony) on Azure.
CycleCloud allows you to:
Deploy full clusters and other resources, including scheduler, compute VMs, storage, networking, and cache
Orchestrate job, data, and cloud workflows
Give admins full control over which users can run jobs, as well as where and at what cost
Customize and optimize clusters through advanced policy and governance features, including cost controls,
Active Directory integration, monitoring, and reporting
Use your current job scheduler and applications without modification
Take advantage of built-in autoscaling and battle-tested reference architectures for a wide range of HPC
workloads and industries
Workload managers
The following are examples of cluster and workload managers that can run in Azure infrastructure. Create stand-
alone clusters in Azure VMs or burst to Azure VMs from an on-premises cluster.
Alces Flight Compute
TIBCO DataSynapse GridServer
Bright Cluster Manager
IBM Spectrum Symphony and Symphony LSF
Altair PBS Works
Rescale
Univa Grid Engine
Microsoft HPC Pack
HPC Pack for Windows
HPC Pack for Linux
Containers
Containers can also be used to manage some HPC workloads. Services like the Azure Kubernetes Service (AKS)
makes it simple to deploy a managed Kubernetes cluster in Azure.
Azure Kubernetes Service (AKS)
Container Registry

Cost management
You can manage your HPC costs on Azure in a few different ways. Ensure you've reviewed the Azure purchasing options to find the method that works best for your organization.
Security
For an overview of security best practices on Azure, review the Azure Security Documentation.
In addition to the network configurations available in the Cloud Bursting section, you may want to implement a
hub/spoke configuration to isolate your compute resources:

Implement a hub-spoke network topology in Azure


The hub is a virtual network (VNet) in Azure that acts as a central point of connectivity to your on-premises
network. The spokes are VNets that peer with the hub, and can be used to isolate workloads.

Implement a hub-spoke network topology with shared services in Azure


This reference architecture builds on the hub-spoke reference architecture to include shared services in the
hub that can be consumed by all spokes.

HPC applications
Run custom or commercial HPC applications in Azure. Several examples in this section are benchmarked to scale
efficiently with additional VMs or compute cores. Visit the Azure Marketplace for ready-to-deploy solutions.

NOTE
Check with the vendor of any commercial application for licensing or other restrictions for running in the cloud. Not all
vendors offer pay-as-you-go licensing. You might need a licensing server in the cloud for your solution, or connect to an on-
premises license server.

Engineering applications
Altair RADIOSS
ANSYS CFD
MATLAB Distributed Computing Server
StarCCM+
OpenFOAM
Graphics and rendering
Autodesk Maya, 3ds Max, and Arnold on Azure Batch
AI and deep learning
Microsoft Cognitive Toolkit
Batch Shipyard recipes for deep learning
MPI Providers
Microsoft MPI

Remote visualization
Linux virtual desktops with Citrix
Build a VDI environment for Linux Desktops using Citrix on Azure.

Performance Benchmarks
Compute benchmarks

Customer stories
There are a number of customers who have seen great success by using Azure for their HPC workloads. You can
find a few of these customer case studies below:
AXA Global P&C
Axioma
d3View
EFS
Hymans Robertson
MetLife
Microsoft Research
Milliman
Mitsubishi UFJ Securities International
NeuroInitiative
Schlumberger
Towers Watson

Other important information


Ensure your vCPU quota has been increased before attempting to run large-scale workloads.

Next steps
For the latest announcements, see:
Microsoft HPC and Batch team blog
Visit the Azure blog.
Microsoft Batch Examples
These tutorials will provide you with details on running applications on Microsoft Batch
Get started developing with Batch
Use Azure Batch code samples
Use low-priority VMs with Batch
Run containerized HPC workloads with Batch Shipyard
Run parallel R workloads on Batch
Run on-demand Spark jobs on Batch
Use compute-intensive VMs in Batch pools
3D video rendering on Azure
11/2/2020 • 5 minutes to read • Edit Online

3D video rendering is a time-consuming process that requires a significant amount of CPU time to complete. On a single machine, the process of generating a video file from static assets can take hours or even days depending on the length and complexity of the video you are producing. Many companies either purchase expensive high-end desktop computers to perform these tasks, or invest in large render farms that they can submit jobs to. However, by taking advantage of Azure Batch, that power is available to you when you need it and shuts itself down when you don't, all without any capital investment.
Batch gives you a consistent management experience and job scheduling, whether you select Windows Server or
Linux compute nodes. With Batch, you can use your existing Windows or Linux applications, including AutoDesk
Maya and Blender, to run large-scale render jobs in Azure.

Relevant use cases


Other relevant use cases include:
3D modeling
Visual FX (VFX) rendering
Video transcoding
Image processing, color correction, and resizing

Architecture

This scenario shows a workflow that uses Azure Batch. The data flows as follows:
1. Upload input files and the applications to process those files to your Azure Storage account.
2. Create a Batch pool of compute nodes in your Batch account, a job to run the workload on the pool, and tasks in
the job.
3. Download input files and the applications to Batch.
4. Monitor task execution.
5. Upload task output.
6. Download output files.
To simplify this process, you could also use the Batch Plugins for Maya and 3ds Max.
Components
Azure Batch builds on the following Azure technologies:
Virtual Networks are used for both the head node and the compute resources.
Azure Storage accounts are used for synchronization and data retention.
Virtual machine scale sets are used by CycleCloud for compute resources.

Considerations
Machine Sizes available for Azure Batch
While most rendering customers will choose resources with high CPU power, other workloads using virtual machine scale sets may choose VMs differently, depending on a number of factors:
Is the application memory bound?
Does the application need to use GPUs?
Are the job types embarrassingly parallel, or do they require InfiniBand connectivity for tightly coupled jobs?
Do they require fast I/O to access storage on the compute nodes?
Azure has a wide range of VM sizes that can address each and every one of the above application requirements,
some are specific to HPC, but even the smallest sizes can be used to provide an effective grid implementation:
HPC VM sizes Due to the CPU-bound nature of rendering, Microsoft typically suggests the Azure H-series VMs. This type of VM is built specifically for high-end computational needs; it is available in 8 and 16 core vCPU sizes and features DDR4 memory, SSD temporary storage, and Haswell E5 Intel technology.
GPU VM sizes GPU optimized VM sizes are specialized virtual machines available with single or multiple NVIDIA
GPUs. These sizes are designed for compute-intensive, graphics-intensive, and visualization workloads.
NC, NCv2, NCv3, and ND sizes are optimized for compute-intensive and network-intensive applications and
algorithms, including CUDA and OpenCL-based applications and simulations, AI, and Deep Learning. NV sizes
are optimized and designed for remote visualization, streaming, gaming, encoding, and VDI scenarios using
frameworks such as OpenGL and DirectX.
Memory optimized VM sizes When more memory is required, the memory optimized VM sizes offer a higher
memory-to-CPU ratio.
General purposes VM sizes General-purpose VM sizes are also available and provide balanced CPU-to-memory
ratio.
Alternatives
If you require more control over your rendering environment in Azure or need a hybrid implementation, then
CycleCloud computing can help orchestrate an IaaS grid in the cloud. Using the same underlying Azure
technologies as Azure Batch, it makes building and maintaining an IaaS grid an efficient process. To find out more
and learn about the design principles, use the following link:
For a complete overview of all the HPC solutions that are available to you in Azure, see the article HPC, Batch, and
Big Compute solutions using Azure VMs
Availability
Monitoring of the Azure Batch components is available through a range of services, tools, and APIs. Monitoring is
discussed further in the Monitor Batch solutions article.
Scalability
Pools within an Azure Batch account can either scale through manual intervention or, by using a formula based on
Azure Batch metrics, be scaled automatically. For more information on scalability, see the article Create an
automatic scaling formula for scaling nodes in a Batch pool.
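
As an illustration only, an autoscale formula of the kind described in that article (written in the Batch autoscale formula language and shown here as a Scala string, with an arbitrary cap of 20 nodes) might look roughly like this:

// Illustrative formula; you supply a string like this when enabling autoscale on the pool
// through the portal, the Azure CLI, or one of the Batch SDKs.
val autoScaleFormula =
  """$tasks = max(0, avg($ActiveTasks.GetSample(TimeInterval_Minute * 5)));
    |$TargetDedicatedNodes = min(20, $tasks);
    |$NodeDeallocationOption = taskcompletion;""".stripMargin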
Security
For general guidance on designing secure solutions, see the Azure Security Documentation.
Resiliency
While there is currently no failover capability in Azure Batch, we recommend using the following steps to ensure
availability if there is an unplanned outage:
Create an Azure Batch account in an alternate Azure location with an alternate Storage Account
Create the same node pools with the same name, with zero nodes allocated
Ensure Applications are created and updated to the alternate Storage Account
Upload input files and submit jobs to the alternate Azure Batch account

Deploy the scenario


Create an Azure Batch account and pools manually
This scenario demonstrates how Azure Batch works while showcasing Azure Batch Labs as an example SaaS
solution that can be developed for your own customers:
Azure Batch Labs
Deploy the components
The template will deploy:
A new Azure Batch account
A storage account
A node pool associated with the batch account
The node pool will be configured to use A2 v2 VMs with Canonical Ubuntu images
The node pool will contain zero VMs initially and will require you to manually scale to add VMs
Click the link below to deploy the solution.

Learn more about Resource Manager templates

Pricing
The cost of using Azure Batch will depend on the VM sizes that are used for the pools and how long these VMs are allocated and running; there is no cost associated with creating an Azure Batch account. Storage and data egress should be taken into account, as these will incur additional costs.
The following are examples of costs that could be incurred for a job that completes in 8 hours using a different
number of servers:
100 High-Performance CPU VMs: Cost Estimate
100 x H16m (16 cores, 225 GB RAM, Premium Storage 512 GB), 2 TB Blob Storage, 1-TB egress
50 High-Performance CPU VMs: Cost Estimate
50 x H16m (16 cores, 225 GB RAM, Premium Storage 512 GB), 2 TB Blob Storage, 1-TB egress
10 High-Performance CPU VMs: Cost Estimate
10 x H16m (16 cores, 225 GB RAM, Premium Storage 512 GB), 2 TB Blob Storage, 1-TB egress
Pricing for low-priority VMs
Azure Batch also supports the use of low-priority VMs in the node pools, which can potentially provide a
substantial cost saving. For more information, including a price comparison between standard VMs and low-
priority VMs, see Azure Batch Pricing.

NOTE
Low-priority VMs are only suitable for certain applications and workloads.

Related resources
Azure Batch Overview
Azure Batch Documentation
Using containers on Azure Batch
A computer-aided engineering service on Azure
11/2/2020 • 5 minutes to read • Edit Online

This example scenario demonstrates delivery of a software-as-a-service (SaaS) platform built on the high-
performance computing (HPC) capabilities of Azure. This scenario is based on an engineering software solution.
However, the architecture is relevant to other industries requiring HPC resources such as image rendering, complex
modeling, and financial risk calculation.
This example demonstrates an engineering software provider that delivers computer-aided engineering (CAE)
applications to engineering firms and manufacturing enterprises. CAE solutions enable innovation, reduce
development times, and lower costs throughout the lifetime of a product's design. These solutions require
substantial compute resources and often process high data volumes. The high costs of an on-premises HPC
appliance or high-end workstations often put these technologies out of reach for small engineering firms,
entrepreneurs, and students.
The company wants to expand the market for its applications by building a SaaS platform backed by cloud-based
HPC technologies. Their customers should be able to pay for compute resources as needed and access massive
computing power that would be unaffordable otherwise.
The company's goals include:
Taking advantage of HPC capabilities in Azure to accelerate the product design and testing process.
Using the latest hardware innovations to run complex simulations, while minimizing the costs for simpler
simulations.
Enabling true-to-life visualization and rendering in a web browser, without requiring a high-end engineering
workstation.

Relevant use cases


Other relevant use cases include:
Genomics research
Weather simulation
Computational chemistry applications

Architecture
Users can access NV-series virtual machines (VMs) via a browser with an HTML5-based RDP connection using
the Apache Guacamole service. These VM instances provide powerful GPUs for rendering and collaborative
tasks. Users can edit their designs and view their results without needing access to high-end mobile computing
devices or laptops. The scheduler spins up additional VMs based on user-defined heuristics.
From a desktop CAD session, users can submit workloads for execution on available HPC cluster nodes. These
workloads perform tasks such as stress analysis or computational fluid dynamics calculations, eliminating the
need for dedicated on-premises compute clusters. These cluster nodes can be configured to autoscale based on
load or queue depth based on active user demand for compute resources.
Azure Kubernetes Service (AKS) is used to host the web resources available to end users.
Components
H-series virtual machines are used to run compute-intensive simulations such as molecular modeling and
computational fluid dynamics. The solution also takes advantage of technologies like remote direct memory
access (RDMA) connectivity and InfiniBand networking.
NV-series virtual machines give engineers high-end workstation functionality from a standard web browser.
These virtual machines have NVIDIA Tesla M60 GPUs that support advanced rendering and can run single
precision workloads.
General purpose virtual machines running CentOS handle more traditional workloads such as web applications.
Application Gateway load balances the requests coming into the web servers.
Azure Kubernetes Service (AKS) is used to run scalable workloads at a lower cost for simulations that don't
require the high end capabilities of HPC or GPU virtual machines.
Altair PBS Works Suite orchestrates the HPC workflow, ensuring that enough virtual machine instances are
available to handle the current load. It also deallocates virtual machines when demand is lower to reduce costs.
Blob storage stores files that support the scheduled jobs.
Alternatives
Azure CycleCloud simplifies creating, managing, operating, and optimizing HPC clusters. It offers advanced
policy and governance features. CycleCloud supports any job scheduler or software stack.
HPC Pack can create and manage an Azure HPC cluster for Windows Server-based workloads. HPC Pack isn't an
option for Linux-based workloads.
Azure Automation State Configuration provides an infrastructure-as-code approach to defining the virtual
machines and software to be deployed. Virtual machines can be deployed as part of a virtual machine scale set,
with autoscaling rules for compute nodes based on the number of jobs submitted to the job queue (see the sketch
after this list). When a new
virtual machine is needed, it is provisioned using the latest patched image from the Azure image gallery, and
then the required software is installed and configured via a PowerShell DSC configuration script.
Azure Functions
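
The following Azure CLI commands are a minimal sketch of the scale set approach described above. The resource
names, image, and VM size are hypothetical, and the autoscale rule uses average CPU as a simple stand-in for the
job-queue heuristics, which would normally be driven by a custom metric or by the scheduler itself.

# Create a scale set for compute nodes from a golden image (hypothetical names).
az vmss create \
  --resource-group hpc-rg \
  --name compute-nodes \
  --image /subscriptions/<subscription-id>/resourceGroups/hpc-rg/providers/Microsoft.Compute/images/hpc-golden-image \
  --vm-sku Standard_H16r \
  --instance-count 1 \
  --admin-username azureuser \
  --generate-ssh-keys

# Attach an autoscale profile to the scale set.
az monitor autoscale create \
  --resource-group hpc-rg \
  --resource compute-nodes \
  --resource-type Microsoft.Compute/virtualMachineScaleSets \
  --name hpc-autoscale \
  --min-count 1 --max-count 20 --count 1

# Scale out by two instances when average CPU stays above 70% for five minutes.
az monitor autoscale rule create \
  --resource-group hpc-rg \
  --autoscale-name hpc-autoscale \
  --condition "Percentage CPU > 70 avg 5m" \
  --scale out 2
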

Considerations
While using an infrastructure-as-code approach is a great way to manage virtual machine build definitions, it
can take a long time to provision a new virtual machine using a script. This solution found a good middle
ground by using the DSC script to periodically create a golden image, which can then be used to provision a
new virtual machine faster than completely building a VM on demand using DSC. Azure DevOps Services or
other CI/CD tooling can periodically refresh golden images using DSC scripts.
Balancing overall solution costs with fast availability of compute resources is a key consideration. Provisioning a
pool of N-series virtual machine instances and putting them in a deallocated state lowers the operating costs.
When an additional virtual machine is needed, reallocating an existing instance involves starting the virtual
machine on a different host. However, the PCI bus detection time the OS would otherwise need to identify and
install drivers for the GPU is avoided, because a virtual machine that is deallocated and then reallocated retains
the same PCI bus assignment for the GPU when it restarts.
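
As a rough illustration, the following Azure CLI commands (with hypothetical names) deallocate a pooled NV-series
instance when it is idle, so that only storage is billed, and start it again when a user needs a GPU session:

# Deallocate an idle NV-series VM; only storage is billed while deallocated.
az vm deallocate --resource-group hpc-rg --name nv-workstation-01

# Reallocate (start) the VM when a user needs a GPU session again.
az vm start --resource-group hpc-rg --name nv-workstation-01
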
The original architecture relied entirely on Azure virtual machines for running simulations. In order to reduce
costs for workloads that didn't require all the capabilities of a virtual machine, these workloads were
containerized and deployed to Azure Kubernetes Service (AKS).
The company's workforce had existing skills in open-source technologies. They can take advantage of these
skills by building on technologies like Linux and Kubernetes.

Pricing
To help you explore the cost of running this scenario, many of the required services are pre-configured in a cost
calculator example. The costs of your solution depend on the number and scale of services needed to meet your
requirements.
The following considerations will drive a substantial portion of the costs for this solution:
Azure virtual machine costs increase linearly as additional instances are provisioned. Virtual machines that are
deallocated will only incur storage costs, and not compute costs. These deallocated machines can then be
reallocated when demand is high.
Azure Kubernetes Services costs are based on the VM type chosen to support the workload. The costs will
increase linearly based on the number of VMs in the cluster.

Next steps
Read the Altair customer story. This example scenario is based on a version of their architecture.
Review other Big Compute solutions available in Azure.
Accelerate digital image-based modeling on Azure
11/2/2020 • 7 minutes to read • Edit Online

This example scenario provides architecture and design guidance for any organization that wants to perform
image-based modeling on Azure infrastructure-as-a-service (IaaS). The scenario is designed for running
photogrammetry software on Azure Virtual Machines (VMs) using high-performance storage that accelerates
processing time. The environment can be scaled up and down as needed and supports terabytes of storage without
sacrificing performance.

Relevant use cases


Relevant use cases include:
Modeling and measuring buildings, engineering structures, and forensic accident scenes.
Creating visual effects for computer games and movies.
Using digital images to indirectly generate measurements of objects of various scales as in urban planning and
other applications.

Architecture
This example describes the use of Agisoft PhotoScan photogrammetry software backed by Avere vFXT storage.
PhotoScan was chosen for its popularity in geographic information system (GIS) applications, cultural heritage
documentation, game development, and visual effects production. It is suitable for both close-range
photogrammetry and aerial photogrammetry.
The concepts in this article apply to any high-performance computing (HPC) workload based on a scheduler and
worker nodes managed as infrastructure. For this workload, Avere vFXT was selected for its superior performance
during benchmark tests. However, the scenario decouples the storage from the processing so that other storage
solutions can be used (see alternatives later in this document).
This architecture also includes Active Directory domain controllers to control access to Azure resources and provide
internal name resolution through the Domain Name System (DNS). Jump boxes provide administrator access to
the Windows and Linux VMs that run the solution.
1. User submits a number of images to PhotoScan.
2. The PhotoScan Scheduler runs on a Windows VM that serves as the head node and directs processing of the
user's images.
3. PhotoScan searches for common points on the photographs and constructs the geometry (mesh) using the
PhotoScan processing nodes running on VMs with graphics processing units (GPUs).
4. Avere vFXT provides a high-performance storage solution on Azure based on Network File System version 3
(NFSv3) and comprised of at least four VMs.
5. PhotoScan renders the model.
Components
Agisoft PhotoScan: The PhotoScan Scheduler runs on a Windows 2016 Server VM, and the processing nodes
use five VMs with GPUs that run CentOS Linux 7.5.
Avere vFXT is a file caching solution that uses object storage and traditional network-attached storage (NAS) to
optimize storage of large datasets. It includes:
Avere Controller. This VM executes the script that installs the Avere vFXT cluster and runs Ubuntu 18.04
LTS. The VM can be used later to add or remove cluster nodes and to destroy the cluster as well.
vFXT cluster. At least three VMs are used, one for each of the Avere vFXT nodes based on Avere OS
5.0.2.1. These VMs form the vFXT cluster, which is attached to Azure Blob storage.
Microsoft Active Directory domain controllers allow the host access to domain resources and provide DNS
name resolution. Avere vFXT adds a number of A records — for example, each A record in a vFXT cluster points
to the IP address of each Avere vFXT node. In this setup, all VMs use the round-robin pattern to access vFXT
exports.
Other VMs serve as jump boxes used by the administrator to access the scheduler and processing nodes. The
Windows jumpbox is mandatory to allow the administrator to access the head node via remote desktop
protocol. The second jumpbox is optional and runs Linux for administration of the worker nodes.
Network security groups limit access to the public IP address (PIP) and allow ports 3389 and 22 for access to the
VMs attached to the Jumpbox subnet.
Virtual network peering connects a PhotoScan virtual network to an Avere virtual network.
Azure Blob storage works with Avere vFXT as the core filer to store the committed data being processed. Avere
vFXT identifies the active data stored in Azure Blob and tiers it into solid-state drives (SSD) used for caching in
its compute nodes while a PhotoScan job is running. If changes are made, the data is asynchronously committed
back to the core filer.
Azure Key Vault is used to store the administrator passwords and PhotoScan activation code.
Alternatives
To take advantage of Azure services for managing an HPC cluster, use tools such as Azure CycleCloud or Azure
Batch instead of managing the resources through templates or scripts.
Deploy the BeeGFS parallel virtual file system as the back-end storage on Azure instead of Avere vFXT. Use the
BeeGFS template to deploy this end-to-end solution on Azure.
Deploy the storage solution of your choice, such as GlusterFS, Lustre, or Windows Storage Spaces Direct. To do
this, edit the PhotoScan template to work with the storage solution you want.
Deploy the worker nodes with the Windows operating system instead of Linux, the default option. When
choosing Windows nodes, storage integration options are not executed by the deployment templates. You must
manually integrate the environment with an existing storage solution, or customize the PhotoScan template to
provide such automation, as described in the repository.

Considerations
This scenario is designed specifically to provide high-performance storage for an HPC workload, whether it is
deployed on Windows or Linux. In general, the storage configuration of the HPC workload should match the
appropriate best practices used for on-premises deployments.
Deployment considerations depend on the applications and services used, but a few notes apply:
When building high-performance applications, use Azure Premium Storage and optimize the application layer.
Optimize storage for frequent access using Azure Blob hot tier access.
Use a storage replication option that meets your availability and performance requirements. In this example,
Avere vFXT is configured for high availability by default, with locally redundant storage (LRS). For load
balancing, all VMs in this setup use the round-robin pattern to access vFXT exports.
If the backend storage will be consumed by both Windows clients and Linux clients, use Samba servers to
support the Windows nodes. A version of this example scenario based on BeeGFS uses Samba to support the
scheduler node of the HPC workload (PhotoScan) running on Windows. A load balancer is deployed to act as a
smart replacement for DNS round robin.
Run HPC applications using the VM type best suited for your Windows or Linux workload.
To isolate the HPC workload from the storage resources, deploy each in its own virtual network, then use virtual
network peering to connect the two. Peering creates a low-latency, high-bandwidth connection between
resources in different virtual networks and routes traffic through the Microsoft backbone infrastructure, using
private IP addresses only.
Security
This example focuses on deploying a high-performance storage solution for an HPC workload and is not a security
solution. Make sure to involve your security team for any changes.
For added security, this example infrastructure enables all the Windows VMs to be domain-joined and uses Active
Directory for central authentication. It also provides custom DNS services for all VMs. To help protect the
environment, this template relies on network security groups. Network security groups offer basic traffic filtering
and security rules.
Consider the following options to further improve security in this scenario:
Use network virtual appliances such as Fortinet, Checkpoint, and Juniper.
Apply role-based access control to the resource groups.
Enable VM JIT access if jump boxes are accessed via the Internet.
Use Azure Key Vault to store the passwords used by administrator accounts.

Pricing
The cost of running this scenario can vary greatly depending on multiple factors. The number and size of VMs, how
much storage is required, and the amount of time to complete a job will determine your cost.
The following sample cost profile in the Azure pricing calculator is based on a typical configuration for Avere vFXT
and PhotoScan:
1 A1_v2 Ubuntu VM to run the Avere controller.
3 D16s_v3 Avere OS VMs, one for each of the Avere vFXT nodes that form the vFXT cluster.
5 NC24_v2 Linux VMs to provide the GPUs needed by the PhotoScan processing nodes.
1 D8s_v3 CentOS VM for the PhotoScan scheduler node.
1 DS2_v2 CentOS used as administrator jumpbox.
2 DS2_v2 VMs for the Active Directory domain controllers.
Premium managed disks.
General purpose v2 (GPv2) Blob storage with LRS and hot tier access (only GPv2 storage accounts expose the
Access Tier attribute).
Virtual network with support for 10 TB data transfer.
For details about this architecture, see the e-book. To see how the pricing would change for your particular use
case, choose different VM sizes in the pricing calculator to match your expected deployment.

Deployment
For step-by-step instructions for deploying this architecture, including all the prerequisites for using either Avere
vFXT or BeeGFS, download the e-book Deploy Agisoft PhotoScan on Azure With Avere vFXT for Azure or BeeGFS.

Related resources
The following resources will provide more information on the components used in this scenario, along with
alternative approaches for batch computing on Azure.
Overview of Avere vFXT for Azure
Agisoft PhotoScan home page
Azure Storage Performance and Scalability Checklist
Parallel Virtual File Systems on Microsoft Azure: Performance tests of Lustre, GlusterFS, and BeeGFS (PDF)
An example scenario for computer-aided engineering (CAE) on Azure
HPC on Azure home page
Overview of Big Compute: HPC & Microsoft Batch
HPC Risk Analysis Template – Solution Architecture
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

This templated risk analysis solution uses Azure HPC compute and GPU virtual machines (VMs) to expand on-
premises TIBCO GridServer compute to Azure using Azure CycleCloud for auto-scaling integration. The job
executes both on-premises and in the cloud by using Avere vFXT fast caching and native NFS access to market data
available on-premises.

Architecture

Download an SVG of this architecture.

Data Flow
1. Operations team uses Azure CycleCloud to configure and launch risk analysis grid in Azure.
2. Azure CycleCloud orchestrates VM creation and software configuration for TIBCO GridServer brokers and
HPCCA, in-memory data cache, and Avere vFXT cache.
3. Quant (or scheduled batch) submits a risk analysis template workflow to the on-premises TIBCO GridServer
director. Based on job policies and current on-premises use, the workflow is allowed to burst to Azure to expand
on-premises grid capacity.
4. The TIBCO HPCCA detects the change in queue depth for each TIBCO broker and requests additional TIBCO
engine capacity using the Azure CycleCloud Auto-Scaling API. Azure CycleCloud then autostarts engine nodes in
Virtual Machine Scale Sets using the Azure H-series, HB-series, and HC-series VMs to optimize cost and
performance and NC-series VMs to provide GPU capacity as required.
5. As soon as engine VMs join the Azure grid, the brokers begin dispatching tasks to the new nodes.
6. Risk jobs pull artifacts from on-premises and Azure Blob storage as needed from NFS mounted Avere vFXT
and/or via the fast in-memory cache.
7. As each task completes, results are returned to the submitter or driver and data is written back to the in-
memory cache, or to NFS storage through the Avere vFXT, as required. Cached data is persisted either on-
premises or in Azure Blob storage.
8. As task queues drain, the TIBCO HPCCA uses the Azure CycleCloud Auto-Scaling API to shrink the compute grid
and reduce cost.

Components
N-Series Virtual Machines: N-series virtual machines are ideal for compute and graphics-intensive workloads,
helping customers to fuel innovation through scenarios like high-end remote visualization, deep learning, and
predictive analytics.
H-Series Virtual Machines: The H-series is a new family specifically designed to handle high performance
computing workloads such as financial risk modeling, seismic and reservoir simulation, molecular modeling,
and genomic research.
Azure CycleCloud: Effectively manage common workloads with ease while creating and optimizing HPC clusters.
Avere vFXT: Faster, more accessible data storage for high-performance computing at the edge
TIBCO GridServer® is a market-leading infrastructure platform for grid and elastic computing, and the
backbone of businesses operating in the world's most demanding markets. More than a million CPUs spread
across a thousand global installations form enterprise grids managed by GridServer.

Next steps
N-Series Virtual Machines Documentation
H-Series Virtual Machines Documentation
Azure CycleCloud Documentation
Avere vFXT Documentation
TIBCO GridServer Documentation
HPC System and Big Compute Solutions
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Big compute and high performance computing (HPC) workloads are typically compute-intensive and can be run in
parallel, taking advantage of the scale and flexibility of the cloud. The workloads are often run asynchronously
using batch processing, with compute resources required to run the work and job scheduling required to specify
the work. Examples of Big Compute and HPC workloads include financial risk Monte Carlo simulations, image
rendering, media transcoding, file processing, and engineering or scientific simulations.
This solution implements a cloud-native application with Azure Batch, which provides compute resource allocation
and management, application installation, resource autoscaling, and job scheduling as a platform service. Batch
also offers higher-level workload accelerators specifically for running R in parallel, AI training, and rendering
workloads.
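As a rough sketch, the pool, job, and task pattern might look like the following with the Azure CLI. The IDs, VM
size, and image are hypothetical, and the commands assume you have already signed in to a Batch account with
az batch account login.

# Create a pool of compute nodes.
az batch pool create \
  --id hpc-pool \
  --vm-size Standard_D2s_v3 \
  --target-dedicated-nodes 2 \
  --image canonical:ubuntuserver:18.04-lts \
  --node-agent-sku-id "batch.node.ubuntu 18.04"

# Create a job on the pool and add a task that runs part of the workload.
az batch job create --id hpc-job --pool-id hpc-pool
az batch task create --job-id hpc-job --task-id task1 \
  --command-line "/bin/bash -c 'echo processing input files'"
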
This solution is built on managed services including Virtual Machines, Storage, and Batch. These Azure services run
in a high-availability environment, patched and supported, allowing you to focus on your solution.
The links to the right provide documentation on deploying and managing the Azure products listed in the solution
architecture above.
Batch documentation
Virtual Machines
Azure Batch
Azure Blob Storage

Architecture
Download an SVG of this architecture.

Data Flow
1. Upload input files and the applications to your Azure Storage account.
2. Create a Batch pool of compute nodes, a job to run the workload on the pool, and the tasks in the job.
3. Batch downloads input files and applications.
4. Batch monitors task execution.
5. Batch uploads task output.
6. Download output files.

Components
Storage Accounts: Massively scalable object storage for unstructured data.
Batch: Cloud-scale job scheduling and compute management.

Next steps
Quickstart: Upload, download, and list blobs using the Azure portal
Quickstart: Run your first Batch job in the Azure portal
HPC cluster deployed in the cloud
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

High performance computing (HPC) applications can scale to thousands of compute cores, extend on-premises big
compute, or run as a 100% cloud-native solution. This HPC solution, including the head node, compute nodes, and
storage nodes, runs in Azure with no hardware infrastructure to maintain.
This solution is built on the Azure managed services: Virtual Machine Scale Sets, Virtual Network and Storage
Accounts. These services run in a high-availability environment, patched and supported, allowing you to focus on
your solution instead of the environment they run in.

Architecture

Download an SVG of this architecture.


Components
HPC head node
Virtual Machine Scale Sets
Virtual Network provides IP connectivity between the head node, compute nodes, and storage nodes.
Storage Accounts: Azure Storage blobs store the disks backing the virtual machines and provide long-term
storage of unstructured data and executable files used by the HPC application.
Azure Resource Manager templates: Resource Manager templates or script files are used to deploy your
application to the HPC environment.
Hybrid HPC in Azure with HPC Pack
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Microsoft HPC Pack is a free high performance computing (HPC) solution built on Microsoft Azure and Windows
Server technologies. HPC Pack combines a comprehensive set of deployment, administration, job scheduling, and
monitoring tools for your Windows and Linux HPC cluster environment, providing a flexible platform for
developing and running HPC applications on premises and in Azure.
This solution shows the process for using HPC Pack to create a hybrid (on-premises and Azure) HPC environment.
The links to the right provide documentation on deploying and managing the Azure products listed in the solution
architecture above.
Documentation Home Page
HPC Pack Azure Deployment Options

Architecture

Download an SVG of this architecture.

Data Flow
1. Log into on-premises head node
2. Add Azure compute nodes to the cluster
3. Start the compute nodes
4. Submit jobs to the cluster
5. HPC Pack sends jobs to on-premises and Azure nodes based upon the node group selected
6. Monitor job progress
7. Stop the compute nodes or configure auto-scaling

Components
Virtual Machines: Create Linux and Windows virtual machines in seconds.
Microsoft HPC Pack: Free high performance computing (HPC) solution built on Microsoft Azure and Windows
Server technologies.

Next steps
Create a Windows virtual machine with the Azure portal
Set up a hybrid high performance computing (HPC) cluster with Microsoft HPC Pack and on-demand Azure
compute nodes
Linux Virtual Desktops with Citrix
11/13/2020 • 6 minutes to read • Edit Online

This example scenario is applicable to any industry that needs a Virtual Desktop Infrastructure (VDI) for Linux
Desktops. VDI refers to the process of running a user desktop inside a virtual machine that lives on a server in the
datacenter. The customer in this scenario chose to use a Citrix-based solution for their VDI needs.
Organizations often have heterogeneous environments with multiple devices and operating systems being used by
employees. It can be challenging to provide consistent access to applications while maintaining a secure
environment. A VDI solution for Linux desktops will allow your organization to provide access irrespective of the
device or OS used by the end user.
Some benefits of this scenario include the following:
Return on investment will be higher with shared Linux virtual desktops by giving more users access to the same
infrastructure. By consolidating resources on a centralized VDI environment, the end user devices don't need to
be as powerful.
Performance will be consistent regardless of the end user device.
Users can access Linux applications from any device (including non-Linux devices).
Sensitive data can be secured in the Azure datacenter for all distributed employees.

Relevant use cases


Consider this scenario for the following use case:
Providing secure access to mission-critical, specialized Linux VDI desktops from Linux or non-Linux devices

Architecture

This example scenario demonstrates allowing the corporate network to access the Linux Virtual Desktops:
An ExpressRoute is established between the on-premises environment and Azure, for fast and reliable
connectivity to the cloud.
Citrix XenDesktop solution deployed for VDI.
The Citrix VDAs run on Ubuntu (or another supported distro).
Azure network security groups will apply the correct network ACLs.
Citrix ADC (NetScaler) will publish and load balance all the Citrix services.
Active Directory Domain Services will be used to domain join the Citrix servers. VDA servers will not be domain
joined.
Azure Hybrid File Sync will enable shared storage across the solution. For example, it can be used in
remote/home solutions.
For this scenario, the following SKUs are used:
Citrix ADC (NetScaler): 2 x D4sv3 with NetScaler 12.0 VPX Standard Edition 200 MBPS PAYG image
Citrix License Server: 1 x D2s v3
Citrix VDA: 4 x D8s v3
Citrix Storefront: 2 x D2s v3
Citrix Delivery Controller: 2 x D2s v3
Domain Controllers: 2 x D2sv3
Azure File Servers: 2 x D2sv3

NOTE
All the licenses (other than NetScaler) are bring-your-own-license (BYOL)

Components
Azure Virtual Network allows resources such as VMs to securely communicate with each other, the internet, and
on-premises networks. Virtual networks provide isolation and segmentation, filter and route traffic, and allow
connection between locations. One virtual network will be used for all resources in this scenario.
Network security groups contain a list of security rules that allow or deny inbound or outbound network traffic
based on source or destination IP address, port, and protocol. The virtual networks in this scenario are secured
with network security group rules that restrict the flow of traffic between the application components.
Azure Load Balancer distributes inbound traffic according to rules and health probes. A load balancer provides
low latency and high throughput, and scales up to millions of flows for all TCP and UDP applications. An internal
load balancer is used in this scenario to distribute traffic on the Citrix NetScaler.
Azure Hybrid File Sync will be used for all shared storage. The storage will replicate to two file servers using
Hybrid File Sync.
Azure SQL Database is a managed relational database service based on the latest stable version of the Microsoft
SQL Server Database Engine. In this example, it is used to host Citrix databases.
ExpressRoute lets you extend your on-premises networks into the Microsoft cloud over a private connection
facilitated by a connectivity provider.
Active Directory Domain Services is used for Directory Services and user authentication.
Azure Availability Sets will ensure that the VMs you deploy on Azure are distributed across multiple isolated
hardware nodes in a cluster. Doing this ensures that if a hardware or software failure within Azure happens, only
a subset of your VMs are affected and that your overall solution remains available and operational.
Citrix ADC (NetScaler) is an application delivery controller that performs application-specific traffic analysis to
intelligently distribute, optimize, and secure Layer 4-Layer 7 (L4–L7) network traffic for web applications.
Citrix Storefront is an enterprise app store that improves security and simplifies deployments, delivering a
modern, unmatched near-native user experience across Citrix Receiver on any platform. StoreFront makes it
easy to manage multi-site and multi-version Citrix Virtual Apps and Desktops environments.
Citrix License Server will manage the licenses for Citrix products.
Citrix XenDesktops VDA enables connections to applications and desktops. The VDA is installed on the machine
that runs the applications or virtual desktops for the user. It enables the machines to register with Delivery
Controllers and manage the High Definition eXperience (HDX) connection to a user device.
Citrix Delivery Controller is the server-side component responsible for managing user access, plus brokering
and optimizing connections. Controllers also provide the Machine Creation Services that create desktop and
server images.
Alternatives
There are multiple partners with VDI solutions supported in Azure, such as VMware, Workspot, and others.
This specific sample architecture is based on a deployed project that used Citrix.
Citrix provides a cloud service that abstracts part of this architecture. It could be an alternative for this solution.
For more information, see Citrix Cloud.

Considerations
Check the Citrix Linux Requirements.
Latency can have an impact on the overall solution. For a production environment, test accordingly.
Depending on the scenario, the solution may need VMs with GPUs for VDA. For this solution, it is assumed that
GPU is not a requirement.
Availability, Scalability, and Security
This example is designed for high availability for all roles other than the licensing server. Because the
environment continues to function during a 30-day grace period if the license server is offline, no additional
redundancy is required on that server.
All servers providing similar roles should be deployed in Availability Sets.
This example scenario does not include Disaster Recovery capabilities. Azure Site Recovery could be a good add-
on to this design.
Consider deploying the VM instances in this scenario across Availability Zones. Each availability zone is made up
of one or more datacenters equipped with independent power, cooling, and networking. Each enabled region
has a minimum of three availability zones. This distribution of VM instances across zones provides high
availability to the application tiers. For more information, see What are Availability Zones in Azure?. You can also
deploy VPN and ExpressRoute gateways in Azure Availability Zones.
For a production deployment, a management solution should be implemented that includes backup, monitoring,
and update management.
This example should work for about 250 concurrent users (about 50-60 per VDA server) with mixed usage.
However, that will greatly depend on the type of applications being used. For production use, rigorous load testing
should be performed.

Deployment
For deployment information, see the official Citrix documentation.

Pricing
The Citrix XenDesktop licenses are not included in Azure service charges.
The Citrix NetScaler license is included in a pay-as-you-go model.
Using reserved instances will greatly reduce the compute cost for the solution.
The ExpressRoute cost is not included.

Next steps
Check Citrix documentation for planning and deployment here.
To deploy Citrix ADC (NetScaler) in Azure, review the Resource Manager templates provided by Citrix here.
Move Azure resources across regions
11/2/2020 • 7 minutes to read • Edit Online

With the growth of Microsoft Azure and its evolving set of regions worldwide, customers have a need to move
deployments from one region to another. Moving applications across regions is an activity that demands a well
thought-out plan, to ensure you move all resources seamlessly, and that applications are up and running in the new
region with minimal downtime.
The recommendations and architecture in this example provide guidance on efficiently, securely, and seamlessly
moving Azure resources across regions.

Use cases
Some of the top reasons for moving resources to a different region include the following cases:
Align to a region launch: Move resources to a newly introduced Azure region that wasn't previously available.
Align for services or features: Move resources to take advantage of services or features that are available in a
specific region.
Respond to business developments: Move resources to a region in response to business changes, such as
mergers or acquisitions.
Align for proximity: Move resources to a region local to your business.

Architecture

On-premises data center network : A private local-area network running within an organization to
support the on-premises resources.
ExpressRoute circuit : ExpressRoute connections use a private, dedicated connection through a third-party
connectivity provider. The private connection extends an on-premises network into Azure.
Local edge routers : Routers that connect the on-premises network to the circuit managed by the third-
party provider. Depending on how you've provisioned your connection, you may need to provide the public
IP addresses that the routers use.
Microsoft edge routers : Two routers in an active-active highly available configuration. These routers
enable a third-party connectivity provider to connect their circuits directly to their datacenter. Depending on
how you've provisioned your connection, you may need to provide the public IP addresses that the routers
use.
VPN Gateway : By using the VPN gateway service, you can connect the virtual network to the on-premises
network.
Active Directory subnet : Most enterprise organizations have an Active Directory Domain Services (AD
DS) environment in their on-premises datacenter. To facilitate management of assets moved to Azure from
your on-premises network that depends on AD DS, consider hosting AD DS domain controllers in Azure in a
central Virtual Network (VNET) hub that dependent workloads can access.
VNET Peering : Multiple VNETs with peering between them. VNET peering allows for group applications in
respective virtual networks, and provides a low latency high-bandwidth connection.
Multitiered web applications running in the cloud environment : This example architecture is
applicable to many industries that need to deploy resilient multitier applications in the cloud. In this
scenario, the application consists of three layers:
Web tier : The top layer, including the user interface. This layer parses user interactions and passes the
actions to the next layer for processing.
Business or App tier : Processes the user interactions and makes logical decisions about the next steps.
This layer connects the web tier and the data tier.
Data tier : Stores the application data. In this case, a SQL database stores the data.
Internal load balancer : Network traffic from the VPN gateway is routed to the cloud application through
an internal load balancer (ILB) endpoint located in the subnet of application tiers.
Platform as a Service (PaaS) resources : In this example environment, there are a few PaaS services
such as Azure IoT Hub, Azure Key Vault, and Azure App Service.

Components
The example architecture uses the following components:
Azure ExpressRoute
Azure VPN Gateway
Azure Active Directory Domain Services
Azure IoT Hub
Azure Key Vault
Azure App Service
Cognitive services
Azure Automation

Steps to move resources across regions


Since your requirements might differ from the example architecture, use the following recommendations as a
starting point:
1. Verify the prerequisites for the move.
Planned downtime: Plan the region move as a maintenance activity with scheduled downtime to
minimize customer impact.
Azure subscription limits and quota: Ensure that your subscription has enough resources to support
the specific resource type. For example, ensure the target region supports virtual machines with sizes
that match the virtual machines in your source region. Contact support to enable the required quota
if needed. To learn more, see Azure subscription and service limits, quotas, and constraints.
Account permissions: If you created a free Azure account, you're the administrator of your
subscription. If you're not the subscription administrator, work with the administrator to assign the
permissions that you need to move the resources. Verify that your Azure subscription allows you to
create the necessary resource in the target region.
Resource identification: Identify and categorize your resources based on the type of resource needed
to export an Azure Resource Manager (ARM) template or to start replication using various
technologies. For each of the resource types you want to move, the steps may be different. Refer to
Moving Azure resources across regions to identify the corresponding steps for each of the resource
types.
2. Move the networking components (example commands for the template export approach follow this list).
Network Security Groups: You can't move network security groups from one region to another.
However, you can use an ARM template to export the existing configuration and security rules of a
network security group. You can then stage the resource in another region by exporting the group to
a template, modifying the parameters to match the destination region, and then deploying the
template to the new region.
Virtual Network: You can use an Azure Resource Manager template to complete the move of the
virtual network to another region. Export the virtual network to a template, modify the parameters to
match the destination region, and then deploy the template to the new region.
Virtual network peering: VNET peering won't be recreated, and VNET peers will fail if they're still
present in the template. Before you export the template, remove any VNET peers. You can reestablish
them after the virtual network move.
Public IP addresses: Since Azure public IP addresses are region-specific, you can't move them from one
region to another. However, you can use an ARM template to export the existing configuration of a public
IP address. You can then stage the resource in another region by exporting it to a template, modifying the
parameters to match the destination region, and then deploying the template to the new region.
3. Move the app components.
Virtual machine resources: Moving virtual machine resources requires Azure Site Recovery to move
the resources by replicating them in the target region. To understand more about a virtual machine
move using Site Recovery, see Move Azure VMs to another region.
SQL resources: SQL databases make use of the SQL Failover Group mechanism to move across
regions. To learn more about moving SQL databases, see Move resources to new region - Azure SQL
Database & Azure SQL Managed Instance.
4. Move the PaaS services: PaaS services have their own specific steps for orchestrating the move. To find the
latest information on the list of supported services, see Support for moving Azure resources across regions.
5. Move the on-premises infrastructure: To ensure you have the full source region recreated on the target
region, re-establish and configure your on-premises components as they were before and connect them to
the Azure network.
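
The following Azure CLI commands are a minimal sketch of the export-and-redeploy pattern referenced in step 2.
The resource group names and location are hypothetical, and the exported template usually needs manual edits
(for example, removing VNET peerings and updating region-specific values) before it is deployed to the target
region.

# Export the source resource group's configuration to an ARM template.
az group export --name source-rg > template.json

# After editing the template for the destination region, deploy it to a new resource group.
az group create --name target-rg --location westus2
az deployment group create \
  --resource-group target-rg \
  --template-file template.json
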

Considerations
Consider the following points when making a cross-regional move:
Your plan for migrating across regions must take into account complex infrastructure. Modern infrastructure
environments often span from on-premises infrastructure to the cloud. Some even have an additional level
of complexity, with a multi-cloud strategy containing private or public deployments.
Move resource types together. By combining the move of similar resource types (for example, 50 virtual
machines or 20 SQL databases), you can plan the preparation step of your move more easily and ensure
that long-running operations complete together, which helps reduce downtime.
Move all resources within an application together. You can select the resources of an application and try to
move them together in a set, to ensure that you are able to bring up the app on the target region in an
orchestrated manner.
Ensure that you cover your capacity needs. Verify that enough capacity or quota is available in the target
region to support current and potential business growth before the actual move.
There should be minimal or no impact to current business-critical applications or infrastructure in the source
region while the move is in progress.
To ensure business continuity, you should have a functional environment up and running on the target
region with the least possible downtime.
The capability to validate the migration before the final commit to the target side is critical, especially if you
support Tier 0, Tier 1 workloads in the Financial Services Industry (FSI) or health care verticals.
Ensure you do due diligence by testing and validating the original configurations, connectivity, proper
security configuration, policies, data replication, and database connections before you commit the move to
the target region.
After you move the resources to the target, ensure the final configuration is up and running by making final
changes like the following:
Change the DNS configuration to point to a new IP.
Delete resources in the source region to avoid double billing and to prevent issues due to the existence of
two separate data sets that overlap in scope and configuration.
Delete any auxiliary resources created for the move. For example, delete any storage accounts that were
used for intermediate transfer.

Next steps
Moving Azure resources across regions

Related resources
Support for moving Azure resources across regions
Move Azure VMs to another region
Move resources to new region - Azure SQL Database & Azure SQL Managed Instance
Run a Linux virtual machine on Azure
11/2/2020 • 9 minutes to read • Edit Online

Provisioning a virtual machine (VM) in Azure requires some additional components besides the VM itself,
including networking and storage resources. This article shows best practices for running a Linux VM on Azure.

Resource group
A resource group is a logical container that holds related Azure resources. In general, group resources based on
their lifetime and who will manage them.
Put closely associated resources that share the same lifecycle into the same resource group. Resource groups
allow you to deploy and monitor resources as a group and track billing costs by resource group. You can also
delete resources as a set, which is useful for test deployments. Assign meaningful resource names to simplify
locating a specific resource and understanding its role. For more information, see Recommended Naming
Conventions for Azure Resources.

Virtual machine
You can provision a VM from a list of published images, or from a custom managed image or virtual hard disk
(VHD) file uploaded to Azure Blob storage. Azure supports running various popular Linux distributions, including
CentOS, Debian, Red Hat Enterprise, Ubuntu, and FreeBSD. For more information, see Azure and Linux.
Azure offers many different virtual machine sizes. For more information, see Sizes for virtual machines in Azure. If
you are moving an existing workload to Azure, start with the VM size that's the closest match to your on-premises
servers. Then measure the performance of your actual workload in terms of CPU, memory, and disk input/output
operations per second (IOPS), and adjust the size as needed.
Generally, choose an Azure region that is closest to your internal users or customers. Not all VM sizes are available
in all regions. For more information, see Services by region. For a list of the VM sizes available in a specific region,
run the following command from the Azure CLI:

az vm list-sizes --location <location>

For information about choosing a published VM image, see Find Linux VM images.

Disks
For best disk I/O performance, we recommend Premium Storage, which stores data on solid-state drives (SSDs).
Cost is based on the capacity of the provisioned disk. IOPS and throughput (that is, data transfer rate) also depend
on disk size, so when you provision a disk, consider all three factors (capacity, IOPS, and throughput).
We also recommend using Managed Disks. Managed disks simplify disk management by handling the storage for
you. Managed disks do not require a storage account. You simply specify the size and type of disk and it is
deployed as a highly available resource.
The OS disk is a VHD stored in Azure Storage, so it persists even when the host machine is down. For Linux VMs,
the OS disk is /dev/sda1 . We also recommend creating one or more data disks, which are persistent VHDs used
for application data.
When you create a VHD, it is unformatted. Log into the VM to format the disk. In the Linux shell, data disks are
displayed as /dev/sdc , /dev/sdd , and so on. You can run lsblk to list the block devices, including the disks. To
use a data disk, create a partition and file system, and mount the disk. For example:

# Create a partition.
sudo fdisk /dev/sdc # Enter 'n' to partition, 'w' to write the change.

# Create a file system.
sudo mkfs -t ext3 /dev/sdc1

# Mount the drive.
sudo mkdir /data1
sudo mount /dev/sdc1 /data1

When you add a data disk, a logical unit number (LUN) ID is assigned to the disk. Optionally, you can specify the
LUN ID — for example, if you're replacing a disk and want to retain the same LUN ID, or you have an application
that looks for a specific LUN ID. However, remember that LUN IDs must be unique for each disk.
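For example, the following Azure CLI command (with hypothetical names) attaches a new data disk at a specific
LUN:

# Attach a new 128-GiB Premium SSD data disk at LUN 2.
az vm disk attach \
  --resource-group my-rg \
  --vm-name my-linux-vm \
  --name data-disk-01 \
  --new --size-gb 128 \
  --sku Premium_LRS \
  --lun 2
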
You may want to change the I/O scheduler to optimize for performance on SSDs because the disks for VMs with
premium storage accounts are SSDs. A common recommendation is to use the NOOP scheduler for SSDs, but you
should use a tool such as iostat to monitor disk I/O performance for your workload.
The VM is created with a temporary disk. This disk is stored on a physical drive on the host machine. It is not saved
in Azure Storage and may be deleted during reboots and other VM lifecycle events. Use this disk only for
temporary data, such as page or swap files. For Linux VMs, the temporary disk is /dev/sdb1 and is mounted at
/mnt/resource or /mnt .

Network
The networking components include the following resources:
Virtual network . Every VM is deployed into a virtual network that can be segmented into multiple
subnets.
Network interface (NIC) . The NIC enables the VM to communicate with the virtual network. If you need
multiple NICs for your VM, be aware that a maximum number of NICs is defined for each VM size.
Public IP address . A public IP address is needed to communicate with the VM — for example, via remote
desktop (RDP). The public IP address can be dynamic or static. The default is dynamic.
Reserve a static IP address if you need a fixed IP address that won't change — for example, if you need to
create a DNS 'A' record or add the IP address to a safe list.
You can also create a fully qualified domain name (FQDN) for the IP address. You can then register a
CNAME record in DNS that points to the FQDN. For more information, see Create a fully qualified domain
name in the Azure portal.
Network security group (NSG) . Network security groups are used to allow or deny network traffic to
VMs. NSGs can be associated either with subnets or with individual VM instances.
All NSGs contain a set of default rules, including a rule that blocks all inbound Internet traffic. The default rules
cannot be deleted, but other rules can override them. To enable Internet traffic, create rules that allow inbound
traffic to specific ports — for example, port 80 for HTTP. To enable SSH, add an NSG rule that allows inbound traffic
to TCP port 22.
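For example, an inbound SSH rule on an existing network security group might look like the following (resource
names are hypothetical):

# Allow inbound SSH (TCP 22) on the network security group.
az network nsg rule create \
  --resource-group my-rg \
  --nsg-name my-nsg \
  --name allow-ssh \
  --priority 1000 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --destination-port-ranges 22
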

Operations
SSH . Before you create a Linux VM, generate a 2048-bit RSA public-private key pair. Use the public key file when
you create the VM. For more information, see How to Use SSH with Linux and Mac on Azure.
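For example, a key pair can be generated and supplied at creation time as follows; the key path and VM names are
only illustrative:

# Generate a 2048-bit RSA key pair.
ssh-keygen -t rsa -b 2048 -f ~/.ssh/azure_vm_key

# Supply the public key when creating the VM (other options omitted for brevity).
az vm create --resource-group my-rg --name my-linux-vm \
  --image UbuntuLTS --admin-username azureuser \
  --ssh-key-values ~/.ssh/azure_vm_key.pub
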
Diagnostics . Enable monitoring and diagnostics, including basic health metrics, diagnostics infrastructure logs,
and boot diagnostics. Boot diagnostics can help you diagnose boot failure if your VM gets into a non-bootable
state. Create an Azure Storage account to store the logs. A standard locally redundant storage (LRS) account is
sufficient for diagnostic logs. For more information, see Enable monitoring and diagnostics.
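For example, boot diagnostics can be enabled with a command like the following; the storage account endpoint is
hypothetical:

# Enable boot diagnostics, writing logs to an existing standard storage account.
az vm boot-diagnostics enable \
  --resource-group my-rg \
  --name my-linux-vm \
  --storage https://mydiagstorage.blob.core.windows.net/
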
Availability . Your VM may be affected by planned maintenance or unplanned downtime. You can use VM reboot
logs to determine whether a VM reboot was caused by planned maintenance. For higher availability, deploy
multiple VMs in an availability set. This configuration provides a higher service level agreement (SLA).
Backups . To protect against accidental data loss, use the Azure Backup service to back up your VMs to geo-
redundant storage. Azure Backup provides application-consistent backups.
Stopping a VM . Azure makes a distinction between "stopped" and "deallocated" states. You are charged when the
VM status is stopped, but not when the VM is deallocated. In the Azure portal, the Stop button deallocates the VM.
If you shut down through the OS while logged in, the VM is stopped but not deallocated, so you will still be
charged.
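The distinction also shows up in the Azure CLI (VM and group names are hypothetical): az vm stop powers off the
guest OS but keeps the compute allocation, while az vm deallocate releases it.

# Power off the VM but keep its compute allocation (compute charges continue).
az vm stop --resource-group my-rg --name my-linux-vm

# Deallocate the VM so that only storage charges apply.
az vm deallocate --resource-group my-rg --name my-linux-vm
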
Deleting a VM . If you delete a VM, the VHDs are not deleted. That means you can safely delete the VM without
losing data. However, you will still be charged for storage. To delete the VHD, delete the file from Blob storage. To
prevent accidental deletion, use a resource lock to lock the entire resource group or lock individual resources, such
as a VM.
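For example, a delete lock can be applied at the resource group level (names are hypothetical):

# Prevent accidental deletion of everything in the resource group.
az lock create \
  --name do-not-delete \
  --lock-type CanNotDelete \
  --resource-group my-rg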

Cost considerations
There are various options for VM sizes depending on the usage and workload. The range spans from the most
economical option, the Bs-series, to the newest GPU VMs optimized for machine learning. For information about
the available options, see Azure Linux VM pricing.
For workloads with no predictable time of completion or resource consumption, consider the Pay as you go
option.
Consider using Azure Reservations if you can commit to using a virtual machine over a one-year or three-year
term. VM reservations can reduce costs by up to 72% compared to pay-as-you-go prices.
Use Azure Spot VMs to run workloads that can be interrupted and do not require completion within a
predetermined timeframe or an SLA. Azure deploys Spot VMs if there is available capacity and evicts them when it
needs the capacity back. Costs associated with Spot virtual machines are significantly lower. Consider Spot VMs
for these workloads:
High-performance computing scenarios, batch processing jobs, or visual rendering applications.
Test environments, including continuous integration and continuous delivery workloads.
Large-scale stateless applications.
Use the Azure Pricing Calculator to estimate costs.
For more information, see the cost section in Microsoft Azure Well-Architected Framework.

Security considerations
Use Azure Security Center to get a central view of the security state of your Azure resources. Security Center
monitors potential security issues and provides a comprehensive picture of the security health of your
deployment. Security Center is configured per Azure subscription. Enable security data collection as described in
Onboard your Azure subscription to Security Center Standard. When data collection is enabled, Security Center
automatically scans any VMs created under that subscription.
Patch management . If enabled, Security Center checks whether any security and critical updates are missing. Use
your Linux distribution's update mechanism to enable automatic system updates.
Antimalware . If enabled, Security Center checks whether antimalware software is installed. You can also use
Security Center to install antimalware software from inside the Azure portal.
Access control . Use role-based access control (RBAC) to control access to Azure resources. RBAC lets you assign
authorization roles to members of your DevOps team. For example, the Reader role can view Azure resources but
not create, manage, or delete them. Some permissions are specific to an Azure resource type. For example, the
Virtual Machine Contributor role can restart or deallocate a VM, reset the administrator password, create a new
VM, and so on. Other built-in RBAC roles that may be useful for this architecture include DevTest Labs User and
Network Contributor.
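
For example, a role can be granted at resource group scope with the Azure CLI; the user and resource group names
are hypothetical:

# Grant a user the Virtual Machine Contributor role on a resource group.
az role assignment create \
  --assignee devops-user@contoso.com \
  --role "Virtual Machine Contributor" \
  --resource-group my-rg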

NOTE
RBAC does not limit the actions that a user logged into a VM can perform. Those permissions are determined by the
account type on the guest OS.

Audit logs . Use audit logs to see provisioning actions and other VM events.
Data encryption . Use Azure Disk Encryption if you need to encrypt the OS and data disks.
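As a sketch, Azure Disk Encryption can be enabled with a command like the following, assuming a key vault already
exists in the same region as the VM (names are hypothetical):

# Encrypt the OS and data disks using an existing key vault in the same region.
az vm encryption enable \
  --resource-group my-rg \
  --name my-linux-vm \
  --disk-encryption-keyvault my-keyvault \
  --volume-type ALL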

DevOps considerations
Use a single Azure Resource Manager template for provisioning the Azure resources and their dependencies. Since
all the resources are in the same virtual network, they are isolated in the same basic workload, which makes it easier
to associate the workload's specific resources with a DevOps team, so that the team can independently manage all
aspects of those resources. This isolation enables the DevOps team to perform continuous integration and
continuous delivery (CI/CD).
Also, you can use different Azure Resource Manager templates and integrate them with Azure DevOps Services to
provision different environments in minutes, for example to replicate production-like scenarios or load-testing
environments only when needed, saving cost.
For a higher-availability architecture, see Linux N-tier application in Azure with Apache Cassandra. That reference
architecture includes more than one VM, and each VM is included in an availability set.
Consider using Azure Monitor to analyze and optimize the performance of your infrastructure, and to monitor and
diagnose networking issues without logging into your virtual machines.
For more information, see the Operational excellence section in Microsoft Azure Well-Architected Framework.

Next steps
To provision a Linux VM, see Create and Manage Linux VMs with the Azure CLI
For a complete N-tier architecture on Linux VMs, see Linux N-tier application in Azure with Apache Cassandra.
Run a Windows virtual machine on Azure
11/2/2020 • 8 minutes to read • Edit Online

Provisioning a virtual machine (VM) in Azure requires some additional components besides the VM itself,
including networking and storage resources. This article shows best practices for running a Windows VM on
Azure.

Resource group
A resource group is a logical container that holds related Azure resources. In general, group resources based on
their lifetime and who will manage them.
Put closely associated resources that share the same lifecycle into the same resource group. Resource groups allow
you to deploy and monitor resources as a group and track billing costs by resource group. You can also delete
resources as a set, which is very useful for test deployments. Assign meaningful resource names to simplify
locating a specific resource and understanding its role. For more information, see Recommended Naming
Conventions for Azure Resources.

Virtual machine
You can provision a VM from a list of published images, or from a custom managed image or virtual hard disk
(VHD) file uploaded to Azure Blob storage.
Azure offers many different virtual machine sizes. For more information, see Sizes for virtual machines in Azure. If
you are moving an existing workload to Azure, start with the VM size that's the closest match to your on-premises
servers. Then measure the performance of your actual workload in terms of CPU, memory, and disk input/output
operations per second (IOPS), and adjust the size as needed.
Generally, choose an Azure region that is closest to your internal users or customers. Not all VM sizes are available
in all regions. For more information, see Services by region. For a list of the VM sizes available in a specific region,
run the following command from the Azure CLI:

az vm list-sizes --location <location>

For information about choosing a published VM image, see Find Windows VM images.

Disks
For best disk I/O performance, we recommend Premium Storage, which stores data on solid-state drives (SSDs).
Cost is based on the capacity of the provisioned disk. IOPS and throughput also depend on disk size, so when you
provision a disk, consider all three factors (capacity, IOPS, and throughput).
We also recommend using Managed Disks. Managed disks simplify disk management by handling the storage for
you. Managed disks do not require a storage account. You simply specify the size and type of disk and it is
deployed as a highly available resource.
The OS disk is a VHD stored in Azure Storage, so it persists even when the host machine is down. We also
recommend creating one or more data disks, which are persistent VHDs used for application data. When possible,
install applications on a data disk, not the OS disk. Some legacy applications might need to install components on
the C: drive; in that case, you can resize the OS disk using PowerShell.
The VM is also created with a temporary disk (the D: drive on Windows). This disk is stored on a physical drive on
the host machine. It is not saved in Azure Storage and may be deleted during reboots and other VM lifecycle
events. Use this disk only for temporary data, such as page or swap files.

Network
The networking components include the following resources:
Virtual network . Every VM is deployed into a virtual network that can be segmented into multiple
subnets.
Network interface (NIC) . The NIC enables the VM to communicate with the virtual network. If you need
multiple NICs for your VM, be aware that a maximum number of NICs is defined for each VM size.
Public IP address . A public IP address is needed to communicate with the VM — for example, via remote
desktop (RDP). The public IP address can be dynamic or static. The default is dynamic.
Reserve a static IP address if you need a fixed IP address that won't change — for example, if you need to
create a DNS 'A' record or add the IP address to a safe list.
You can also create a fully qualified domain name (FQDN) for the IP address. You can then register a CNAME
record in DNS that points to the FQDN. For more information, see Create a fully qualified domain name in
the Azure portal.
Network security group (NSG) . Network security groups are used to allow or deny network traffic to
VMs. NSGs can be associated either with subnets or with individual VM instances.
All NSGs contain a set of default rules, including a rule that blocks all inbound Internet traffic. The default rules
cannot be deleted, but other rules can override them. To enable Internet traffic, create rules that allow inbound
traffic to specific ports — for example, port 80 for HTTP. To enable RDP, add an NSG rule that allows inbound traffic
to TCP port 3389.
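As a sketch of these settings, the following Azure CLI commands reserve a static public IP address and add an NSG rule that allows inbound RDP. All resource names are hypothetical placeholders.

# Reserve a static public IP address for the VM.
az network public-ip create \
  --resource-group my-rg \
  --name my-vm-pip \
  --allocation-method Static

# Allow inbound RDP (TCP 3389) on the VM's network security group.
az network nsg rule create \
  --resource-group my-rg \
  --nsg-name my-vm-nsg \
  --name allow-rdp \
  --priority 1000 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --destination-port-ranges 3389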

Operations
Diagnostics. Enable monitoring and diagnostics, including basic health metrics, diagnostics infrastructure logs,
and boot diagnostics. Boot diagnostics can help you diagnose boot failure if your VM gets into a non-bootable
state. Create an Azure Storage account to store the logs. A standard locally redundant storage (LRS) account is
sufficient for diagnostic logs. For more information, see Enable monitoring and diagnostics.
Availability. Your VM may be affected by planned maintenance or unplanned downtime. You can use VM reboot
logs to determine whether a VM reboot was caused by planned maintenance. For higher availability, deploy
multiple VMs in an availability set. This configuration provides a higher service level agreement (SLA).
Backups. To protect against accidental data loss, use the Azure Backup service to back up your VMs to geo-
redundant storage. Azure Backup provides application-consistent backups.
Stopping a VM. Azure makes a distinction between "stopped" and "deallocated" states. You are charged when the
VM status is stopped, but not when the VM is deallocated. In the Azure portal, the Stop button deallocates the VM.
If you shut down through the OS while logged in, the VM is stopped but not deallocated, so you will still be
charged.
Deleting a VM. If you delete a VM, the VHDs are not deleted. That means you can safely delete the VM without
losing data. However, you will still be charged for storage. To delete the VHD, delete the file from Blob storage. To
prevent accidental deletion, use a resource lock to lock the entire resource group or lock individual resources, such
as a VM.
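The following Azure CLI sketch shows how you might deallocate a VM so that compute charges stop, and how to place a delete lock on the resource group. Names are hypothetical placeholders.

# Deallocate the VM; compute charges stop while it remains deallocated.
az vm deallocate --resource-group my-rg --name my-vm

# Prevent accidental deletion of everything in the resource group.
az lock create \
  --name lock-my-rg \
  --resource-group my-rg \
  --lock-type CanNotDelete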

Cost considerations
There are various options for VM sizes depending on the usage and workload. The range extends from the most
economical option, the Bs-series, to the newest GPU VMs optimized for machine learning. For information about
the available options, see Azure Windows VM pricing.
For workloads with no predictable time of completion or resource consumption, consider the Pay as you go
option.
Consider using Azure Reservations if you can commit to using a virtual machine over a one-year or three-year
term. VM reservations can reduce costs by up to 72% compared to pay-as-you-go prices.
Use Azure Spot VMs to run workloads that can be interrupted and do not require completion within a
predetermined timeframe or an SLA. Azure deploys Spot VMs if there is available capacity and evicts them when it
needs the capacity back. Costs associated with Spot virtual machines are significantly lower. Consider Spot VMs for
these workloads (a provisioning sketch follows the list):
High-performance computing scenarios, batch processing jobs, or visual rendering applications.
Test environments, including continuous integration and continuous delivery workloads.
Large-scale stateless applications.
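The sketch below shows one way to create a Spot VM with the Azure CLI, assuming hypothetical resource and image names; the eviction policy and maximum price shown are examples, not recommendations.

# Create a Spot VM that is deallocated (not deleted) on eviction,
# capped at the pay-as-you-go price (-1 means no cap below that price).
# For Windows images, the CLI prompts for an admin password if one isn't supplied.
az vm create \
  --resource-group my-rg \
  --name my-spot-vm \
  --image Win2019Datacenter \
  --priority Spot \
  --eviction-policy Deallocate \
  --max-price -1 \
  --admin-username azureuser
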
Use the Azure Pricing Calculator to estimate costs.
For more information, see the cost section in Microsoft Azure Well-Architected Framework.

Security considerations
Use Azure Security Center to get a central view of the security state of your Azure resources. Security Center
monitors potential security issues and provides a comprehensive picture of the security health of your
deployment. Security Center is configured per Azure subscription. Enable security data collection as described in
Onboard your Azure subscription to Security Center Standard. When data collection is enabled, Security Center
automatically scans any VMs created under that subscription.
Patch management. If enabled, Security Center checks whether any security and critical updates are missing. Use
Group Policy settings on the VM to enable automatic system updates.
Antimalware. If enabled, Security Center checks whether antimalware software is installed. You can also use
Security Center to install antimalware software from inside the Azure portal.
Access control. Use role-based access control (RBAC) to control access to Azure resources. RBAC lets you assign
authorization roles to members of your DevOps team. For example, the Reader role can view Azure resources but
not create, manage, or delete them. Some permissions are specific to an Azure resource type. For example, the
Virtual Machine Contributor role can restart or deallocate a VM, reset the administrator password, create a new
VM, and so on. Other built-in RBAC roles that may be useful for this architecture include DevTest Labs User and
Network Contributor.

NOTE
RBAC does not limit the actions that a user logged into a VM can perform. Those permissions are determined by the
account type on the guest OS.

Audit logs. Use audit logs to see provisioning actions and other VM events.
Data encryption. Use Azure Disk Encryption if you need to encrypt the OS and data disks.
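As an illustration of RBAC assignment, the following Azure CLI sketch grants the Virtual Machine Contributor role to a hypothetical user at the scope of a hypothetical resource group.

# Grant the Virtual Machine Contributor role scoped to a single resource group.
az role assignment create \
  --assignee devops-user@contoso.com \
  --role "Virtual Machine Contributor" \
  --scope /subscriptions/<subscription-id>/resourceGroups/my-rg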

DevOps considerations
Use infrastructure as code (IaC), either a single Azure Resource Manager template for provisioning the Azure
resources (declarative approach) or a single PowerShell script (imperative approach). Because all the resources are
in the same virtual network, they are isolated in the same basic workload. That isolation makes it easier to
associate the workload's specific resources with a DevOps team, so that the team can independently manage all
aspects of those resources and perform continuous integration and continuous delivery (CI/CD).
You can also use different Azure Resource Manager templates and integrate them with Azure DevOps Services to
provision different environments in minutes, for example to replicate production-like scenarios or load-testing
environments only when needed, saving cost.
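For example, a single Azure CLI command can deploy an Azure Resource Manager template into a resource group; the template and parameter file names here are hypothetical.

# Deploy the workload's ARM template (declarative approach).
az deployment group create \
  --resource-group my-rg \
  --template-file azuredeploy.json \
  --parameters @azuredeploy.parameters.json
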
For a higher-availability architecture, see Windows N-tier application on Azure with SQL Server. That reference
architecture includes more than one VM, and each VM is included in an availability set.
Consider using Azure Monitor to analyze and optimize the performance of your infrastructure, and to monitor and
diagnose networking issues without logging into your virtual machines.
For more information, see the Operational Excellence section in Azure Well-Architected Framework.

Next steps
To provision a Windows VM, see Create and Manage Windows VMs with Azure PowerShell
For a complete N-tier architecture on Windows VMs, see Windows N-tier application on Azure with SQL Server.
Charon-SSP Solaris emulator on Azure VMs

This reference architecture shows how an emulator called Charon-SSP from the Microsoft partner, Stromasys, can
run SPARC processor-based Solaris virtual machines (VMs) in Azure. Charon-SSP is a member of the Charon cross-
platform hardware virtualization product family. The emulator can create virtual replicas of Sun-4m, Sun-4u, or
Sun-4v SPARC family members on standard x86-64 Linux physical computers or hypervisors.
Running applications in an emulator on Azure has several benefits, such as reduced operational costs and energy
consumption. You can also run multiple application instances on a single x86-64 standard host or existing
virtualization infrastructure, giving you the advantages of consolidation while easing legacy system management
and maintenance.

Use cases
Enable low-friction "lift and shift" of on-premises workloads running on SPARC Solaris machines into Azure.
Continue to use applications that run on end-of-life SPARCstation or SPARCserver, without changes.
Manage multiple server hosts and child Solaris VMs from a single interface.
Allow mapping to low-cost Azure storage to archive tapes for regulatory and compliance purposes.

Architecture

1. The Charon-SSP Director allows managing multiple server hosts, each potentially running one or more child
Solaris VMs. This setup provides a single place of management as you scale out your farm of host VMs and
their Solaris child VMs. Charon-SSP Manager provides an easy-to-use and intuitive graphical management
interface.
2. The Charon-SSP Agent runs on Linux distributions on Azure VMs. This component runs the child Solaris VMs
and emulates the SPARC processor architecture.
3. The child Solaris VMs are based on the SPARC processor architecture.
4. The child Solaris VMs each get their own Azure network interface, and therefore have their own dedicated
private IP addresses. Optionally, you can easily set up Azure public IP addresses on the same network interfaces.
5. The Solaris VM images can reside on the solid-state drive (SSD) managed disk of the host Azure VM. Azure
Ultra SSD managed disks are also a potential option for even higher input/output operations per second (IOPS).
6. Azure Storage Account file shares mounted on the Linux VM allow mapping of the Charon-SSP Virtual Tape
Manager to a locally mounted device, which is backed by an Azure Files storage account in the cloud. This
mapping allows for low-cost storage of archived tapes for regulatory and compliance purposes.
7. The management VM that runs Charon-SSP Director and Manager can be either Windows-based, or Linux-
based with a graphic user interface like GNOME.
8. End users can secure-shell (SSH) connect directly to the Solaris VMs, which have their own dedicated network
interface cards and IP addresses.
XDMCP is available for desktop access to the Solaris VMs. XDMCP isn't an encrypted protocol, so the
recommended topology for accessing a Solaris VM via XDMCP is to create a Windows Server VM in Azure as a
"hop" server, in which an XDMCP client such as MobaXterm can be installed. With this configuration, all network
traffic occurs over the private Azure virtual network.
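Because each Solaris child VM gets its own network interface, the host VM needs multiple NICs. A minimal Azure CLI sketch, using hypothetical names, is shown below; note that the host VM must be deallocated before a NIC can be added.

# Create an additional NIC in the spoke subnet for a Solaris child VM.
az network nic create \
  --resource-group my-rg \
  --name solaris-vm1-nic \
  --vnet-name my-vnet \
  --subnet my-subnet

# Attach the NIC to the deallocated Charon-SSP host VM, then restart it.
az vm deallocate --resource-group my-rg --name charon-host-vm
az vm nic add --resource-group my-rg --vm-name charon-host-vm --nics solaris-vm1-nic
az vm start --resource-group my-rg --name charon-host-vm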

Components
Azure VMs are on-demand, scalable computing resources in Azure. An Azure VM gives you the flexibility of
virtualization without having to buy and maintain physical hardware. Azure VMs give you a choice of
operating systems including Windows and Linux.
Azure Virtual Network is the fundamental building block for private networks in Azure. Virtual networks let
Azure resources like VMs securely communicate with each other, the internet, and on-premises networks.
Azure Virtual Network is similar to a traditional network in your own datacenter, but provides the additional
scale, availability, and isolation benefits of the Azure infrastructure.
Azure Virtual Network interface cards enable an Azure VM to communicate with internet, Azure, and on-
premises resources. As shown in this architecture, you can add additional network interface cards to the
same Azure VM, which allows the Solaris child VMs to have their own dedicated network interface devices
and IP addresses.
Azure SSD managed disks are block-level storage volumes managed by Azure that are used with Azure VMs.
The available types of disks are ultra disks, premium SSDs, standard SSDs, and standard hard disk drives
(HDDs). For this architecture, we recommend either Premium SSDs or Ultra Disk SSDs.
Azure Files storage accounts offer fully managed file shares in the cloud that are accessible via the industry
standard Server Message Block (SMB) protocol. Azure file shares can be mounted concurrently by cloud and
on-premises deployments of Windows, Linux, and macOS.
Azure ExpressRoute lets you extend your on-premises networks into the Microsoft cloud over a private
connection facilitated by a connectivity provider. With ExpressRoute, you can establish connections to
Microsoft cloud services like Microsoft Azure and Microsoft 365.
Stromasys Charon-SSP emulator recreates the SPARC virtual hardware layer on industry standard x86-64
computer systems and VMs. The virtual SPARC hardware layer is compatible with any Sun software
running on it, so there's no need for code conversion or source code. Charon-SSP is fully compatible with
SPARC storage, Ethernet, and serial I/O hardware.

Next steps
For more information, please contact [email protected].
See Charon-SSP on the Stromasys website.
Read the Charon-SSP Azure Setup Guide.
Migrate IBM applications with TmaxSoft OpenFrame

TmaxSoft OpenFrame is a rehosting solution that makes it easy to lift existing IBM zSeries mainframe applications
and shift them to Microsoft Azure. This lift-and-shift operation uses a no-code approach. TmaxSoft quickly migrates
an existing application, as is, to a zSeries mainframe emulation environment on Azure.
This reference architecture illustrates how the TmaxSoft OpenFrame solution runs on Azure. The approach consists
of two virtual machines (VMs) running Linux in an active-active configuration. An Azure Load Balancer distributes
incoming traffic between the VMs. OpenFrame emulation software runs on the VMs and provides a zSeries
runtime and facilities. Working with the OpenFrame software is an Azure SQL Database. This modernized database
layer includes built-in business continuity features.

Potential use cases


Many scenarios can benefit from TmaxSoft OpenFrame lift and shift. Possibilities include the following cases:
Businesses seeking to modernize infrastructure and escape the high costs, limitations, and rigidity associated
with mainframes.
Organizations opting to move IBM zSeries mainframe workloads to the cloud without the side effects of a
complete redevelopment.
IBM zSeries mainframe customers who need to migrate mission-critical applications while maintaining
continuity with other on-premises applications.
Teams looking for the horizontal and vertical scalability that Azure offers.
Businesses that favor solutions offering disaster recovery options.

Architecture
(Architecture diagram: two Linux VMs run the OpenFrame software components that migrate applications, handle
transaction processing, manage batch programs, and provide security. An Azure Load Balancer distributes incoming
traffic between the VMs, which communicate with an Azure file share and an Azure SQL Database. On-premises
users, Azure users, and disaster recovery services sit outside the virtual network boundary.)
1. On-premises users interact with OpenFrame applications by using 3270 WebTerminal, OFManager, and
OFStudio:
The web application 3270 WebTerminal runs in browsers. This app connects users with Customer
Information Control System (CICS) and Information Management System - Data Communications (IMS-
DC) applications. By providing access to these 3270 terminal screens, the 3270 WebTerminal app
eliminates the need for TN3270 terminal emulation software.
OFManager provides tools for executing, monitoring, and managing batch workloads. This web
application also monitors and manages datasets and security systems.
OFStudio provides an IDE for programming, debugging, and maintaining applications.
2. Azure ExpressRoute creates private connections between the on-premises infrastructure and Azure.
Transport Layer Security (TLS) connections that use port 443 provide access to web-based applications:
After migration, the web application presentation layer remains virtually unchanged. As a result, end
users require minimal retraining. Alternatively, the web application presentation layer can be updated to
align with UX goals.
Azure VM Bastion hosts work to maximize security. When giving administrators access to VMs, these
hosts minimize the number of open ports.
OpenFrame provides middleware integration. For instance, this functionality works with web services
and message queues (MQs).
3. The TmaxSoft solution uses two VMs. Each one runs an application server, and an Azure Load Balancer
manages incoming traffic. OpenFrame supports both active-active and active-passive configurations.
4. OpenFrame language compilers migrate COBOL, Assembler, PL/I, Easytrieve, and other mainframe
applications to Azure by recompiling the source.
5. OpenFrame Online provides tools and commands that replace CICS, IMS-DC, Application Development and
Maintenance (ADM), and Application Infrastructure and Middleware (AIM) technologies.
6. OpenFrame Batch provides tools for managing batch programs that replace the job entry subsystem (JES).
By supporting native Job Control Language (JCL) syntax and batch utilities, OpenFrame Batch minimizes
code updates.
7. Tmax Access Control Facility (TACF) Security provides authentication and authorization features in
OpenFrame by extracting and migrating mainframe security rules.
8. UnixODBC (Open Database Connectivity) connection drivers communicate with relational database
management systems (RDBMSs). Examples include Azure SQL Database, Microsoft SQL Server, Oracle, Db2
LUW, Tibero, Postgres, and MySQL.
9. Azure File Share is mounted on the Linux server VMs. As a result, COBOL programs have easy access to the
Azure Files repository for file processing. Load modules and various log files also use Azure File Share.
10. OpenFrame can integrate with any RDBMS. Examples include Azure SQL Database, SQL Server, Oracle, Db2
LUW, Tibero, Postgres, and MySQL. OpenFrame uses ODBC connection drivers to communicate with
installed databases.
11. Azure Site Recovery provides disaster recovery (DR) for the virtual machine components.
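As a sketch of the file-share step, the following Azure CLI commands create a premium storage account and an SMB file share that the Linux VMs can mount; the account and share names are hypothetical.

# Create a premium FileStorage account and a file share for load modules and log files.
az storage account create \
  --resource-group my-rg \
  --name openframestorage01 \
  --sku Premium_LRS \
  --kind FileStorage

az storage share-rm create \
  --resource-group my-rg \
  --storage-account openframestorage01 \
  --name openframe-share \
  --quota 1024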

Components
Azure ExpressRoute extends on-premises networks into the Microsoft cloud. By using a connectivity
provider, ExpressRoute establishes private connections to Microsoft cloud services like Microsoft Azure and
Microsoft 365.
Azure VM Bastion provides secure and seamless Remote Desktop Protocol (RDP) and Secure Shell (SSH)
connectivity to VMs in a network. Instead of using a public IP address, users connect to the VMs directly
from the Azure portal.
Azure Load Balancer operates at layer four of the Open Systems Interconnection (OSI) model. As the single
point of contact for clients, Load Balancer distributes inbound traffic to back-end pool instances. It directs
traffic according to configured load-balancing rules and health probes. The back-end pool instances can be
Azure VMs or instances in a virtual machine scale set.
Azure VMs are one of several types of on-demand, scalable computing resources that are available with
Azure. An Azure VM provides the flexibility of virtualization. But it eliminates the maintenance demands of
physical hardware. Azure VMs offer a choice of operating systems, including Windows and Linux.
Azure Virtual Networks are the fundamental building blocks for private networks in Azure. These networks
provide a way for many types of Azure resources, such as Azure VMs, to securely communicate with each
other, the internet, and on-premises networks. An Azure virtual network is like a traditional network
operating in a data center. But an Azure virtual network also provides scalability, availability, isolation, and
other benefits of Azure's infrastructure.
Azure Files Storage Accounts and Azure File Shares are fully managed file shares in the cloud. Azure file
shares are accessible via the industry standard Server Message Block (SMB) protocol. They can be mounted
concurrently by cloud or on-premises deployments. Windows, Linux, and macOS clients can access these file
shares.
Azure SQL Database is an intelligent, scalable relational database service built for the cloud. With AI-
powered, automated features, Azure SQL Database handles database management functions like upgrading,
patching, backups, and monitoring.
Azure Site Recovery provides replication, failover, and recovery processes to help keep applications running
during outages.

Next steps
Contact [email protected] for more information.
See TmaxSoft OpenFrame on Azure Marketplace.
Read how to install TmaxSoft OpenFrame on Azure.

Related resources
Mainframe rehosting on Azure virtual machines
Lift-and-Shift Me Up: The Benefits of Mainframe Rehosting
Lift, shift, and modernize: proven mainframe modernization strategies that enable digital transformation
Run reservoir simulation software on Azure

Reservoir simulation uses data-intensive computer models to predict complex flows of fluids such as oil, water, and
gas beneath the earth's surface. This example sets up reservoir simulation software on an Azure high-performance
computing (HPC) infrastructure. Azure makes it possible to run this type of workload with maximum performance,
scalability, and cost efficiency.
The architecture in this example supports OPM Flow, a popular open-source oil and gas reservoir simulation
package from the Open Porous Media (OPM) initiative. The OPM Flow software runs on Azure HPC virtual
machines (VMs) that deliver performance near or better than current on-premises infrastructures.
Users connect to a Linux head node VM to submit models to the HPC resources through PBS Pro 19.1 job-
scheduling software. The HPC resources run OPM Flow and send calculated results to a file share. In this example,
the file share is a 4-terabyte (TB) network file system (NFS) space on the head node VM. Depending on your model
and your input and output (I/O) requirements, you can use other storage options.
An Azure Windows VM running OPM ResInsight, an open-source visualization tool, accesses the file share to model
and visualize the calculated results. Users can connect to the VM via remote desktop protocol (RDP) to view the
visualizations.
Using an Azure VM spares the expense of a high-end visualization workstation. The OPM applications benefit from
HPC hardware and a shared storage location for the input and output files.

Relevant use cases


Do 3D reservoir modeling and visualization of seismic data.
Test INTERSECT, a high-resolution reservoir simulator from Schlumberger. You can see a sample INTERSECT
implementation on GitHub.
Test Nexus by Landmark-Halliburton using a similar setup on Azure.

Architecture
This diagram offers a high-level overview of the architecture used in the example. The workflow is as follows:
1. Users sign in to the head node via SSH to prepare their models for the compute resources.
2. PBS Pro 19.1 runs on the head node and schedules the jobs on the compute nodes.
3. OPM Flow runs on the compute nodes. The compute VMs are deployed as a virtual machine scale set, a
group of identical VMs that scale to meet the demands of the compute tasks.
4. OPM Flow sends calculated results to a file share on the head node. A premium disk is connected to the head
node and set up as an NFS server for the compute nodes and the visualization VM.
5. OPM ResInsight running on a Standard-NV6 Windows VM displays 3D visualizations of results. Users can
access the visualization VM through RDP.
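As one way to stand up the compute tier, the Azure CLI sketch below deploys a virtual machine scale set of HB-series nodes. The resource names, image, and instance count are hypothetical, and the GitHub implementation referenced in this article remains the authoritative deployment path.

# Deploy the OPM Flow compute nodes as a virtual machine scale set.
az vmss create \
  --resource-group my-rg \
  --name opmflow-vmss \
  --image OpenLogic:CentOS-HPC:7.7:latest \
  --vm-sku Standard_HB120rs_v2 \
  --instance-count 2 \
  --vnet-name my-vnet \
  --subnet compute-subnet \
  --admin-username hpcadmin \
  --generate-ssh-keys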

Considerations
This example uses the HB-series of high-performance VMs. The HB-series is optimized for applications driven by
memory bandwidth, such as computational fluid dynamics (CFD), and the Standard_HB120rs_v2 VM is the latest in
the series. For Intel-based hardware, the Standard_HC44rs VM is an option.
To test this OPM Flow architecture on Azure, the GitHub example implementation installs the Norne case, an open-
benchmark case of a real Norwegian Sea oil field. To run this test case, you must:
Use Azure Key Vault for storing keys and secrets, a requirement of the GitHub setup scripts.
Install the Linear Algebra PACKage (LAPACK) libraries on all the compute nodes. The GitHub installation
scripts include this step.
Install HP Remote Graphics Software (RGS) on any computer you want to use as a receiver for the
visualizations. In this example, a user connects to the visualization VM to run ResInsight and view the Norne
case.
Job scheduler
Compute-intensive workloads benefit from HPC orchestration software that can deploy and manage the HPC
compute and storage infrastructure. The example architecture includes two ways to deploy compute: the azurehpc
framework or Azure CycleCloud.
Azure CycleCloud is a tool for creating, managing, operating, and optimizing HPC and big compute clusters on
Azure. You can use it to dynamically provision Azure HPC clusters and orchestrate data and jobs for hybrid and
cloud workflows. Azure CycleCloud also supports several workload managers for your HPC workloads on Azure,
such as Grid Engine, HPC Pack, HTCondor, LSF, PBS Pro, Slurm, and Symphony.
Network
This example workload deploys the VMs within different subnets. For additional security, you can define network
security groups for each subnet. For example, you can set security rules that allow or deny network traffic to or
from the various nodes. If you don't need this level of security, you don't need separate subnets for this
implementation.
Storage
Data storage and access needs vary widely, depending on workload scale. Azure supports several approaches for
managing the speed and capacity of HPC applications. The azurehpc GitHub repository includes example Azure
HPC scripts.
The following approaches are common in the oil and gas industry. Choose the solution best suited to your unique
I/O and capacity requirements.
For low-scale workloads like the current example, consider running NFS on the head node, using a
storage-optimized Lsv2-series VM with large ephemeral disks, or D-series VMs with Azure Premium
Storage, depending on your requirements. This solution suits workloads with 500 cores or fewer, throughput
of up to 1.5 gigabytes per second (GiB/s), and up to 19 TB RAM and 100 TB storage.
Medium to large-scale read-intensive workloads: Consider using Avere vFXT for Azure (6 to 24
nodes). This solution works for workloads of up to 50,000 cores, throughput up to 2 GiB/s for writes and up
to 14 GiB/s for reads, a cache of up to 192 TB, and a file server of up to 2 petabytes (PB).
Balanced or write-intensive medium-scale workloads: Consider using Azure NetApp Files for
workloads of up to 4,000 cores, with a throughput up to 6.5 GiB/s, storage up to 100 TB/volume, and a
maximum file size of 12 TB.
Large-scale workloads: Use an orchestrated parallel file service, such as Lustre or BeeGFS. This approach
works for up to 50,000 cores, with read/write rates up to 50 GiB/s, and 500 TB storage. For even larger
clusters, a bare-metal approach may be more cost-effective. For example, Cray ClusterStor is a managed HPC
storage solution with the flexibility to support larger elastic clusters on the fly.

Deployment
Get an example implementation of this OPM Flow architecture on GitHub.

Next steps
Read the HPC: Oil and Gas in Azure blog.
Get an overview of Avere vFXT for Azure.
Review the Azure Storage performance and scalability checklist.
Explore an example workload for computer-aided engineering (CAE) on Azure.
Learn about HPC on Azure.
Running computational fluid dynamics (CFD)
simulations on Azure

Computational Fluid Dynamics (CFD) simulations require significant compute time along with specialized
hardware. As cluster usage increases, simulation times and overall grid use grow, leading to issues with spare
capacity and long queue times. Adding physical hardware can be expensive, and may not align to the usage peaks
and valleys that a business goes through. By taking advantage of Azure, many of these challenges can be overcome
with no capital expenditure.
Azure provides the hardware you need to run your CFD jobs on both GPU and CPU virtual machines. RDMA
(Remote Direct Memory Access) enabled VM sizes have FDR InfiniBand-based networking which allows for low
latency MPI (Message Passing Interface) communication. Combined with the Avere vFXT, which provides an
enterprise-scale clustered file system, customers can ensure maximum throughput for read operations in Azure.
To simplify the creation, management, and optimization of HPC clusters, Azure CycleCloud can be used to provision
clusters and orchestrate data in both hybrid and cloud scenarios. By monitoring the pending jobs, CycleCloud will
automatically launch on-demand compute, where you only pay for what you use, connected to the workload
scheduler of your choice.

Relevant use cases


Other relevant industries for CFD applications include:
Aeronautics
Automotive
Building HVAC
Oil and gas
Life sciences

Architecture
This diagram shows a high-level overview of a typical hybrid design providing job monitoring of the on-demand
nodes in Azure:
1. Connect to the Azure CycleCloud server to configure the cluster.
2. Configure and create the cluster head node, using RDMA enabled machines for MPI.
3. Add and configure the on-premises head node.
4. If there are insufficient resources, Azure CycleCloud will scale up (or down) compute resources in Azure. A
predetermined limit can be defined to prevent over allocation.
5. Tasks allocated to the execute nodes.
6. Data cached in Azure from on-premises NFS server.
7. Data read in from the Avere vFXT for Azure cache.
8. Job and task information relayed to the Azure CycleCloud server.
Components
Azure CycleCloud is a tool for creating, managing, operating, and optimizing HPC and Big Compute clusters in
Azure.
Avere vFXT on Azure is used to provide an enterprise-scale clustered file system built for the cloud.
Azure Virtual Machines (VMs) are used to create a static set of compute instances.
Virtual machine scale sets provide a group of identical VMs capable of being scaled up or down by Azure
CycleCloud.
Azure Storage accounts are used for synchronization and data retention.
Virtual Networks enable many types of Azure resources, such as Azure Virtual Machines (VMs), to securely
communicate with each other, the internet, and on-premises networks.
Alternatives
Customers can also use Azure CycleCloud to create a grid entirely in Azure. In this setup, the Azure CycleCloud
server is run within your Azure subscription.
For a modern application approach where management of a workload scheduler is not needed, Azure Batch can
help. Azure Batch can run large-scale parallel and high-performance computing (HPC) applications efficiently in the
cloud. Azure Batch allows you to define the Azure compute resources to execute your applications in parallel or at
scale without manually configuring or managing infrastructure. Azure Batch schedules compute-intensive tasks and
dynamically adds and removes compute resources based on your requirements.
Scalability and security
Scaling the execute nodes on Azure CycleCloud can be accomplished either manually or using autoscaling. For
more information, see CycleCloud Autoscaling.
For general guidance on designing secure solutions, see the Azure security documentation.

Deploy the scenario


Prerequisites
Follow these steps before deploying the Resource Manager template:
1. Create a service principal for retrieving the appId, displayName, name, password, and tenant.
2. Generate an SSH key pair to sign in securely to the CycleCloud server (a scripted sketch of steps 1 and 2 follows this list).
3. Click the link below to deploy the solution.

4. Log into the CycleCloud server to configure and create a new cluster.
5. Create a cluster.
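A scripted sketch of the first two prerequisites, using hypothetical names, might look like this:

# 1. Create a service principal; note the appId, password, and tenant values in the output.
az ad sp create-for-rbac --name cyclecloud-sp

# 2. Generate an SSH key pair for signing in to the CycleCloud server.
ssh-keygen -t rsa -b 4096 -f ~/.ssh/cyclecloud_rsa
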
The Avere Cache is an optional solution that can drastically increase read throughput for the application job data.
Avere vFXT for Azure solves the problem of running these enterprise HPC applications in the cloud while leveraging
data stored on-premises or in Azure Blob storage.
For organizations that are planning for a hybrid infrastructure with both on-premises storage and cloud computing,
HPC applications can "burst" into Azure using data stored in NAS devices and spin up virtual CPUs as needed. The
data set is never moved completely into the cloud. The requested bytes are temporarily cached using an Avere
cluster during processing.
To set up and configure an Avere vFXT installation, follow the Avere Setup and Configuration guide.

Pricing
The cost of running an HPC implementation using CycleCloud server will vary depending on a number of factors.
For example, CycleCloud is charged by the amount of compute time that is used, with the Primary and CycleCloud
server typically being constantly allocated and running. The cost of running the Execute nodes will depend on how
long these are up and running as well as what size is used. The normal Azure charges for storage and networking
also apply.
This scenario shows how CFD applications can be run in Azure, so the machines will require RDMA functionality,
which is only available on specific VM sizes. The following are examples of costs that could be incurred for a scale
set that is allocated continuously for eight hours per day for one month, with data egress of 1 TB. It also includes
pricing for the Azure CycleCloud server and the Avere vFXT for Azure install:
Region: North Europe
Azure CycleCloud Server: 1 x Standard D3 (4 x CPUs, 14 GB Memory, Standard HDD 32 GB)
Azure CycleCloud Primary Server: 1 x Standard D12 v (4 x CPUs, 28 GB Memory, Standard HDD 32 GB)
Azure CycleCloud Node Array: 10 x Standard H16r (16 x CPUs, 112 GB Memory)
Avere vFXT on Azure Cluster: 3 x D16s v3 (200 GB OS, Premium SSD 1-TB data disk)
Data Egress: 1 TB
Review this price estimate for the hardware listed above.
Next steps
Once you've deployed the sample, learn more about Azure CycleCloud.

Related resources
RDMA Capable Machine Instances
Customizing an RDMA Instance VM
Azure Kubernetes Service solution journey

Kubernetes is an open-source system for automating deployment, scaling, and management of containerized
applications. Azure Kubernetes Service (AKS) makes it simple to deploy a managed Kubernetes cluster in Azure.
Organizations are at various points in their understanding, rationalizing, and adoption of Kubernetes on Azure.
Your organization's journey will likely follow a similar path to many other technologies you've adopted; learning,
aligning your organization around roles & responsibilities, and deploying production-ready workloads. From
there, you'll iterate; growing your product as your customer and business demands change.

(Diagram of the AKS solution journey: Learn, Align, AKS Baseline Cluster, Your Workload, Best Practices, Operate, and Iterate.)

Learn about Azure Kubernetes Service (AKS)


If you're new to Kubernetes or AKS, the best place to learn about the service is with Microsoft Learn. Microsoft
Learn is a free, online training platform that provides interactive learning for Microsoft products and more. The
Introduction to Kubernetes on Azure learning path will provide you with foundational knowledge that will
take you through core concepts of containers, AKS cluster management, and workload deployment.
Introduction to Kubernetes on Azure

Organizational readiness
As organizations such as yours have adopted Azure, the Cloud Adoption Framework provides them prescriptive
guidance as they move between the phases of the cloud adoption lifecycle. The Cloud Adoption Framework
includes tools, programs, and content to simplify adoption of Kubernetes and related cloud-native practices at
scale.
Kubernetes in the Cloud Adoption Framework

Path to production
You understand the benefits and trade-offs of Kubernetes, and have decided that AKS is the best Azure compute
platform for your workload. Your organizational controls have been put into place; you're ready to learn how to
deploy production-ready clusters for your workload.
Microsoft's AKS Baseline Cluster is the starting point to help you build production-ready AKS clusters. We
recommend you start from this baseline implementation and modify it to align to your workload's specific needs
and Well-Architected Framework priorities.
Microsoft's AKS Baseline Cluster
Best practices
As part of ongoing operations, you may wish to spot check your cluster against current recommended best
practices. The best place to start is to ensure your cluster is aligned with Microsoft's AKS Baseline Cluster.
See Best Practices for Cluster Operations and Best Practices for AKS Workloads.

You may also consider evaluating a community-driven utility like The AKS Checklist as a way of organizing and
tracking your alignment to these best practices.

Operations Guide
Getting your workload deployed on AKS is a great milestone and this is when day-2 operations are going to be
top-of-mind. Microsoft's AKS Day 2 Operations Guide was built for your ease of reference. This will help
ensure you are ready to meet the demands of your customers and ensure you are prepared for break-fix situations
via optimized triage processes.
Microsoft's AKS Day 2 Operations Guide

Stay current with AKS


Kubernetes and AKS are both moving fast. The platform is evolving and just knowing what's on the roadmap
might help you make architectural decisions and understand planned deprecations; consider bookmarking it.
AKS product roadmap

Additional resources
The typical AKS solution journey depicted above ranges from learning about AKS to growing your existing clusters
to meet new product and customer demands. However, you might also just be looking for additional reference and
supporting material to help along the way for your specific situation.
Example solutions
If you're seeking additional reference material that use AKS as their foundation, here are a few to consider.
Microservices architecture on AKS
Secure DevOps for AKS
Building a telehealth system
CI/CD pipeline for container-based workloads
Azure Arc
Azure Kubernetes Service offers you a managed Kubernetes experience on Azure. However, there are workloads or
situations that might be better suited to placing your own Kubernetes clusters under Azure Arc management.
includes your clusters such as RedHat OpenShift, RedHat RKE, and Canonical Charmed Kubernetes. Azure Arc
management should also be used for AKS Engine clusters running in your datacenter, in another cloud, or on
Azure Stack Hub.
Azure Arc enabled Kubernetes
Managed service provider
If you're a managed service provider, you already use Azure Lighthouse to manage resources for multiple
customers. Azure Kubernetes Service supports Azure Lighthouse so that you can manage hosted Kubernetes
environments and deploy containerized applications within your customers' tenants.
AKS with Azure Lighthouse
AWS or GCP professionals
These articles provide service mapping and comparison between Azure and other cloud services. This reference
can help you ramp up quickly on Azure.
Containers and container orchestrators for AWS Professionals
Containers and container orchestrators for GCP Professionals
Azure Kubernetes Service (AKS) production baseline

In this reference architecture, we’ll build a baseline infrastructure that deploys an Azure Kubernetes Service (AKS)
cluster. This article includes recommendations for networking, security, identity, management, and monitoring of
the cluster based on an organization’s business requirements. The requirements are assessed by using Azure
Well-Architected Framework.

An implementation of this architecture is available on GitHub: Azure Kubernetes Service (AKS) Secure
Baseline Reference Implementation. You can use it as a starting point and configure it as per your needs.
Recommended content
This reference architecture requires knowledge of Kubernetes and its concepts. If you need a refresher,
complete this workshop to deploy a multi-container application to Kubernetes on Azure Kubernetes Service
(AKS).

Azure Kubernetes Service Workshop


Architecture choices
Networking configuration
Network topology
Plan the IP addresses
Deploy Ingress resources
Cluster compute
Compute for the base cluster
Identity management
Integrate Azure AD for the cluster
Integrate Azure AD for the workload
Secure data flow
Secure the network flow
Add secret management
Business continuity
Scalability
Cluster and node availability
Availability and multi-region support
Operations
Cluster and workload CI/CD pipelines
Cluster health and metrics
Cost management and reporting

Network topology
This architecture uses a hub-spoke network topology. The hub and spoke(s) are deployed in separate virtual
networks connected through peering. Some advantages of this topology are:
Segregated management. It allows for a way to apply governance and control the blast radius. It also
supports the concept of landing zone with separation of duties.
Minimizes direct exposure of Azure resources to the public internet.
Organizations often operate with regional hub-spoke topologies. Hub-spoke network topologies can be
expanded in the future and provide workload isolation.
All web applications should require a web application firewall (WAF) service to help govern HTTP traffic
flows.
A natural choice for workloads that span multiple subscriptions.
It makes the architecture extensible. To accommodate new features or workloads, new spokes can be added
instead of redesigning the network topology.
Certain resources, such as a firewall and DNS can be shared across networks.

(Diagram: hub-spoke network topology. The hub virtual network contains subnets for Azure Firewall (outbound), Azure Bastion (management), and a gateway to on-premises, and is peered with spoke virtual networks. The workload spoke contains a subnet for the AKS-managed internal load balancer, a subnet for Azure Application Gateway with WAF facing the public internet, and a subnet for the cluster node pools, that is, Kubernetes system nodes and user nodes running the Traefik ingress controller and workload pods, with Private Link connections to Azure Key Vault and Azure Container Registry.)
Hub
The hub virtual network is the central point of connectivity and observability. Within the network, three subnets
are deployed.
Subnet to host Azure Firewall
Azure Firewall is a firewall as a service. The firewall instance secures outbound network traffic. Without this layer of
security, the flow might communicate with a malicious third-party service that could exfiltrate sensitive company
data.
Subnet to host a gateway
This subnet is a placeholder for a VPN or ExpressRoute gateway. The gateway provides connectivity between the
routers in the on-premises network and the virtual network.
Subnet to host Azure Bastion
This subnet is a placeholder for Azure Bastion. You can use Bastion to securely access Azure resources without
exposing the resources to the internet. This subnet is used for management and operations only.
Spoke
The spoke virtual network will contain the AKS cluster and other related resources. The spoke has three subnets:
Subnet to host Azure Application Gateway
Azure Application Gateway is a web traffic load balancer operating at Layer 7. The reference implementation uses
the Application Gateway v2 SKU that enables Web Application Firewall (WAF). WAF secures incoming traffic from
common web traffic attacks. The instance has a public frontend IP configuration that receives user requests. By
design, Application Gateway requires a dedicated subnet.
Subnet to host the ingress resources
To route and distribute traffic, Traefik is the ingress controller that is going to fulfill the Kubernetes ingress
resources. The Azure internal load balancers exist in this subnet.
Subnet to host the cluster nodes
AKS maintains two separate groups of nodes (or node pools). The system node pool hosts pods that run core
cluster services. The user node pool runs the Contoso workload and the ingress controller to facilitate inbound
communication to the workload. The workload is a simple ASP.NET application.
For additional information, see Hub-spoke network topology in Azure.

Plan the IP addresses

The address space of the virtual network should be large enough to hold all subnets. Account for all entities that
will receive traffic. IP addresses for those entities will be allocated from the subnet address space. Consider these
points.
Upgrade
AKS updates nodes regularly to make sure the underlying virtual machines are up to date on security
features and other system patches. During an upgrade process, AKS creates a node that temporarily hosts
the pods, while the upgrade node is cordoned and drained. That temporary node is assigned an IP address
from the cluster subnet.
For pods, you might need additional addresses depending on your strategy. For rolling updates, you'll need
addresses for the temporary pods that run the workload while the actual pods are updated. If you use the
replace strategy, pods are removed, and the new ones are created. So, addresses associated with the old
pods are reused.
Scalability
Take into consideration the node count of all system and user nodes and their maximum scalability limit.
Suppose you want to scale out by 400%. You'll need four times the number of addresses for all those
scaled-out nodes.
In this architecture, each pod can be contacted directly, so each pod needs an individual address. Pod
scalability will therefore affect the address calculation, which depends on how many pods you expect to
scale out to.
Azure Private Link addresses
Factor in the addresses that are required for communication with other Azure services over Private Link. In
this architecture, we have two addresses assigned for the links to Azure Container Registry and Key Vault.
Certain addresses are reserved for use by Azure. They can't be assigned.
The preceding list isn't exhaustive. If your design has other resources that will impact the number of available IP
addresses, accommodate those addresses.
This architecture is designed for a single workload. For multiple workloads, you may want to isolate the user node
pools from each other and from the system node pool. That choice may result in more subnets that are smaller in
size. Also, the ingress resource might be more complex. You might need multiple ingress controllers that will
require extra addresses.
For the complete set of considerations for this architecture, see AKS baseline Network Topology.
For information related to planning IP for an AKS cluster, see Plan IP addressing for your cluster.

Configure compute for the base cluster


In AKS, each node pool maps to a virtual machine scale set. Nodes are VMs in each node pool. Consider using a
smaller VM size for the system node pool to minimize costs. This reference implementation deploys the system
node pool with three DS2_v2 nodes. That size is sufficient to meet the expected load of the system pods. The OS
disk is 512 GB.
For the user node pool, here are some considerations:
Choose larger node sizes to pack the maximum number of pods set on a node. It will minimize the
footprint of services that run on all nodes, such as monitoring and logging.
Deploy at least two nodes. That way, the workload will have a high availability pattern with two replicas.
With AKS, you can change the node count without recreating the cluster.
Actual node sizes for your workload will depend on the requirements determined by the design team.
Based on the business requirements, we've chosen DS4_v2 for the production workload. To lower costs one
could drop the size to DS3_v2, which is the minimum recommendation.
When planning capacity for your cluster, assume that your workload can consume up to 80% of each node;
the remaining 20% is reserved for AKS services.
The maximum number of pods per node is set to 30, which is also the default. Increasing this value can impact
performance during an unexpected node failure or planned node maintenance, because more pods must be
rescheduled at once. (A node-pool provisioning sketch follows this list.)
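The following Azure CLI sketch adds a user node pool that matches these recommendations; the cluster and pool names are hypothetical placeholders.

# Add a user node pool sized for the production workload.
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-aks-cluster \
  --name usernp01 \
  --node-count 2 \
  --node-vm-size Standard_DS4_v2 \
  --max-pods 30 \
  --mode User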

Integrate Azure Active Directory for the cluster


Securing access to and from the cluster is critical. Think from the cluster's perspective when you're making
security choices:
Outside-in access. Authorize only those external entities that are allowed access to the Kubernetes API
server and Azure Resource Manager.
Inside-out access. Authorize only those resources that the cluster is allowed access.
There are two ways to manage access through Azure Active Directory (Azure AD): service principals or managed
identities for Azure resources.
Of the two, managed identities are recommended. With service principals, you are responsible for managing
and rotating secrets, either manually or programmatically. With managed identities, Azure AD manages and
performs the authentication and timely rotation of secrets for you.
It's recommended that you enable managed identities so that the cluster can interact with external Azure resources
through Azure AD. You can enable this setting only during cluster creation. Even if Azure AD isn't used
immediately, you can incorporate it later.
As an example for the inside-out case, let’s study the use of managed identities when the cluster needs to pull
images from a container registry. This action requires the cluster to get the credentials of the registry. One way is
to store that information in the form of Kubernetes Secrets object and use imagePullSecrets to retrieve the secret.
That approach isn't recommended because of security complexities. Not only do you need prior knowledge of the
secret but also disclosure of that secret through the DevOps pipeline. Another reason is the operational overhead
of managing the rotation of the secret. Instead, grant acrPull access to the managed identity of the cluster to
your registry. This approach addresses those concerns.
In this architecture, the cluster accesses Azure resources that are secured by Azure AD and perform operations
that support managed identities. Assign role-based access control (RBAC) and permissions to the cluster’s
managed identities, depending on the operations that the cluster intends to do. The cluster will authenticate itself
to Azure AD and then be allowed or denied access based on the roles it has been assigned. Here are some
examples from this reference implementation where Azure RBAC built-in roles have been assigned to the cluster:
Network Contributor. The cluster's ability to control the spoke virtual network. This role assignment allows
the AKS cluster's system-assigned identity to work with the dedicated subnet for the internal ingress controller
services.
Monitoring Metrics Publisher. The cluster’s ability to send metrics to Azure Monitor.
AcrPull. The cluster’s ability to pull images from the specified Azure Container Registries.
Azure AD integration also simplifies security for outside-in access. Suppose a user wants to use kubectl. As an
initial step, the user runs the az aks get-credentials command to get the credentials of the cluster. Azure AD
authenticates the user's identity against the Azure Resource Manager RBAC roles that are allowed to get cluster
credentials. For more information, see Available cluster roles permissions.
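A minimal Azure CLI sketch of this setup, with hypothetical names, creates the cluster with a managed identity and Azure AD integration, attaches a container registry (which assigns the AcrPull role to the cluster's identity), and then retrieves cluster credentials:

# Create an AKS cluster with managed identity and Azure AD integration,
# and grant its identity pull access to an existing container registry.
az aks create \
  --resource-group my-rg \
  --name my-aks-cluster \
  --enable-managed-identity \
  --enable-aad \
  --attach-acr myContainerRegistry

# Get credentials; Azure AD validates the caller before kubectl access is granted.
az aks get-credentials --resource-group my-rg --name my-aks-cluster
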
Associate Kubernetes RBAC to Azure Active Directory
Kubernetes supports role-based access control (RBAC) through:
A set of permissions. Defined by a Role or ClusterRole object for cluster-wide permissions.
Bindings that assign users and groups who are allowed to do the actions. Defined by a RoleBinding or
ClusterRoleBinding object.

Kubernetes has some built-in roles such as cluster-admin, edit, view, and so on. Bind those roles to Azure Active
Directory users and groups to use enterprise directory to manage access. For more information, see Use
Kubernetes RBAC with Azure AD integration.
There’s also an option of using Azure RBAC roles instead of the Kubernetes built-in roles. For more information,
see Azure RBAC roles.

Integrate Azure Active Directory for the workload


Similar to having Azure Managed Identities for the entire cluster, you can assign managed identities at the pod
level. A pod managed identity allows the hosted workload to access resources through Azure Active Directory. For
example, the workload stores files in the Azure Storage. When it needs to access those files, the pod will
authenticate itself against the resource.
In this reference implementation, managed pod identities is facilitated through aad-pod-identity.

Deploy Ingress resources


Kubernetes Ingress resources route and distribute incoming traffic to the cluster. There are two portions of Ingress
resources:
Internal load balancer. Managed by AKS. This load balancer exposes the ingress controller through a private
static IP address. It serves as a single point of contact that receives inbound flows.
In this architecture, Azure Load Balancer is used. It’s placed outside the cluster in a subnet dedicated for
ingress resources. It receives traffic from Azure Application Gateway and that communication is over TLS.
For information about TLS encryption for inbound traffic, see Ingress traffic flow.
Ingress controller. We have chosen Traefik. It runs in the user node pool in the cluster. It receives traffic from
the internal load balancer, terminates TLS, and forwards it to the workload pods over HTTP.
The ingress controller is a critical component of cluster. Consider these points when configuring this
component.
As part of your design decisions, choose a scope within which the ingress controller will be allowed to
operate. For example, you might allow the controller to only interact with the pods that run a specific
workload.
Avoid placing replicas on the same node to spread out the load and ensure business continuity if a node
goes down. Use podAntiAffinity for this purpose.
Constrain pods to be scheduled only on the user node pool by using nodeSelectors. This setting will
isolate workload and system pods.
Open ports and protocols that allow specific entities to send traffic to the ingress controller. In this
architecture, Traefik only receives traffic from Azure Application Gateway.
Ingress controller should send signals that indicate the health of pods. Configure readinessProbe and
livenessProbe settings that will monitor the health of the pods at the specified interval.

Consider restricting the ingress controller’s access to specific resources and the ability to perform certain
actions. That restriction can be implemented through Kubernetes RBAC permissions. For example, in this
architecture, Traefik has been granted permissions to watch, get, and list services and endpoints by using
rules in the Kubernetes ClusterRole object.
Router settings
The ingress controller uses routes to determine where to send traffic. Routes specify the source port at which the
traffic is received and information about the destination ports and protocols.
Here’s an example from this architecture:
Traefik uses the Kubernetes provider to configure routes. The annotations, tls, and entrypoints indicate that
routes will be served over HTTPS. The middlewares specifies that only traffic from the Azure Application Gateway
subnet is allowed. The responses will use gzip encoding if the client accepts it. Because Traefik does TLS
termination, communication with the backend services is over HTTP.

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: aspnetapp-ingress
  namespace: a0008
  annotations:
    kubernetes.io/ingress.allow-http: "false"
    kubernetes.io/ingress.class: traefik-internal
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.tls: "true"
    traefik.ingress.kubernetes.io/router.tls.options: default
    traefik.ingress.kubernetes.io/router.middlewares: app-gateway-snet@file, gzip-compress@file
spec:
  tls:
  - hosts:
    - bu0001a0008-00.aks-ingress.contoso.com
  rules:
  - host: bu0001a0008-00.aks-ingress.contoso.com
    http:
      paths:
      - path: /
        backend:
          serviceName: aspnetapp-service
          servicePort: http

Secure the network flow


Network flow, in this context, can be categorized as:
Ingress traffic . From the client to the workload running in the cluster.
Egress traffic . From a pod or node in the cluster to an external service.
Pod-to-pod traffic . Communication between pods. This traffic includes communication between the
ingress controller and the workload. Also, if your workload is composed of multiple applications deployed
to the cluster, communication between those applications would fall into this category.
Management traffic . Traffic that goes between the client and the Kubernetes API server.
This architecture has several layers of security to secure all types of traffic.
Ingress traffic flow
The architecture only accepts TLS encrypted requests from the client. TLS v1.2 is the minimum allowed version
with a restricted set of ciphers. Server Name Indication (SNI) strict is enabled. End-to-end TLS is set up through
Application Gateway by using two different TLS certificates, as shown in this image.

1. The client sends an HTTPS request to the domain name: bicycle.contoso.com. That name is associated
through a DNS A record with the public IP address of Azure Application Gateway. This traffic is encrypted to
make sure that the traffic between the client browser and gateway cannot be inspected or changed.
2. Application Gateway has an integrated web application firewall (WAF) and negotiates the TLS handshake
for bicycle.contoso.com, allowing only secure ciphers. Application Gateway is a TLS termination point, as it's
required to process WAF inspection rules, and execute routing rules that forward the traffic to the
configured backend. The TLS certificate is stored in Azure Key Vault. It’s accessed using a user-assigned
managed identity integrated with Application Gateway. For information about that feature, see TLS
termination with Key Vault certificates.
3. As the traffic moves from Application Gateway to the backend, it's encrypted again with another TLS
certificate (a wildcard certificate for *.aks-ingress.contoso.com) as it's forwarded to the internal load balancer. This re-
encryption makes sure that traffic that isn't secure doesn't flow into the cluster subnet.
4. The ingress controller receives the encrypted traffic through the load balancer. The controller is another TLS
termination point for *.aks-ingress.contoso.com and forwards the traffic to the workload pods over HTTP.
The certificates are stored in Azure Key Vault and mounted into the cluster using the Container Storage
Interface (CSI) driver. For more information, see Add secret management.
You can implement end-to-end TLS at every hop, all the way through to the workload pod. Be sure to
measure the performance, latency, and operational impact when making the decision to secure pod-to-pod traffic.
For most single-tenant clusters, with proper control plane RBAC and mature Software Development Lifecycle
practices, it's sufficient to TLS encrypt up to the ingress controller and protect with Web Application Firewall
(WAF). That will minimize overhead in workload management and network performance impacts. Your workload
and compliance requirements will dictate where you perform TLS termination.
Egress traffic flow
For zero-trust control and the ability to inspect traffic, all egress traffic from the cluster moves through Azure
Firewall. You can implement that choice using user-defined routes (UDRs). The next hop of the route is the private
IP address of the Azure Firewall. Here, Azure Firewall decides whether to block or allow the egress traffic. That
decision is based on the specific rules defined in the Azure Firewall or the built-in threat intelligence rules.

NOTE
If you use a public load balancer as your public point for ingress traffic and egress through Azure Firewall using UDRs, you
might see an asymmetric routing situation. This architecture uses internal load balancers in a dedicated ingress subnet
behind the Application Gateway. This design choice not only enhances security but also eliminates asymmetric routing
concerns. Alternatively, you could route ingress traffic through your Azure Firewall before or after your Application Gateway.
That approach isn't necessary or recommended for most situations. For more information about asymmetric routing, see
Integrate Azure Firewall with Azure Standard Load Balancer.

An exception to the zero-trust control is when the cluster needs to communicate with other Azure resources. For
instance, the cluster needs to pull an updated image from the container registry. The recommended approach is
to use Azure Private Link. The advantage is that specific subnets reach the service directly. Also, traffic between
the cluster and the service isn't exposed to public internet. A downside is that Private Link needs additional
configuration instead of using the target service over its public endpoint. Also, not all Azure services or SKUs
support Private Link. For those cases, consider enabling a Service Endpoint on the subnet to access the service.
If Private Link or Service Endpoints aren't an option, you can reach other services through their public endpoints,
and control access through Azure Firewall rules and the firewall built into the target service. Because this traffic
will go through the static IP address of the firewall, that address can be added to the service's IP allow list. One
downside is that Azure Firewall needs additional rules to make sure that only traffic from a specific subnet is
allowed.
Pod-to-pod traffic
By default, a pod can accept traffic from any other pod in the cluster. Kubernetes NetworkPolicy is used to restrict
network traffic between pods. Apply policies judiciously, otherwise you might have a situation where a critical
network flow is blocked. Only allow specific communication paths, as needed, such as traffic between the ingress
controller and workload. For more information, see Network policies.
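Here's a minimal sketch of such a policy that only allows the workload pods to receive traffic from the ingress controller pods. The labels (app: workload and app: traefik) and the port are assumptions; match them to your actual deployment.

# A minimal sketch: allow ingress to the workload pods only from the ingress
# controller pods in the same namespace. Labels and port are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-controller-to-workload
  namespace: a0008
spec:
  podSelector:
    matchLabels:
      app: workload            # the pods this policy protects
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: traefik         # only the ingress controller pods
    ports:
    - protocol: TCP
      port: 8080               # the workload's container port (assumed)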
Enable network policy when the cluster is provisioned because it can't be added later. There are a few choices for
technologies that implement NetworkPolicy. Azure Network Policy is recommended, which requires Azure
Container Networking Interface (CNI); see the note below. Other options include Calico Network Policy, a well-
known open-source option. Consider Calico if you need to manage cluster-wide network policies. Calico isn't
covered under standard Azure support.
For information, see Differences between Azure Network Policy and Calico policies and their capabilities.

NOTE
AKS supports these networking models: kubenet and Azure Container Networking Interface (CNI). CNI is the more advanced of
the two models and is required for enabling Azure Network Policy. In this model, every pod gets an IP address from the
subnet address space. Resources within the same network (or peered resources) can access the pods directly through their
IP address. NAT isn't needed for routing that traffic. So, CNI is performant because there aren't additional network overlays.
It also offers better security control because it enables the use of Azure Network Policy. In general, CNI is recommended. CNI
offers teams granular control over the resources they manage. Also, CNI allows for more scaled pods than kubenet.
Carefully consider this choice; otherwise, the cluster will need to be redeployed. For information about the models, see
Compare network models.

Management traffic
As part of running the cluster, the Kubernetes API server will receive traffic from resources that want to do
management operations on the cluster, such as requests to create resources or to scale the cluster. Examples of
those resources include the build agent pool in a DevOps pipeline, a Bastion subnet, and node pools themselves.
Instead of accepting this management traffic from all IP addresses, use AKS’s Authorized IP Ranges feature to only
allow traffic from your authorized IP ranges to the API server.
For more information, see Define API server authorized IP ranges.

Add secret management


Store secrets in a managed key store, such as Azure Key Vault. The advantage is that the managed store handles
rotation of secrets, offers strong encryption, provides an access audit log, and keeps core secrets out of the
deployment pipeline.
Azure Key Vault is well integrated with other Azure services. Use the built-in feature of those services to access
secrets. For an example about how Azure Application Gateway accesses TLS certificates for the ingress flow, see
the Ingress traffic flow section.
Accessing cluster secrets
You'll need to use pod managed identities to allow a pod to access secrets from a specific store.
To facilitate the retrieval process, use a Secrets Store CSI driver. When the pod needs a secret, the driver connects
to the specified store, retrieves the secret into a volume, and mounts that volume in the cluster. The pod can then get
the secret from the volume file system.
The CSI driver has many providers to support various managed stores. In this implementation, we've chosen the
Azure Key Vault provider for Secrets Store CSI Driver to retrieve the TLS certificate from Azure Key Vault and load it in the
pod running the ingress controller. This is done during pod creation, and the volume stores both the public and
private keys.
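For illustration, the driver is configured through a SecretProviderClass object. The following sketch uses the Azure Key Vault provider with placeholder values; exact field names can vary between versions of the driver and provider.

# A minimal sketch of a SecretProviderClass for the Azure Key Vault provider.
# The placeholder values (<...>) are hypothetical; field names can vary between
# versions of the driver and provider.
apiVersion: secrets-store.csi.x-k8s.io/v1alpha1
kind: SecretProviderClass
metadata:
  name: aks-ingress-tls
  namespace: a0008
spec:
  provider: azure
  parameters:
    usePodIdentity: "true"              # retrieve by using the pod's managed identity
    keyvaultName: "<key-vault-name>"
    tenantId: "<tenant-id>"
    objects: |
      array:
        - |
          objectName: <ingress-certificate-name>
          objectType: secret            # certificates with private keys are retrieved as secrets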

Workload storage
The workload used in this architecture is stateless. If you need to store state, persisting it outside the cluster is
recommended. Guidance for workload state is outside the scope of this article.
To learn more about storage options, see Storage options for applications in Azure Kubernetes Service (AKS).

Node and pod scalability


With increasing demand, Kubernetes can scale out by adding more pods to existing nodes, through horizontal
pod autoscaling (HPA). When additional pods can no longer be scheduled, the number of nodes must be increased
through AKS cluster autoscaling. A complete scaling solution must have ways to scale both pod replicas and the
node count in the cluster.
There are two approaches: autoscaling or manual scaling.
The manual or programmatic way requires you to monitor and set alerts on CPU utilization or custom metrics.
For pod scaling, the application operator can increase or decrease the number of pod replicas by adjusting the
ReplicaSet through Kubernetes APIs. For cluster scaling, one way is to get notified when the Kubernetes
scheduler fails. Another way is to watch for pending pods over time. You can adjust the node count through Azure
CLI or the portal.
Autoscaling is the recommended approach because some of those manual mechanisms are built into the autoscaler.
As a general approach, start by performance testing with a minimum number of pods and nodes. Use those
values to establish the baseline expectation. Then use a combination of performance metrics and manual scaling
to locate bottlenecks and understand the application’s response to scaling. Finally, use this data to set the
parameters for autoscaling. For information about a performance tuning scenario using AKS, see Performance
tuning scenario: Distributed business transactions.
Horizontal Pod Autoscaler
The Horizontal Pod Autoscaler (HPA) is a Kubernetes resource that scales the number of pods.
In the HPA resource, setting the minimum and maximum replica count is recommended. Those values constrain
the autoscaling bounds.
HPA can scale based on the CPU utilization, memory usage, and custom metrics. Only CPU utilization is provided
out of the box. The HorizontalPodAutoscaler definition specifies target values for those metrics. For instance, the
spec sets a target CPU utilization. While pods are running, the HPA controller uses Kubernetes Metrics API to check
each pod’s CPU utilization. It compares that value against the target utilization and calculates a ratio. It then uses
the ratio to determine whether pods are overallocated or underallocated. It relies on the Kubernetes scheduler to
assign new pods to nodes or remove pods from nodes.
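For illustration, here's a minimal HPA sketch, not taken from the reference implementation, that targets a hypothetical deployment named workload in the a0008 namespace and scales on CPU utilization.

# A minimal sketch of an HPA. The deployment name, replica bounds, and target
# utilization are placeholders; derive real values from load testing.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: workload-hpa
  namespace: a0008
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: workload
  minReplicas: 2
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70      # target average CPU utilization across pods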
There might be a race condition where the HPA checks metric values before a scaling operation is complete. The outcome might
be an incorrect ratio calculation. For details, see Cooldown of scaling events.
If your workload is event-driven, a popular open-source option is KEDA. Consider KEDA if your workload is driven
by an event source, such as a message queue, rather than being CPU- or memory-bound. KEDA supports many
event sources (or scalers), including the Azure Monitor scaler, which is a convenient way to scale KEDA workloads
based on Azure Monitor metrics.
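As an illustration only, a KEDA ScaledObject for a queue-driven workload might look like the following sketch. It assumes KEDA v2 is installed, a hypothetical deployment named orders-processor, an Azure Service Bus queue named orders, and a connection string exposed to the workload as the SERVICEBUS_CONNECTION environment variable. Field names can differ between KEDA versions.

# A minimal sketch of a KEDA ScaledObject. All names are hypothetical and the
# trigger metadata fields can differ between KEDA versions.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-processor-scaler
  namespace: a0008
spec:
  scaleTargetRef:
    name: orders-processor          # deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: azure-servicebus
    metadata:
      queueName: orders
      messageCount: "20"            # target messages per replica
      connectionFromEnv: SERVICEBUS_CONNECTION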
Cluster Autoscaler
The cluster autoscaler is an AKS add-on component that scales the number of nodes in a node pool. It should be
added during cluster provisioning. You need a separate cluster autoscaler for each user node pool.
The cluster autoscaler is triggered by the Kubernetes scheduler. When the Kubernetes scheduler fails to schedule a
pod because of resource constraints, the autoscaler automatically provisions a new node in the node pool.
Conversely, the cluster autoscaler checks the unused capacity of the nodes. If the node is not running at an
expected capacity, the pods are moved to another node, and the unused node is removed.
When you enable autoscaler, set the maximum and minimum node count. The recommended values depend on
the performance expectation of the workload, how much you want the cluster to grow, and cost implications. The
minimum number is the reserved capacity for that node pool. In this reference implementation, the minimum
value is set to 2 because of the simple nature of the workload.
For the system node pool, the recommended minimum value is 3.

Business continuity decisions


To maintain business continuity, define the Service Level Agreement for the infrastructure and your application.
For information about monthly uptime calculation, see SLA for Azure Kubernetes Service (AKS).
Cluster nodes
To meet the minimum level of availability for workloads, multiple nodes in a node pool are needed. If a node goes
down, another node in the node pool in the same cluster can continue running the application. For reliability, three
nodes are recommended for the system node pool. For the user node pool, start with no less than two nodes. If
you need higher availability, provision more nodes.
Isolate your application from the system services by placing it in a separate node pool. This way, Kubernetes
services run on dedicated nodes and don’t compete with your workload. Use of tags, labels, and taints is
recommended to identify the node pool to schedule your workload.
Regular upkeep of your cluster such as timely updates is crucial for reliability. Also monitoring the health of the
pods through probes is recommended.
Pod availability
Ensure pod resources. It's highly recommended that deployments specify pod resource requirements. The
scheduler can then appropriately schedule the pod. Reliability will degrade significantly if pods cannot be
scheduled.
Set pod disruption budgets. This setting determines how many replicas in a deployment can come down
during an update or upgrade event. For more information, see Pod disruption budgets.
Configure multiple replicas in the deployment to handle disruptions such as hardware failures. For planned events
such as updates and upgrades, a disruption budget can ensure the required number of pod replicas exist to
handle expected application load.
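Here's a minimal sketch of a pod disruption budget for a hypothetical deployment whose pods carry the label app: workload; adjust the selector and the minAvailable value for your workload.

# A minimal sketch: keep at least two replicas available during voluntary
# disruptions such as node upgrades. The label is an assumption.
apiVersion: policy/v1beta1          # policy/v1 on newer Kubernetes versions
kind: PodDisruptionBudget
metadata:
  name: workload-pdb
  namespace: a0008
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: workload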
Set resource quotas on the workload namespaces. The resource quota on a namespace will ensure pod
requests and limits are properly set on a deployment. For more information, see Enforce resource quotas.
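For illustration, a namespace-level quota might look like the following sketch; the values are placeholders and should be sized for your workload.

# A minimal sketch of a resource quota on a workload namespace. The limits are
# illustrative only.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: workload-quota
  namespace: a0008
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"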

NOTE
Setting resource quotas at the cluster level can cause problems when deploying third-party workloads that don't have
proper requests and limits.

Set pod requests and limits. Setting these limits allows Kubernetes to efficiently allocate CPU and memory
resources to the pods and have higher container density on a node. Limits can also increase reliability with
reduced costs because of better hardware utilization.
To estimate the limits, test and establish a baseline. Start with equal values for requests and limits. Then, gradually
tune those values until you find the threshold that causes instability in the cluster.
Those limits can be specified in your deployment manifests. For more information, see Set pod requests and
limits.
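For illustration, here's a minimal deployment sketch with equal requests and limits. The image reference and values are placeholders, not values from the reference implementation.

# A minimal sketch of container-level requests and limits. Start with equal
# requests and limits and tune from a measured baseline.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workload
  namespace: a0008
spec:
  replicas: 2
  selector:
    matchLabels:
      app: workload
  template:
    metadata:
      labels:
        app: workload
    spec:
      containers:
      - name: workload
        image: <registry>/<image>:<tag>   # placeholder image reference
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
          limits:
            cpu: 250m
            memory: 256Mi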
Availability zones and multi-region support
If your SLA requires a higher uptime, protect against loss in a zone. You can use availability zones if the region
supports them. The nodes in the user node pool are then able to spread across zones. If an entire zone is
unavailable, a node in another zone within the region is still available. Each node pool maps to a separate virtual
machine scale set, which manages node instances and scalability. Scale set operations and configuration are managed
by the AKS service. Here are some considerations when enabling multizone:
Entire infrastructure. Choose a region that supports availability zones. For more information, see
Limitations and region availability. If you want to buy an Uptime SLA, choose a region that supports that
option.
Cluster . Availability zones can only be set when the node pool is created and can't be changed later. The
node sizes should be supported in all zones so that the expected distribution is possible. The underlying
virtual machine scale set provides the same hardware configuration across zones.
Multizone support only applies to node pools. The AKS API server is in a single zone. If the API server
becomes unavailable as part of a zone failure, pods deployed on node pools will continue to run; however,
Kubernetes will lose orchestration capabilities, and the workload might be affected.
Dependent resources . For complete zonal benefit, all service dependencies must also support zones. If a
dependent service doesn't support zones, it's possible that a zone failure could cause that service to fail.
For example, a managed disk is available in the zone in which it's provisioned. In case of a failure, the node might
move to another zone, but the managed disk won’t move with the node to that zone.
For simplicity, in this architecture AKS is deployed to a single region with node pools spanning availability zones 1,
2, and 3. Other resources of the infrastructure, such as Azure Firewall and Application Gateway are deployed to
the same region also with multizone support. Geo-replication is enabled for Azure Container Registry.
Multiple regions
Enabling availability zones won’t be enough if the entire region goes down. To have higher availability, run
multiple AKS clusters, in different regions.
Use paired regions. Consider using a CI/CD pipeline that is configured to use a paired region to recover
from region failures. A benefit of using paired regions is reliability during updates. Azure makes sure that
only one region in the pair is updated at a time. Certain DevOps tools such as flux can make the multi-
region deployments easier.
If an Azure resource supports geo-redundancy, provide the location where the redundant service will have
its secondary. For example, enabling geo-replication for Azure Container Registry will automatically
replicate images to the selected Azure regions, and will provide continued access to images even if a region
were experiencing an outage.
Choose a traffic router that can distribute traffic across zones or regions, depending on your requirement.
This architecture deploys Azure Load Balancer because it can distribute non-web traffic across zones. If you
need to distribute traffic across regions, Azure Front Door should be considered. For other considerations,
see Choose a load balancer.
Disaster Recovery
In case of failure in the primary region, you should be able to quickly create a new instance in another region.
Here are some recommendations:
Use paired regions.
A non-stateful workload can be replicated efficiently. If you need to store state in the cluster (not
recommended), make sure you back up the data frequently in the paired region.
Integrate the recovery strategy, such as replicating to another region, as part of the DevOps pipeline to
meet your Service Level Objectives (SLO).
When provisioning each Azure service, choose features that support disaster recovery. For example, in this
architecture, Azure Container Registry is enabled for geo-replication. If a region goes down, you can still
pull images from the replicated region.
Kubernetes API Server Uptime SLA
AKS can be used as a free service, but that tier doesn't offer a financially backed SLA. To obtain that SLA, you must
choose to add an Uptime SLA to your purchase. We recommend all production clusters use this option. Reserve
clusters without this option for pre-production clusters. When combined with Azure Availability Zones, the
Kubernetes API server SLA is increased to 99.95%. Your node pools, and other resources are covered under their
own SLA.
Tradeoff
There's a cost-to-availability tradeoff for deploying the architecture across zones and especially regions. Some
replication features, such as geo-replication in Azure Container Registry, are available in premium SKUs, which are
more expensive. The cost will also increase because of bandwidth charges that are applied when traffic moves across
zones and regions.
Also, expect additional network latency in node communication between zones or regions. Measure the impact of
this architectural decision on your workload.
Test with simulations and forced failovers
Ensure reliability through forced failover testing with simulated outages, such as bringing down a node, bringing
down all AKS resources in a particular zone to simulate a zonal failure, or bringing down an external dependency.

Monitor and collect metrics


The Azure Monitor for containers feature is the recommended tool for monitoring and logging because you can
view events in real time. It captures container logs from the running pods and aggregates them for viewing. It also
collects information from Metrics API about memory and CPU utilization to monitor the health of running
resources and workloads. You can use it to monitor performance as the pods scale. Another advantage is that you
can easily use Azure portal to configure charts and dashboards. It has the capability to create alerts that trigger
Automation Runbooks, Azure Functions, and others.
Most workloads hosted in pods emit Prometheus metrics. Azure Monitor is capable of scraping Prometheus
metrics and visualizing them.
There are some third-party utilities integrated with Kubernetes. Take advantage of log and metrics platforms such
as Grafana or Datadog, if your organization already uses them.
With AKS, Azure manages some core Kubernetes services. Logs from those services should only be enabled per
request from customer support. However, it is recommended that you enable these log sources as they can help
you troubleshoot cluster issues:
Logging on the ClusterAutoscaler to gain observability into the scaling operations. For more information, see
Retrieve cluster autoscaler logs and status.
KubeControllerManager to have observability into pod scheduler.
KubeAuditAdmin to have observability into activities that modify your cluster.
Enable self-healing
Monitor the health of pods by setting liveness and readiness probes. A liveness probe determines whether the pod
is healthy; if it doesn't respond, Kubernetes restarts the pod. A readiness probe determines whether the pod is
ready to receive requests or traffic.
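Here's a minimal sketch of a pod with both probes configured. The /healthz and /ready paths and port 8080 are assumptions; expose whatever health endpoints your workload actually implements.

# A minimal sketch of liveness and readiness probes. Paths, port, and timings
# are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: workload
  namespace: a0008
  labels:
    app: workload
spec:
  containers:
  - name: workload
    image: <registry>/<image>:<tag>   # placeholder image reference
    ports:
    - containerPort: 8080
    livenessProbe:                    # restart the container if this fails
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 30
    readinessProbe:                   # don't route traffic until this succeeds
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10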
NOTE
AKS provides built-in self-healing of infrastructure nodes using Node Auto-Repair.

Security updates
Keep the Kubernetes version up to date with the supported N-2 versions. Upgrading to the latest version of
Kubernetes is critical because new versions are released frequently.
For more information, see Regularly update to the latest version of Kubernetes and Upgrade an Azure Kubernetes
Service (AKS) cluster.
Weekly updates
AKS provides new node images that have the latest OS and runtime updates. These new images are not
automatically applied. You are responsible for deciding how often the images should get updated. It's
recommended that you have a process to upgrade your node pools' base image weekly. For more information,
see Azure Kubernetes Service (AKS) node image upgrade and the AKS Release Notes.
Daily updates
Between image upgrades, AKS nodes download and install OS and runtime patches, individually. An installation
might require the node VMs to be rebooted. AKS will not reboot nodes due to pending updates. Have a process
that monitors nodes for the applied updates that require a reboot and performs the reboot of those nodes in a
controlled manner. An open-source option is Kured (Kubernetes reboot daemon).
Keeping your node images in sync with the latest weekly release will minimize these occasional reboot requests
while maintaining an enhanced security posture. Relying just on node image upgrades ensures AKS
compatibility and weekly security patching. Although applying daily updates fixes security issues faster, those
updates haven't necessarily been tested in AKS. Where possible, use node image upgrade as your primary weekly
security patching strategy.
Security monitoring
Monitor your container infrastructure for both active threats and potential security risks:
Enable Azure Defender for Kubernetes for threat detection on your Kubernetes clusters.
Use Azure Security Center (ASC) to monitor Kubernetes security posture.
For information about security hardening applied to AKS virtual machine hosts, see Security Hardening in host
OS.

Cluster and workload operations (DevOps)


Here are some considerations. For more information, see the Operational Excellence pillar.
Isolate workload responsibilities
Divide the workload by teams and types of resources to individually manage each portion.
Start with a basic workload that contains the fundamental components and build on it. An initial task would be to
configure networking. Provision virtual networks for the hub and spoke, and subnets within those networks. For
instance, the spoke has separate subnets for the system and user node pools and for ingress resources. The hub has
a subnet for Azure Firewall.
Another portion could be to integrate the basic workload with Azure Active Directory.
Use Infrastructure as Code (IaC)
Choose an idempotent declarative method over an imperative approach, where possible. Instead of writing a
sequence of commands that specify configuration options, use declarative syntax that describes the resources and
their properties. One option is Azure Resource Manager (ARM) templates; another is Terraform.
Make sure that you provision resources per the governing policies. For example, when selecting the right VM
sizes, stay within the cost constraints and choose availability zone options that match the requirements of your application.
If you need to write a sequence of commands, use Azure CLI. These commands cover a range of Azure services
and can be automated through scripting. Azure CLI is supported on Windows and Linux. Another cross-platform
option is Azure PowerShell. Your choice will depend on preferred skillset.
Store and version scripts and template files in your source control system.
Workload CI/CD
Pipelines for workflow and deployment must have the ability to build and deploy applications continuously.
Updates must be deployed safely and quickly and rolled back in case there are issues.
Your deployment strategy must include a reliable and an automated continuous delivery (CD) pipeline. Changes
to your workload container images should be automatically deployed to the cluster.
In this architecture, we've chosen GitHub Actions for managing the workflow and deployment. Other popular
options include Azure DevOps Services and Jenkins.
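For illustration only, a minimal GitHub Actions workflow might look like the following sketch. The registry name, resource group, cluster name, manifest path, and the AZURE_CREDENTIALS secret are placeholders; the workflow isn't taken from the reference implementation.

# A minimal sketch of a build-and-deploy workflow. All names and secrets are
# placeholders.
name: build-and-deploy
on:
  push:
    branches: [ main ]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - uses: azure/login@v1
      with:
        creds: ${{ secrets.AZURE_CREDENTIALS }}
    - name: Build and push the image with ACR Tasks
      run: az acr build --registry <acr-name> --image workload:${{ github.sha }} .
    - name: Get AKS credentials
      run: az aks get-credentials --resource-group <resource-group> --name <cluster-name>
    - name: Deploy manifests
      run: kubectl apply -f ./k8s/ --namespace a0008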
Cluster CI/CD

Instead of using an imperative approach like kubectl, use tools that automatically synchronize cluster and
repository changes. To manage the workflow, such as release of a new version and validation of that version
before deploying to production, consider a GitOps flow. An agent is deployed in the cluster to make sure that the
state of the cluster is coordinated with configuration stored in your private Git repo. Kubernetes and AKS do not
support that experience natively. A recommended option is flux. It uses one or more operators in the cluster to
trigger deployments inside Kubernetes. flux does these tasks:
Monitors all configured repositories.
Detects new configuration changes.
Triggers deployments.
Updates the desired running configuration based on those changes.
You can also set policies that govern how those changes are deployed.
Here’s an example from the reference implementation that shows how to automate cluster configuration with
GitOps and Flux.
1. A developer commits changes to source code, such as configuration YAML files, which are stored in a git
repository. The changes are then pushed to a git server.
2. flux runs in a pod alongside the workload. flux has read-only access to the git repository to make sure that
flux is only applying changes as requested by developers.
3. flux recognizes changes in configuration and applies those changes using kubectl commands.
4. Developers do not have direct access to the Kubernetes API through kubectl. Have branch policies on your
git server. That way, multiple developers can approve a change before it’s applied to production.
Workload and cluster deployment strategies
Deploy any change (architecture components, workload, cluster configuration) to at least one pre-production AKS
cluster. Doing so will simulate the change and can surface issues before deploying to production.
Run tests/validations at each stage before moving on to the next to make sure you can push updates to the
production environment in a highly controlled way and minimize disruption from unanticipated deployment
issues. This deployment should follow a similar pattern as production, using the same GitHub Actions pipeline or
Flux operators.
Advanced deployment techniques such as Blue-green deployment, A/B testing, and Canary releases, will require
additional process and potentially tooling. Flagger is a popular open-source solution to help solve for your
advanced deployment scenarios.

Cost management
Use the Azure pricing calculator to estimate costs for the services used in the architecture. Other best practices are
described in the Cost Optimization section in Microsoft Azure Well-Architected Framework.
Provision
There are no costs associated with AKS for the deployment, management, and operations of the Kubernetes
cluster. The main cost drivers are the virtual machine instances, storage, and networking resources consumed
by the cluster. Consider choosing cheaper VMs for system node pools. The recommended SKU is DS2_v2.
Don’t have the same configuration for dev/test and production environments. Production workloads have
extra requirements for high availability and will be more expensive. It may not be necessary in the dev/test
environment.
For production workloads, add an Uptime SLA. However, there are savings for clusters designed for
dev/test or experimental workloads where availability is not required to be guaranteed. For instance, the
SLO is sufficient. Also, if your workload supports it, consider using dedicated spot node pools that run Spot
VMs.
For non-production workloads that include Azure SQL Database or Azure App Service as part of the AKS
workload architecture, evaluate if you are eligible to use Azure Dev/Test subscriptions to receive service
discounts.
Instead of starting with an oversized cluster to meet the scaling needs, provision a cluster with minimum
number of nodes and enable the cluster autoscaler to monitor and make sizing decisions.
Set pod requests and limits to allow Kubernetes to allocate node resources with higher density so that
hardware is utilized to capacity.
Enabling diagnostics on the cluster can increase the cost.
If your workload is expected to exist for a long period, you can commit to one- or three-year Reserved Virtual
Machine Instances to reduce the node costs. For more information, see Reserved VMs.
Use tags when you create node pools. Tags are useful in creating custom reports to track the incurred costs.
Tags give the ability to track the total of expenses and map any cost to a specific resource or team. Also, if
the cluster is shared between teams, build chargeback reports per consumer to identify metered costs for
shared cloud services. For more information, see Specify a taint, label, or tag for a node pool.
Data transfers within availability zones of a region are not free. If your workload is multi-region or there
are transfers across billing zones, then expect additional bandwidth cost. For more information, see Traffic
across billing zones and regions.
Create budgets to stay within the cost constraints identified by the organization. One way is to create
budgets through Azure Cost Management. You can also create alerts to get notifications when certain
thresholds are exceeded. For more information, see Create a budget using a template.
Monitor
To monitor the cost of the entire cluster, along with compute costs, also gather cost information about storage,
bandwidth, firewall, and logs. Azure provides various dashboards to monitor and analyze cost:
Azure Advisor
Azure Cost Management
Ideally, monitor cost in real time or at least at a regular cadence to take action before the end of the month when
costs are already calculated. Also monitor the monthly trend over time to stay in the budget.
To make data-driven decisions, pinpoint which resource (granular level) incurs most cost. Also have a good
understanding of the meters that are used to calculate usage of each resource. By analyzing metrics, you can
determine if the platform is over-sized for instance. You can see the usage meters in Azure Monitor metrics.
Optimize
Act on recommendations provided by Azure Advisor. There are other ways to optimize:
Enable the cluster autoscaler to detect and remove underutilized nodes in the node pool.
Choose a lower SKU for the node pools, if your workload supports it.
If the application doesn’t require burst scaling, consider sizing the cluster to just the right size by analyzing
performance metrics over time.
For other cost-related information, see AKS pricing.

Next Steps
To learn about hosting Microservices on AKS, see Microservices architecture on Azure Kubernetes Service
(AKS).
To see the AKS product roadmap, see Azure Kubernetes Service Roadmap on GitHub.
Azure Kubernetes Services (AKS) Day 2 Operations
Guide

After releasing your Azure Kubernetes Service-hosted application, prepare for day-2 operations. The term refers to
the ongoing maintenance of the deployed assets and the rollout of upgrades. The operations can help you:
Keep up to date with your service-level agreement (SLA) or Service-level objective (SLO) requirements.
Troubleshoot customer support requests.
Stay current with the latest platform features and security updates.
Plan for future growth.

Prerequisites
The best practices for day-2 operations assume that you've deployed the Azure Kubernetes Service (AKS) Baseline
architecture as an example of a production cluster.

Triage practices
It's often challenging to do root-cause analysis given the different aspects of an AKS cluster. When triaging issues,
consider a top-down approach on the cluster hierarchy. Start at the cluster level and drill down if necessary.
Cluster > Node pools > Nodes > Pods > Containers

In this series, we'll walk you through the thought process of this approach. The articles show examples using a set
of tools and dashboards, and how they can highlight some symptoms.
Common causes addressed in this series include:
Network and connectivity problems caused by improper configuration.
Control plane to node communication is broken.
Kubelet pressures caused by insufficient compute, memory, or storage resources.
DNS resolution issues.
Nodes running out of disk IOPS.
Admission control pipeline is blocking a large number of requests to the API server.
The cluster doesn't have permissions to pull from the appropriate container registry.
This series isn't intended to resolve specific issues. For information about troubleshooting specific issues, see AKS
Common Issues.

In this series
1 - Check the AKS cluster health: Start by checking the overall health of the cluster and networking.
2 - Examine the node and pod health: Check the health of the AKS worker nodes.
3 - Check the workload deployments: Check to see that all deployments and daemonSets are running.
4 - Validate the admission controllers: Check whether the admission controllers are working as expected.
5 - Verify the connection to the container registry: Verify the connection to the container registry.

Related links
Day-2 operations
AKS periscope
AKS roadmap
AKS documentation
Check the AKS cluster health

Start by checking the health of the overall cluster and networking.


This article is part of a series. Read the introduction here.
Tools
AKS Diagnostics. In Azure portal, navigate to the AKS cluster resource. Select Diagnose and solve problems.

Diagnostics shows a list of results from various test runs. If any issues are found, More info can show you
information about the underlying issue. In the example shown, network and connectivity issues are caused by the
Azure CNI subnet configuration.
Next steps
Examine the node and pod health
Examine the node and pod health

If the cluster checks are clear, check the health of the AKS worker nodes. Determine the reason for the unhealthy
node and resolve the issue.
This article is part of a series. Read the introduction here.

1- Check the overall health of the worker nodes


A node can be unhealthy because of various reasons. A common reason is when the control plane to node
communication is broken as a result of misconfiguration of routing and firewall rules. As a fix, you can allow the
necessary ports and fully qualified domain names through the firewall according to the AKS egress traffic
guidance. Another reason can be kubelet pressures. In that case, add more compute, memory, or storage
resources.
Tools
You can check node health in one of these ways:
Azure Monitor - Containers health view. In Azure portal, open Azure Monitor. Select Containers. On
the right pane, select Monitored clusters. Select the cluster to view the health of the nodes, user pods, and
system pods.

AKS - Nodes view. In Azure portal, navigate to the cluster. Select Insights under Monitoring.
View Nodes on the right pane.
Prometheus and Grafana Dashboard. Open the Node Conditions dashboard.

2- Verify the control plane and worker node connectivity


If worker nodes are healthy, examine the connectivity between the managed AKS control plane and the worker
nodes. Depending on the age and type of the cluster configuration, the connectivity pods are either tunnelfront or
aks-link, and are located in the kube-system namespace.
Tools
kubectl
Azure Monitor for Containers

If tunnelfront or aks-link connectivity is not working, establish connectivity after checking that the appropriate
AKS egress traffic rules have been allowed. Here are the steps:
1. Restart tunnelfront or aks-link .

kubectl rollout restart deploy aks-link

If restarting the pods doesn't fix the connection, continue to the next step.
2. Check the logs and look for abnormalities. This output shows logs for a working connection.

kubectl logs -l app=aks-link -c openvpn-client --tail=50

You can also retrieve those logs by searching the container logs in the logging and monitoring service. This
example searches Azure Monitor for Containers to check for aks-link connectivity errors.

let ContainerIDs = KubePodInventory
| where ClusterId =~ "/subscriptions/YOUR_SUBSCRIPTION_ID/resourceGroups/RESOURCE_GROUP/providers/Microsoft.ContainerService/managedClusters/YOUR_CLUSTER_ID"
| where Name has "aks-link"
| distinct ContainerID;
ContainerLog
| where ContainerID in (ContainerIDs)
| project LogEntrySource, LogEntry, TimeGenerated, Computer, Image, Name, ContainerID
| order by TimeGenerated desc
| limit 200

Here's another example query to check for tunnelfront connectivity errors.

let ContainerIDs = KubePodInventory
| where ClusterId =~ "/subscriptions/YOUR_SUBSCRIPTION_ID/resourceGroups/RESOURCE_GROUP/providers/Microsoft.ContainerService/managedClusters/YOUR_CLUSTER_ID"
| where Name has "tunnelfront"
| distinct ContainerID;
ContainerLog
| where ContainerID in (ContainerIDs)
| where LogEntry has "ssh to tunnelend is not connected"
| project LogEntrySource, LogEntry, TimeGenerated, Computer, Image, Name, ContainerID
| order by TimeGenerated desc
| limit 200

If you can't get the logs through kubectl or the queries, use SSH to connect to the node. This example finds the tunnelfront
pod after connecting to the node through SSH.

kubectl get pods -n kube-system -o wide | grep tunnelfront


ssh azureuser@<agent node pod is on, output from step above>
docker ps | grep tunnelfront
docker logs …
nslookup <ssh-server_fqdn>
ssh -vv azureuser@<ssh-server_fqdn> -p 9000
docker exec -it <tunnelfront_container_id> /bin/bash -c "ping bing.com"
kubectl get pods -n kube-system -o wide | grep <agent_node_where_tunnelfront_is_running>
kubectl delete po <kube_proxy_pod> -n kube-system
3- Validate DNS resolution when restricting egress
DNS resolution is a critical component of your cluster. If DNS resolution isn't working, then control plane errors or
container image pull failures may occur.
Tools
nslookup
dig

Follow these steps to make sure that DNS resolution is working.


1. Exec into the pod to examine and use nslookup or dig if those tools are installed on the pod.
2. If the pod doesn't have those tools, start a utility pod in the same namespace and retry with the tools (a sketch of such a pod follows this list).
3. If those steps don't show insights, SSH to one of the nodes and try resolution from there. This step will help
determine whether the issue is related to AKS or to the networking configuration.
4. If DNS resolves from the node, then the issue is related to Kubernetes DNS and not a networking issue. Restart
Kubernetes DNS and check whether the issue is resolved. If not, open a Microsoft support ticket.
5. If DNS doesn't resolve from the node, then check the networking setup again to make sure that the appropriate
routing paths and ports are open.
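As referenced in step 2, here's a minimal sketch of a utility pod you can use for DNS checks; the image and namespace are illustrative.

# A minimal sketch of a DNS troubleshooting pod. Run it in the same namespace
# as the workload you're diagnosing.
apiVersion: v1
kind: Pod
metadata:
  name: dns-debug
  namespace: a0008
spec:
  restartPolicy: Never
  containers:
  - name: dns-debug
    image: busybox:1.28         # this tag includes a working nslookup
    command: ["sleep", "3600"]  # keep the pod alive so you can exec into it

You can then run kubectl exec -it dns-debug -n a0008 -- nslookup kubernetes.default to test in-cluster resolution.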

4- Check for kubelet errors


Check the kubelet process running on each worker node and make sure it's not experiencing any pressures. Those
pressures can be related to CPU, memory, or storage.
Tools
AKS - Kubelet Workbook
Prometheus and Grafana Dashboard - Kubelet Dashboard

The pressure increases when kubelet restarts and causes some sporadic, unpredictable behavior. Make sure that
the error count isn't continuously growing. An occasional error is acceptable but a constant growth indicates an
underlying issue that needs to be investigated and resolved.

5- Check disk IOPS for throttling


Check to see that file operations (IOPS) are not getting throttled and impacting services in the cluster.
Tools
Azure Monitor for Containers Disk IO Workbook
Prometheus and Grafana Dashboard - Node Disk Dashboard

Physical storage devices have limitations on bandwidth and on the total number of file operations. Azure Disks are used to
store the OS running on the AKS nodes. They are subject to the same physical storage limitations.
Another way is to look at the throughput. Throughput in MB/s is measured as average IO size multiplied by IOPS.
Large IO sizes lead to lower IOPS because the throughput of a disk doesn't change.
When a workload exceeds Azure Disks' service limits on Max IOPS, the cluster becomes unresponsive and blocked
in IO Wait. Everything on Linux is a file. This includes network sockets, CNI, Docker, and other services that use
network I/O. All of those files will fail if they're unable to read the disk.
The events that can trigger IOPS throttle include:
High volume of containers running on the nodes. Docker IO is shared on the OS disk.
Custom or third-party tools used for security, monitoring, logging that are running on the OS disk.
Node failover events, and periodic jobs. As the load increases or the pods are scaled, this throttling occurs more
frequently until all nodes go to NotReady state while the IO completes.
Related links
Virtual machine disk limits
Relationship between Virtual Machine and Disk Performance

Next steps
Check the workload deployments
Check the workload deployments

Check to see that all deployments and daemonSets are running. The Ready and Available counts should match the
expected count.
This article is part of a series. Read the introduction here.
Tools
AKS - Workloads. In Azure portal, navigate to the AKS cluster resource. Select Workloads.

Prometheus and Grafana Dashboard. Deployment Status Dashboard. This image is from Grafana
Community Chart 10856.

Next steps
Validate the admission controllers
Validate the admission controllers are working as
expected

Check whether the admission controllers are working as expected.


This article is part of a series. Read the introduction here.
Issues because of admission controllers are an edge case but they should be considered. Here are some examples:
Mutating and validating webhooks. Be careful when you add mutating and validating webhooks in your
cluster. Make sure that they're highly available so that an unhealthy node doesn't cause API server requests
to be blocked. AKS Policy, also known as OPA Gatekeeper, is an example of this type of webhook. If there
are problems in the admission control pipeline, it can block a large number of requests to the API server.
Service meshes. They use admission controllers to automatically inject sidecars, for example.
Tools
kubectl

These commands check whether AKS Policy is running in your cluster and validate that all of the admission
controllers are functioning as expected.

# Check AKS Policy pods are running.


kubectl get po -n gatekeeper-system

# Sample Output
...
NAME READY STATUS RESTARTS AGE
gatekeeper-audit-65844778cb-rkflg 1/1 Running 0 163m
gatekeeper-controller-78797d4687-4pf6w 1/1 Running 0 163m
gatekeeper-controller-78797d4687-splzh 1/1 Running 0 163m
...

If this command doesn't run as expected, it could indicate that an admission controller, API service, or CRD isn't
functioning correctly.

# Check that all API Resources are working correctly.


kubectl api-resources

# Sample Output
...
NAME SHORTNAMES APIGROUP NAMESPACED KIND
bindings true Binding
componentstatuses cs false
ComponentStatus
configmaps cm true ConfigMap
...

Next steps
Verify the connection to the container registry
Verify the connection to the container registry

Make sure that the worker nodes have the correct permission to pull the necessary container images from the
container registry.
This article is part of a series. Read the introduction here.
A common symptom of this issue is receiving ImagePullBackoff errors when getting or describing a pod. Be sure
that the registry and image name are correct. Also, the cluster has permissions to pull from the appropriate
container registry.
If you are using Azure Container Registry (ACR), the cluster service principal or managed identity should be
granted AcrPull permissions against the registry.
One way is to run this command using the managed identity of the AKS cluster node pool. This command gets a
list of its permissions.

# Get Kubelet Identity (Nodepool MSI)


ASSIGNEE=$(az aks show -g $RESOURCE_GROUP -n $NAME --query identityProfile.kubeletidentity.clientId -o tsv)
az role assignment list --assignee $ASSIGNEE --all -o table

# Expected Output
...
e5615a90-1767-4a4f-83b6-cecfa0675970 AcrPull
/subscriptions/.../providers/Microsoft.ContainerRegistry/registries/akskhacr
...

If you're using another container registry, check the appropriate ImagePullSecret credentials for the registry.

Related links
Import container images to a container registry
AKS Roadmap
Microservices architecture on Azure Kubernetes
Service (AKS)

This reference architecture shows a microservices application deployed to Azure Kubernetes Service (AKS). It
describes a basic AKS configuration that can be the starting point for most deployments. This article assumes
basic knowledge of Kubernetes. The article focuses mainly on the infrastructure and DevOps considerations of
running a microservices architecture on AKS. For guidance on how to design microservices, see Building
microservices on Azure.

A reference implementation of this architecture is available on GitHub.

Download a Visio file of this architecture.

Architecture
The architecture consists of the following components.
Azure Kubernetes Service (AKS). AKS is an Azure service that deploys a managed Kubernetes cluster.
Kubernetes cluster. AKS is responsible for deploying the Kubernetes cluster and for managing the Kubernetes
API server. You only manage the agent nodes.
Virtual network. By default, AKS creates a virtual network to deploy the agent nodes into. For more advanced
scenarios, you can create the virtual network first, which lets you control things like how the subnets are
configured, on-premises connectivity, and IP addressing. For more information, see Configure advanced
networking in Azure Kubernetes Service (AKS).
Ingress. An ingress exposes HTTP(S) routes to services inside the cluster. For more information, see the section
API Gateway below.
Azure Load Balancer. An Azure Load Balancer is created when the NGINX ingress controller is deployed. The
load balancer routes internet traffic to the ingress.
External data stores. Microservices are typically stateless and write state to external data stores, such as Azure
SQL Database or Cosmos DB.
Azure Active Directory. AKS uses an Azure Active Directory (Azure AD) identity to create and manage other
Azure resources such as Azure load balancers. Azure AD is also recommended for user authentication in client
applications.
Azure Container Registry. Use Container Registry to store private Docker images, which are deployed to the
cluster. AKS can authenticate with Container Registry using its Azure AD identity. Note that AKS does not require
Azure Container Registry. You can use other container registries, such as Docker Hub.
Azure Pipelines. Pipelines is part of Azure DevOps Services and runs automated builds, tests, and deployments.
You can also use third-party CI/CD solutions such as Jenkins.
Helm. Helm is a package manager for Kubernetes, a way to bundle Kubernetes objects into a single unit that
you can publish, deploy, version, and update.
Azure Monitor. Azure Monitor collects and stores metrics and logs, including platform metrics for the Azure
services in the solution and application telemetry. Use this data to monitor the application, set up alerts and
dashboards, and perform root cause analysis of failures. Azure Monitor integrates with AKS to collect metrics from
controllers, nodes, and containers, as well as container and node logs.

Design considerations
This reference architecture is focused on microservices architectures, although many of the recommended
practices will apply to other workloads running on AKS.
Microservices
A microservice is a loosely coupled, independently deployable unit of code. Microservices typically communicate
through well-defined APIs and are discoverable through some form of service discovery. The service should
always be reachable even when the pods move around. The Kubernetes Ser vice object is a natural way to model
microservices in Kubernetes.
API gateway
API gateways are a general microservices design pattern. An API gateway sits between external clients and the
microservices. It acts as a reverse proxy, routing requests from clients to microservices. It may also perform
various cross-cutting tasks such as authentication, SSL termination, and rate limiting. For more information, see
Using API gateways in microservices
Choosing a gateway technology
In Kubernetes, the functionality of an API gateway is mostly handled by the Ingress resource and the Ingress
controller . The considerations are described in the Ingress section.
Data storage
In a microservices architecture, services should not share data storage. Each service should own its own private
data in a separate logical storage, to avoid hidden dependencies among services. The reason is to avoid
unintentional coupling between services, which can happen when services share the same underlying data
schemas. Also, when services manage their own data stores, they can use the right data store for their particular
requirements. For more information, see Designing microservices: Data considerations.
Avoid storing persistent data in local cluster storage, because that ties the data to the node. Instead,
Use an external service such as Azure SQL Database or Cosmos DB, or
Mount a persistent volume using Azure Disks or Azure Files. Use Azure Files if the same volume needs to
be shared by multiple pods.

Service object
The Kubernetes Service object provides a set of capabilities that match the microservices requirements for
service discoverability:
IP address. The Service object provides a static internal IP address for a group of pods (ReplicaSet). As pods
are created or moved around, the service is always reachable at this internal IP address.
Load balancing. Traffic sent to the service's IP address is load balanced to the pods.
Service discovery. Services are assigned internal DNS entries by the Kubernetes DNS service. That means
the API gateway can call a backend service using the DNS name. The same mechanism can be used for
service-to-service communication. The DNS entries are organized by namespace, so if your namespaces
correspond to bounded contexts, then the DNS name for a service will map naturally to the application
domain.
The following diagram shows the conceptual relation between services and pods. The actual mapping to endpoint
IP addresses and ports is done by kube-proxy, the Kubernetes network proxy.
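For example, a Service for a hypothetical microservice whose pods are labeled app: delivery might look like this sketch.

# A minimal sketch of a Service. The Service gets a stable cluster IP and a DNS
# name (delivery.<namespace>.svc.cluster.local). Names and ports are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: delivery
  namespace: backend
spec:
  selector:
    app: delivery          # matches the pods of the delivery microservice
  ports:
  - port: 80               # port exposed by the Service
    targetPort: 8080       # container port on the pods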

Ingress
In Kubernetes, the Ingress controller might implement the API gateway pattern. In that case, Ingress and
Ingress controller work in conjunction to provide these features:
Route client requests to the right backend services. This provides a single endpoint for clients, and helps to
decouple clients from services.
Aggregate multiple requests into a single request, to reduce chattiness between the client and the backend.
Offload functionality from the backend services, such as SSL termination, authentication, IP restrictions, or
client rate limiting (throttling).
Ingress abstracts the configuration settings for a proxy server. You also need an Ingress controller, which provides
the underlying implementation of the Ingress. There are Ingress controllers for Nginx, HAProxy, Traefik, and Azure
Application Gateway, among others.
The Ingress resource can be fulfilled by different technologies. To work together, they need to be deployed as the
Ingress controller inside the cluster. It operates as the edge router or reverse proxy. A reverse proxy server is a
potential bottleneck or single point of failure, so always deploy at least two replicas for high availability.
Often, configuring the proxy server requires complex files, which can be hard to tune if you aren't an expert. So,
the Ingress controller provides a nice abstraction. The Ingress controller also has access to the Kubernetes API, so
it can make intelligent decisions about routing and load balancing. For example, the Nginx ingress controller
bypasses the kube-proxy network proxy.
On the other hand, if you need complete control over the settings, you may want to bypass this abstraction and
configure the proxy server manually. For more information, see Deploying Nginx or HAProxy to Kubernetes.

For AKS, you can also use Azure Application Gateway, using the Application Gateway Ingress Controller. This
option requires CNI networking to be enabled when you configure the AKS cluster, because Application
Gateway is deployed into a subnet of the AKS virtual network. Azure Application Gateway can perform layer-7
routing and SSL termination. It also has built-in support for web application firewall (WAF).

For information about load-balancing services in Azure, see Overview of load-balancing options in Azure.
TLS/SSL encryption
In common implementations, the Ingress controller is used for SSL termination. So, as part of deploying the
Ingress controller, you need to create a TLS certificate. Only use self-signed certificates for dev/test purposes. For
more information, see Create an HTTPS ingress controller and use your own TLS certificates on Azure Kubernetes
Service (AKS).
For production workloads, get signed certificates from trusted certificate authorities (CA). For information about
generating and configuring Let's Encrypt certificates, see Create an ingress controller with a static public IP
address in Azure Kubernetes Service (AKS).
You may also need to rotate your certificates as per the organization's policies. For information, see, Rotate
certificates in Azure Kubernetes Service (AKS).

Namespaces
Use namespaces to organize services within the cluster. Every object in a Kubernetes cluster belongs to a
namespace. By default, when you create a new object, it goes into the default namespace. But it's a good practice
to create namespaces that are more descriptive to help organize the resources in the cluster.
First, namespaces help prevent naming collisions. When multiple teams deploy microservices into the same
cluster, with possibly hundreds of microservices, it gets hard to manage if they all go into the same namespace. In
addition, namespaces allow you to:
Apply resource constraints to a namespace, so that the total set of pods assigned to that namespace cannot
exceed the resource quota of the namespace.
Apply policies at the namespace level, including RBAC and security policies.
For a microservices architecture, consider organizing the microservices into bounded contexts, and creating
namespaces for each bounded context. For example, all microservices related to the "Order Fulfillment" bounded
context could go into the same namespace. Alternatively, create a namespace for each development team.
Place utility services into their own separate namespace. For example, you might deploy Elasticsearch or
Prometheus for cluster monitoring, or Tiller for Helm.
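For illustration, the following sketch creates a namespace for the hypothetical "Order Fulfillment" bounded context and attaches a resource quota to it. The names and quota values are placeholders that you would tune for your own workloads.

apiVersion: v1
kind: Namespace
metadata:
  name: order-fulfillment        # hypothetical bounded-context namespace
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: order-fulfillment-quota
  namespace: order-fulfillment
spec:
  hard:
    requests.cpu: "4"            # total CPU that pods in the namespace can request
    requests.memory: 8Gi
    limits.cpu: "8"              # total CPU limit across all pods in the namespace
    limits.memory: 16Gi
    pods: "50"                   # cap on the number of pods in the namespace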

Autoscaling
Kubernetes supports scale-out at two levels:
Scale the number of pods allocated to a deployment.
Scale the nodes in the cluster, to increase the total compute resources available to the cluster.
Although you can scale out pods and nodes manually, we recommend using autoscaling, to minimize the chance
that services will become resource starved under high load. An autoscaling strategy must take both pods and
nodes into account. If you just scale out the pods, eventually you will reach the resource limits of the nodes.
Pod autoscaling
The Horizontal Pod Autoscaler (HPA) scales pods based on observed CPU, memory, or custom metrics. To
configure horizontal pod scaling, you specify a target metric (for example, 70% of CPU), and the minimum and
maximum number of replicas. You should load test your services to derive these numbers.
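For example, a Horizontal Pod Autoscaler targeting 70% average CPU utilization might look like the following sketch; the deployment name and replica bounds are placeholders that you would derive from load testing.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa        # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service          # hypothetical deployment to scale
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when average CPU exceeds 70% of the requested amount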
A side-effect of autoscaling is that pods may be created or evicted more frequently, as scale-out and scale-in
events happen. To mitigate the effects of this:
Use readiness probes to let Kubernetes know when a new pod is ready to accept traffic.
Use pod disruption budgets to limit how many pods can be evicted from a service at a time.
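A pod disruption budget for a hypothetical service could look like the sketch below; the label selector and minAvailable value are illustrative.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-service-pdb        # placeholder name
spec:
  minAvailable: 2                # keep at least two pods running during voluntary disruptions
  selector:
    matchLabels:
      app: order-service         # hypothetical pod label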
Cluster autoscaling
The cluster autoscaler scales the number of nodes. If pods can't be scheduled because of resource constraints, the
cluster autoscaler will provision more nodes.
Whereas HPA looks at actual resources consumed or other metrics from running pods, the cluster autoscaler is
provisioning nodes for pods that aren't scheduled yet. Therefore, it looks at the requested resources, as specified
in the Kubernetes pod spec for a deployment. Use load testing to fine-tune these values.
You can't change the VM size after you create the cluster, so you should do some initial capacity planning to
choose an appropriate VM size for the agent nodes when you create the cluster.

Health probes
Kubernetes defines two types of health probe that a pod can expose:
Readiness probe: Tells Kubernetes whether the pod is ready to accept requests.
Liveness probe: Tells Kubernetes whether a pod should be removed and a new instance started.
When thinking about probes, it's useful to recall how a service works in Kubernetes. A service has a label selector
that matches a set of (zero or more) pods. Kubernetes load balances traffic to the pods that match the selector.
Only pods that started successfully and are healthy receive traffic. If a container crashes, Kubernetes kills the pod
and schedules a replacement.
Sometimes, a pod may not be ready to receive traffic, even though the pod started successfully. For example, there
may be initialization tasks, where the application running in the container loads things into memory or reads
configuration data. To indicate that a pod is healthy but not ready to receive traffic, define a readiness probe.
Liveness probes handle the case where a pod is still running, but is unhealthy and should be recycled. For
example, suppose that a container is serving HTTP requests but hangs for some reason. The container doesn't
crash, but it has stopped serving any requests. If you define an HTTP liveness probe, the probe will stop
responding and that informs Kubernetes to restart the pod.
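As a sketch, both probes are declared on the container spec. The health endpoints, port, image, and timings below are placeholders and assume the service exposes HTTP endpoints for readiness and liveness checks.

apiVersion: v1
kind: Pod
metadata:
  name: order-service           # hypothetical pod; in practice, probes are usually defined in a Deployment's pod template
spec:
  containers:
  - name: order-service
    image: myregistry.azurecr.io/order-service:1.0   # placeholder image
    ports:
    - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /ready            # assumed readiness endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /healthz          # assumed liveness endpoint
        port: 8080
      initialDelaySeconds: 30   # allow for slow startup before liveness checks begin
      periodSeconds: 15
      failureThreshold: 3       # restart the container after three consecutive failures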
Here are some considerations when designing probes:
If your code has a long startup time, there is a danger that a liveness probe will report failure before the
startup completes. To prevent this, use the initialDelaySeconds setting, which delays the probe from
starting.
A liveness probe doesn't help unless restarting the pod is likely to restore it to a healthy state. You can use a
liveness probe to mitigate against memory leaks or unexpected deadlocks, but there's no point in restarting
a pod that's going to immediately fail again.
Sometimes readiness probes are used to check dependent services. For example, if a pod has a dependency
on a database, the probe might check the database connection. However, this approach can create
unexpected problems. An external service might be temporarily unavailable for some reason. That will
cause the readiness probe to fail for all the pods in your service, causing all of them to be removed from
load balancing, and thus creating cascading failures upstream. A better approach is to implement retry
handling within your service, so that your service can recover correctly from transient failures.

Resource constraints
Resource contention can affect the availability of a service. Define resource constraints for containers, so that a
single container cannot overwhelm the cluster resources (memory and CPU). For non-container resources, such
as threads or network connections, consider using the Bulkhead Pattern to isolate resources.
Use resource quotas to limit the total resources allowed for a namespace. That way, the front end can't starve the
backend services for resources or vice-versa.
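For example, per-container requests and limits are declared in the pod spec; the values shown here are placeholders that should be derived from load testing.

apiVersion: v1
kind: Pod
metadata:
  name: catalog-service          # hypothetical pod
spec:
  containers:
  - name: catalog-service
    image: myregistry.azurecr.io/catalog-service:1.0   # placeholder image
    resources:
      requests:
        cpu: 250m                # amount used by the scheduler to place the pod
        memory: 256Mi
      limits:
        cpu: 500m                # hard cap; the container is throttled above this
        memory: 512Mi            # exceeding this causes the container to be OOM-killed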

Role-based access control (RBAC)


Kubernetes and Azure both have mechanisms for role-based access control (RBAC):
Azure RBAC controls access to resources in Azure, including the ability to create new Azure resources.
Permissions can be assigned to users, groups, or service principals. (A service principal is a security identity
used by applications.)
Kubernetes RBAC controls permissions to the Kubernetes API. For example, creating pods and listing pods
are actions that can be authorized (or denied) to a user through RBAC. To assign Kubernetes permissions to
users, you create roles and role bindings:
A Role is a set of permissions that apply within a namespace. Permissions are defined as verbs (get,
update, create, delete) on resources (pods, deployments, etc.).
A RoleBinding assigns users or groups to a Role.
There is also a ClusterRole object, which is like a Role but applies to the entire cluster, across all
namespaces. To assign users or groups to a ClusterRole, create a ClusterRoleBinding.
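For example, a namespaced Role that grants read-only access to pods, and a RoleBinding that assigns it to a group, might look like the following sketch. The namespace name and the Azure AD group object ID are placeholders.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: order-fulfillment        # hypothetical namespace
rules:
- apiGroups: [""]                     # "" indicates the core API group
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: order-fulfillment
subjects:
- kind: Group
  name: "00000000-0000-0000-0000-000000000000"   # placeholder Azure AD group object ID
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io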
AKS integrates these two RBAC mechanisms. When you create an AKS cluster, you can configure it to use Azure
AD for user authentication. For details on how to set this up, see Integrate Azure Active Directory with Azure
Kubernetes Service.
Once this is configured, a user who wants to access the Kubernetes API (for example, through kubectl) must sign
in using their Azure AD credentials.
By default, an Azure AD user has no access to the cluster. To grant access, the cluster administrator creates
RoleBindings that refer to Azure AD users or groups. If a user doesn't have permissions for a particular operation,
it will fail.
If users have no access by default, how does the cluster admin have permission to create the role bindings in the
first place? An AKS cluster actually has two types of credentials for calling the Kubernetes API server: cluster user
and cluster admin. The cluster admin credentials grant full access to the cluster. The Azure CLI command
az aks get-credentials --admin downloads the cluster admin credentials and saves them into your kubeconfig
file. The cluster administrator can use this kubeconfig to create roles and role bindings.
Because the cluster admin credentials are so powerful, use Azure RBAC to restrict access to them:
The "Azure Kubernetes Service Cluster Admin Role" has permission to download the cluster admin
credentials. Only cluster administrators should be assigned to this role.
The "Azure Kubernetes Service Cluster User Role" has permission to download the cluster user credentials.
Non-admin users can be assigned to this role. This role does not give any particular permissions on
Kubernetes resources inside the cluster — it just allows a user to connect to the API server.
When you define your RBAC policies (both Kubernetes and Azure), think about the roles in your organization:
Who can create or delete an AKS cluster and download the admin credentials?
Who can administer a cluster?
Who can create or update resources within a namespace?
It's a good practice to scope Kubernetes RBAC permissions by namespace, using Roles and RoleBindings, rather
than ClusterRoles and ClusterRoleBindings.
Finally, there is the question of what permissions the AKS cluster has to create and manage Azure resources, such
as load balancers, networking, or storage. To authenticate itself with Azure APIs, the cluster uses an Azure AD
service principal. If you don't specify a service principal when you create the cluster, one is created automatically.
However, it's a good security practice to create the service principal first and assign the minimal RBAC permissions
to it. For more information, see Service principals with Azure Kubernetes Service.

Secrets management and application credentials


Applications and services often need credentials that allow them to connect to external services such as Azure
Storage or SQL Database. The challenge is to keep these credentials safe and not leak them.
For Azure resources, one option is to use managed identities. The idea of a managed identity is that an application
or service has an identity stored in Azure AD, and uses this identity to authenticate with an Azure service. The
application or service has a Service Principal created for it in Azure AD, and authenticates using OAuth 2.0 tokens.
The executing process calls a localhost address to get the token. That way, you don't need to store any passwords
or connection strings. You can use managed identities in AKS by assigning identities to individual pods, using the
aad-pod-identity project.
Currently, not all Azure services support authentication using managed identities. For a list, see Azure services
that support Azure AD authentication.
Even with managed identities, you'll probably need to store some credentials or other application secrets, whether
for Azure services that don't support managed identities, third-party services, API keys, and so on. Here are some
options for storing secrets securely:
Azure Key Vault. In AKS, you can mount one or more secrets from Key Vault as a volume. The volume reads
the secrets from Key Vault. The pod can then read the secrets just like a regular volume. For more
information, see the secrets-store-csi-driver-provider-azure project on GitHub.
The pod authenticates itself by using either a pod identity (described above) or by using an Azure AD
Service Principal along with a client secret. Using pod identities is recommended because the client secret
isn't needed in that case.
HashiCorp Vault. Kubernetes applications can authenticate with HashiCorp Vault using Azure AD managed
identities. See HashiCorp Vault speaks Azure Active Directory. You can deploy Vault itself to Kubernetes,
but consider running it in a separate, dedicated cluster rather than in your application cluster.
Kubernetes secrets. Another option is simply to use Kubernetes secrets. This option is the easiest to
configure but has some challenges. Secrets are stored in etcd, which is a distributed key-value store. AKS
encrypts etcd at rest. Microsoft manages the encryption keys.
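As a minimal sketch of the Kubernetes secrets option, a secret can be declared and then surfaced to a container as an environment variable (or as a mounted volume). The names, image, and connection string below are placeholders.

apiVersion: v1
kind: Secret
metadata:
  name: catalog-db-credentials      # placeholder name
type: Opaque
stringData:
  connection-string: "Server=example;Database=catalog;..."   # placeholder value; stored base64-encoded in etcd
---
apiVersion: v1
kind: Pod
metadata:
  name: catalog-service             # hypothetical pod
spec:
  containers:
  - name: catalog-service
    image: myregistry.azurecr.io/catalog-service:1.0          # placeholder image
    env:
    - name: DB_CONNECTION_STRING
      valueFrom:
        secretKeyRef:
          name: catalog-db-credentials
          key: connection-string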
Using a system like HashiCorp Vault or Azure Key Vault provides several advantages, such as:
Centralized control of secrets.
Ensuring that all secrets are encrypted at rest.
Centralized key management.
Access control of secrets.
Auditing.

Container and Orchestrator security


These are recommended practices for securing your pods and containers:
Threat Monitoring – Monitor for threats using Azure Defender for container registries and Azure
Defender for Kubernetes (or 3rd party capabilities). If you are hosting containers on a VM, use Azure
Defender for servers or a 3rd party capability. Additionally, you can integrate logs from the Container
Monitoring solution in Azure Monitor into Azure Sentinel or an existing SIEM.
Vulnerability monitoring - Continuously monitor images and running containers for known
vulnerabilities using Azure Security Center or a 3rd party solution available through the Azure Marketplace.
Automate image patching using ACR Tasks, a feature of Azure Container Registry. A container image is
built up from layers. The base layers include the OS image and application framework images, such as
ASP.NET Core or Node.js. The base images are typically created upstream from the application developers,
and are maintained by other project maintainers. When these images are patched upstream, it's important
to update, test, and redeploy your own images, so that you don't leave any known security vulnerabilities.
ACR Tasks can help to automate this process.
Store images in a trusted private registry such as Azure Container Registry or Docker Trusted
Registry. Use a validating admission webhook in Kubernetes to ensure that pods can only pull images from
the trusted registry.
Apply the principle of least privilege
Don't run containers in privileged mode. Privileged mode gives a container access to all devices on the
host.
When possible, avoid running processes as root inside containers. Containers do not provide complete
isolation from a security standpoint, so it's better to run a container process as a non-privileged user.
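A minimal sketch of these least-privilege settings in a container spec follows; the image and user ID are placeholders, and the example assumes the image is built to run correctly as a non-root user.

apiVersion: v1
kind: Pod
metadata:
  name: order-service               # hypothetical pod
spec:
  containers:
  - name: order-service
    image: myregistry.azurecr.io/order-service:1.0   # placeholder image
    securityContext:
      privileged: false                # never grant access to host devices
      runAsNonRoot: true               # refuse to start if the image tries to run as root
      runAsUser: 1000                  # assumed non-root UID baked into the image
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true     # optional hardening; requires the app to write only to mounted volumes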

DevOps considerations
This reference architecture provides an Azure Resource Manager template for provisioning the cloud resources and
their dependencies. By using Azure Resource Manager templates, you can use Azure DevOps Services to provision
different environments in minutes, for example to replicate production scenarios. This allows you to save costs by
provisioning a load testing environment only when it's needed.
Consider following the workload isolation criteria to structure your ARM templates. A workload is typically defined
as an arbitrary unit of functionality; for example, you could have one template for the cluster and another for the
dependent services. Workload isolation enables DevOps teams to perform continuous integration and continuous
delivery (CI/CD), since every workload is associated with and managed by its corresponding DevOps team.

Deployment (CI/CD) considerations


Here are some goals of a robust CI/CD process for a microservices architecture:
Each team can build and deploy the services that it owns independently, without affecting or disrupting other
teams.
Before a new version of a service is deployed to production, it gets deployed to dev/test/QA environments for
validation. Quality gates are enforced at each stage.
A new version of a service can be deployed side by side with the previous version.
Sufficient access control policies are in place.
For containerized workloads, you can trust the container images that are deployed to production.
To learn more about the challenges, see CI/CD for microservices architectures.
For specific recommendations and best practices, see CI/CD for microservices on Kubernetes.

Cost considerations
Use the Azure pricing calculator to estimate costs. Other considerations are described in the Cost section in
Microsoft Azure Well-Architected Framework.
Here are some points to consider for some of the services used in this architecture.
Azure Kubernetes Service (AKS)
There are no costs associated with AKS for the deployment, management, and operations of the Kubernetes cluster.
You only pay for the virtual machine instances, storage, and networking resources consumed by your Kubernetes
cluster.
To estimate the cost of the required resources, see the Container Services calculator.
Azure Load Balancer
You are charged only for the number of configured load-balancing and outbound rules. Inbound NAT rules are
free. There is no hourly charge for the Standard Load Balancer when no rules are configured.
See Azure Load Balancer Pricing for more information.
Azure DevOps Services
This reference architecture only uses Azure Pipelines. Azure offers Azure Pipelines as an individual service. You
are allowed one free Microsoft-hosted job with 1,800 minutes per month for CI/CD and one self-hosted job with
unlimited minutes per month; extra jobs incur charges. For more information, see Azure DevOps Services Pricing.
Azure Monitor
For Azure Monitor Log Analytics, you are charged for data ingestion and retention. For more information, see
Azure Monitor Pricing.

Deploy the solution


To deploy the reference implementation for this architecture, follow the steps in the GitHub repo.

Next steps
To learn about monitoring this architecture, see Monitoring a microservices architecture in Azure Kubernetes
Service (AKS).
To learn how we measured the performance of this application, see Performance tuning scenario: Distributed
business transactions.
Microservices with AKS
11/2/2020 • 2 minutes to read

Solution Idea

Use AKS to simplify the deployment and management of microservices-based architectures. AKS streamlines
horizontal scaling, self-healing, load balancing, and secret management.

Architecture

Download an SVG of this architecture.

Data Flow
1. Developer uses IDE such as Visual Studio to commit changes to GitHub
2. GitHub triggers a new build on Azure DevOps
3. Azure DevOps packages microservices as containers and pushes them to the Azure Container Registry
4. Containers are deployed to AKS cluster
5. Users access services via apps and website
6. Azure Active Directory is used to secure access to the resources
7. Microservices use databases to store and retrieve information
8. Administrator accesses via a separate admin portal
Secure DevOps for AKS
11/2/2020 • 2 minutes to read

Solution Idea

DevOps and Kubernetes are better together. By implementing secure DevOps together with Kubernetes on Azure, you
can achieve a balance between speed and security and deliver code faster at scale. Put guardrails around the
development processes by using CI/CD with dynamic policy controls, and accelerate the feedback loop with constant
monitoring. Use Azure Pipelines to deliver fast while ensuring enforcement of critical policies with Azure Policy.
Azure provides you with real-time observability for your build and release pipelines, and the ability to apply
compliance audits and reconfigurations easily.

Architecture

Download an SVG of this architecture.


Data Flow
1. Developers rapidly iterate, test, and debug different parts of an application together in the same Kubernetes
cluster.
2. Code is merged into a GitHub repository, after which automated builds and tests are run by Azure Pipelines.
3. Release pipeline automatically executes pre-defined deployment strategy with each code change.
4. Kubernetes clusters are provisioned using tools like Helm charts that define the desired state of app resources
and configurations.
5. Container image is pushed to Azure Container Registry.
6. Cluster operators define policies in Azure Policy to govern deployments to the AKS cluster.
7. Azure Policy audits requests from the pipeline at the AKS control plane level.
8. App telemetry, container health monitoring, and real-time log analytics are obtained using Azure Monitor.
9. Insights are used to address issues and fed into the next sprint plans.
Build and deploy to Azure Kubernetes Service
Building a telehealth system on Azure
11/2/2020 • 8 minutes to read

This article explains how to build a telehealth system using Azure cloud services. The details are based on a real
customer implementation that connects a professional healthcare organization to its remote patients. While there
are other ways to build such a system, the solution described has been successful in enabling communication
between patients and their remote care provider, as well as remotely tuning the medical devices that patients carry.
There are about 700 million people who suffer from hearing disabilities. However, only 10% of them use hearing
aid devices to improve their lives. In some geographies or situations, it is impossible for a patient to get direct
assistance when needed. For example, consider patients who:
Need help in a specific hearing situation (for example, while walking in the park, attending a party, or being at
home), which cannot be reproduced in the hearing care professional's office.
Have mobility issues or reside long distances from their hearing care professional.
Live in an emerging country/region that has a limited number of hearing care professionals.
To overcome these difficulties, it is important to have the ability to provide hearing care services remotely. In this
case, the healthcare professional uses chat or video communication to engage with their remote patients. People
who are hard of hearing use a smartphone to allow access to the hearing aid device during the remote session. The
patient immediately experiences improved hearing as the hearing care professional deploys changes to the
configuration of the hearing aid device in real time.

Potential use cases


The following additional use cases have similar design patterns:
Any Bluetooth-enabled device can be accessed and remotely tuned using such a solution.
Communication (text, voice, video) or knowledge exchange (education, satisfaction surveys) in a remote
setting/context.
Globally distributed web content management.
Internet of Things (IoT)

Architecture
The solution is built on four pillars, including:
Clients
Communication components
APIs and business logic
Storage and infrastructure services
On the left side of the architectural diagram, there are clients in two groups, the healthcare professional and the
patient. The healthcare professional uses the fitting software and web portal clients to communicate with their
patients. The patients, on the other hand, use a mobile app that is linked to the medical device via a Bluetooth
connection. This back and forth communication is achieved using backend services:
Public facing APIs
Internal microservices that are responsible for workflows such as video calls through WebRTC or client-to-
client communication using SignalR. SignalR is a software library for Microsoft ASP.NET that allows server code to
send asynchronous notifications to client-side web applications.
The state of these services is persisted in several Azure services (on the right side of the diagram) such as Azure
Database for PostgreSQL. Media files are saved in Azure storage accounts. All logs from all services are gathered in
a centralized logging solution that uses Azure Application Insights. Finally, asynchronous communication can be
achieved between the clients through push notifications using the help of the Azure Notification Hub.
The solution was set up in this way to:
Benefit from the scalability of the cloud services running in the backend.
Increase the autonomy of the teams building the solution. Each team oversees functional domains and
drives the evolution of their components. Since the functional domains do not overlap, each team can
innovate at its own pace. Also, since the codebases of the services are independent, the CI/CD pipeline for
the entire solution is simplified.
Build the inter-services communication and coordination mechanism required by the distribution of
functionalities across microservices. The solution described in this document uses Azure Cache for Redis to
accomplish this task.
Achieve central monitoring and enhance the ability to troubleshoot the solution.
Components
Azure Database for PostgreSQL stores user (patient and health care professional) and device-related data. The
service was chosen because it's stable, lightweight, and has no vendor lock-in.
Azure Kubernetes Service hosts the application business logic and provides ease of deployment and flexibility
for customization. The service also abstracts the solution from the actual hardware used underneath.
Azure Cache for Redis hosts temporary intra-service (shared) data. The data can be re-created from the database if
it expires from the cache.
Azure Notification Hub notifies patients of incoming content: chat, video calls, and device configuration settings.
Azure Functions schedules tasks, for instance broad communications to a large set of users and coordination of
analytics work in the backend (aggregations…).
Azure Application Insights centralizes signals and events from the system (logs and telemetry from
microservices, the frontend, and devices) for troubleshooting purposes.
Azure Content Delivery Network (CDN) is used for maintenance and updates (delivery of JavaScript files) to the
web portal and to deliver media files (videos, images) through the portal. All this content is stored in the Azure
storage accounts in the background.
Azure Traffic Manager load balances between geo locations.
Azure SignalR allows server code to send asynchronous notifications to client-side web applications.
End-user devices can be configured in either Standard or Advanced mode.
Standard mode
In Standard mode, the fitting software prepares a notification, which contains some configuration JSON file or
content for the device. The notification is then passed to Azure Notification Hub, which pushes the notification to
the user's phone.
Advanced mode
In Advanced mode, the hearing aid professional uses the fitting software to push detailed configuration to the
device. This requires a stable and reliable connection between the backend and the device, which SignalR achieves
by using WebSockets. The end user's phone is on the receiving end of this channel. From the phone, the Bluetooth
connection establishes the final communication link with the device.

Alternatives
On the database side, any other PaaS database service could be used. For hosting the application logic, rather than
Azure Kubernetes Service, Azure App Service or Azure Service Fabric could be used.

Learnings
We recommend using a traffic manager in front of the different clusters to optimize for latency between regions
and as a fallback mechanism should the clusters become unavailable. For the databases, we recommend using
read-only replicas for queries that require loading and aggregating a large amount of data. We recommend
delivering static web files (.html, .js, images, etc.) globally using a content delivery network (CDN) to improve speed
through caching.
Deployment
The most important aspect to consider when deploying this scenario is the coordination of deployments across the
cloud-based backend and the frontend (phones/devices). Consider using the concept of a feature flag to achieve
this.
Management
To better align to the idea of having each functional domain dealt with using a specific microservice, long term,
there is an opportunity to split the database into several smaller databases. Doing so will enable the isolation and
autonomy of the flows related to each microservice, as opposed to concentrating the data related to all
services into a single database. Achieving this goal will require automating provisioning and managing those
databases, which is one of the core capabilities of a PaaS database service in the cloud. That database management
layer should be integrated in the solution as well as into the unified monitoring solution.
Monitoring
It is important to monitor each of the tiers, and each monitoring facet should be federated into a single bucket in
the cloud. It is important to enable the correlation of all these logs and telemetry data points to ensure holistic
insights across components and layers.
Today, monitored layers include the:
Windows application (fitting software on the hearing care professional's desktop)
Hosted application logic
Cloud services
Sizing and scaling
Make sure to optimize the configuration of the Azure Kubernetes clusters to match the scale requirements that
fluctuate with the time of the day or regional patterns. Consider offloading read workloads (such as aggregating
queries) by using Read Replicas in Azure Database for PostgreSQL.
Using the TimescaleDB extension of PostgreSQL will enable more efficient handling of the time-related data
coming from the medical devices. Consider using a scale out solution such as Azure Database for PostgreSQL –
Hyperscale (Citus) to reach global scale by provisioning multiple database nodes.

Security and compliance


This solution handles PHI and personal data. As such, it is important to use services that are certified for medical
applications (HIPAA certifications, not only for the data that remains in the database but also the logs and telemetry
data). For details please consult the HIPAA section of the Microsoft Trust Center.

Pricing
For a deployment in a single region, example pricing information is available in the Pricing Calculator.

Next steps
To get started with implementing a comparable architecture for your business, consider building skills around web
services, databases such as Azure Database for PostgreSQL, and mobile application development techniques and
technologies such as Xamarin and .NET Core.

Related resources
Real-time communications
More information about how WebRTC provides real-time communication capabilities to mobile applications is
available on the WebRTC project site.
Turn servers
Use a client library such as Icelink (loaded by the application on the phone and by the fitting software on the
hearing aid professional's desktop) to manage the TURN servers* and the types of connection (TCP, UDP, P2P)
between the two clients (the fitting software and the application on the phone). The client library:
Creates the streaming channel
Establishes the connections
Manages the connection in case of errors or missing packets, and automatically adjusts the streaming to variations
in the bandwidth
Encodes/decodes the calls (audio and/or video) during the calls
*TURN servers are network entities in charge of relaying media in VoIP-related protocols. In this solution, they are
hosted by Xirsys (https://xirsys.com/) in several datacenters worldwide, and they help establish a direct connection
between two clients under the same session.
CI/CD pipeline for container-based workloads
11/2/2020 • 7 minutes to read

This example scenario is applicable to businesses that want to modernize application development by using
containers and DevOps workflows. In this scenario, a Node.js web app is built and deployed by Jenkins into an
Azure Container Registry and Azure Kubernetes Service. For a globally distributed database tier, Azure Cosmos DB
is used. To monitor and troubleshoot application performance, Azure Monitor integrates with a Grafana instance
and dashboard.
Example application scenarios include providing an automated development environment, validating new code
commits, and pushing new deployments into staging or production environments. Traditionally, businesses had to
manually build and compile applications and updates, and maintain a large, monolithic code base. With a modern
approach to application development that uses continuous integration (CI) and continuous deployment (CD), you
can more quickly build, test, and deploy services. This modern approach lets you release applications and updates
to your customers faster, and respond to changing business demands in a more agile manner.
By using Azure services such as Azure Kubernetes Service, Container Registry, and Cosmos DB, companies can use
the latest in application development techniques and tools to simplify the process of implementing high
availability.

Relevant use cases


Other relevant use cases include:
Modernizing application development practices to a microservice, container-based approach.
Speeding up application development and deployment lifecycles.
Automating deployments to test or acceptance environments for validation.

Architecture

This scenario covers a DevOps pipeline for a Node.js web application and database back end. The data flows
through the scenario as follows:
1. A developer makes changes to the Node.js web application source code.
2. The code change is committed to a source control repository, such as GitHub.
3. To start the continuous integration (CI) process, a GitHub webhook triggers a Jenkins project build.
4. The Jenkins build job uses a dynamic build agent in Azure Kubernetes Service to perform a container build
process.
5. A container image is created from the code in source control, and is then pushed to an Azure Container
Registry.
6. Through continuous deployment (CD), Jenkins deploys this updated container image to the Kubernetes cluster.
7. The Node.js web application uses Cosmos DB as its back end. Both Cosmos DB and Azure Kubernetes Service
report metrics to Azure Monitor.
8. A Grafana instance provides visual dashboards of the application performance based on the data from Azure
Monitor.
Components
Jenkins is an open-source automation server that can integrate with Azure services to enable continuous
integration (CI) and continuous deployment (CD). In this scenario, Jenkins orchestrates the creation of new
container images based on commits to source control, pushes those images to Azure Container Registry, then
updates application instances in Azure Kubernetes Service.
Azure Linux Virtual Machines is the IaaS platform used to run the Jenkins and Grafana instances.
Azure Container Registry stores and manages container images that are used by the Azure Kubernetes Service
cluster. Images are securely stored, and can be replicated to other regions by the Azure platform to speed up
deployment times.
Azure Kubernetes Service is a managed Kubernetes platform that lets you deploy and manage containerized
applications without container orchestration expertise. As a hosted Kubernetes service, Azure handles critical
tasks like health monitoring and maintenance for you.
Azure Cosmos DB is a globally distributed, multi-model database that allows you to choose from various
database and consistency models to suit your needs. With Cosmos DB, your data can be globally replicated, and
there is no cluster management or replication components to deploy and configure.
Azure Monitor helps you track performance, maintain security, and identify trends. Metrics obtained by Monitor
can be used by other resources and tools, such as Grafana.
Grafana is an open-source solution to query, visualize, alert, and understand metrics. A data source plugin for
Azure Monitor allows Grafana to create visual dashboards to monitor the performance of your applications
running in Azure Kubernetes Service and using Cosmos DB.
Alternatives
Azure Pipelines help you implement a continuous integration (CI), test, and deployment (CD) pipeline for any
app.
Kubernetes can be run directly on Azure VMs instead of via a managed service if you would like more control
over the cluster.
Service Fabric is another alternate container orchestrator that can replace AKS.

Considerations
Availability
To monitor your application performance and report on issues, this scenario combines Azure Monitor with Grafana
for visual dashboards. These tools let you monitor and troubleshoot performance issues that may require code
updates, which can all then be deployed with the CI/CD pipeline.
As part of the Azure Kubernetes Service cluster, a load balancer distributes application traffic to one or more
containers (pods) that run your application. This approach to running containerized applications in Kubernetes
provides a highly available infrastructure for your customers.
Scalability
Azure Kubernetes Service lets you scale the number of cluster nodes to meet the demands of your applications. As
your application increases, you can scale out the number of Kubernetes nodes that run your service.
Application data is stored in Azure Cosmos DB, a globally distributed, multi-model database that can scale globally.
Cosmos DB abstracts the need to scale your infrastructure as with traditional database components, and you can
choose to replicate your Cosmos DB globally to meet the demands of your customers.
For other scalability topics, see the performance efficiency checklist available in the Azure Architecture Center.
Security
To minimize the attack footprint, this scenario does not expose the Jenkins VM instance over HTTP. For any
management tasks that require you to interact with Jenkins, you create a secure remote connection using an SSH
tunnel from your local machine. Only SSH public key authentication is allowed for the Jenkins and Grafana VM
instances. Password-based logins are disabled. For more information, see Run a Jenkins server on Azure.
For separation of credentials and permissions, this scenario uses a dedicated Azure Active Directory (AD) service
principal. The credentials for this service principal are stored as a secure credential object in Jenkins so that they
are not directly exposed and visible within scripts or the build pipeline.
For general guidance on designing secure solutions, see the Azure Security Documentation.
Resiliency
This scenario uses Azure Kubernetes Service for your application. Built into Kubernetes are resiliency components
that monitor and restart the containers (pods) if there is an issue. Combined with running multiple Kubernetes
nodes, your application can tolerate a pod or node being unavailable.
For general guidance on designing resilient solutions, see Designing reliable Azure applications.

Deploy the scenario


Prerequisites
You must have an existing Azure account. If you don't have an Azure subscription, create a free account
before you begin.
You need an SSH public key pair. For steps on how to create a public key pair, see Create and use an SSH key
pair for Linux VMs.
You need an Azure Active Directory (AD) service principal for the authentication of service and resources. If
needed, you can create a service principal with az ad sp create-for-rbac

az ad sp create-for-rbac --name myDevOpsScenario

Make a note of the appId and password in the output from this command. You provide these values to the
template when you deploy the scenario.
Find the supported Kubernetes versions for your deployment region by running az aks get-versions. The
following command gets the CLI default version:

az aks get-versions -l <region> --query "orchestrators[?default!=null].orchestratorVersion" -o tsv

Walk-through
To deploy this scenario with an Azure Resource Manager template, perform the following steps.
1. Click the link below to deploy the solution.

2. Wait for the template deployment to open in the Azure portal, then complete the following steps:
Choose to Create new resource group, then provide a name such as myAKSDevOpsScenario in the text
box.
Select a region from the Location drop-down box.
Enter your service principal app ID and password from the az ad sp create-for-rbac command.
Provide a username and secure password for the Jenkins instance and Grafana console.
Provide an SSH key to secure logins to the Linux VMs.
Enter the Kubernetes version from the az aks get-versions command.
Review the terms and conditions, then check I agree to the terms and conditions stated above.
Select the Purchase button.
It can take 15-20 minutes for the deployment to complete.

Pricing
To explore the cost of running this scenario, all of the services are pre-configured in the cost calculator. To see how
the pricing would change for your particular use case, change the appropriate variables to match your expected
traffic.
We have provided three sample cost profiles based on the number of container images to store and Kubernetes
nodes to run your applications.
Small: this pricing example correlates to 1000 container builds per month.
Medium: this pricing example correlates to 100,000 container builds per month.
Large: this pricing example correlates to 1,000,000 container builds per month.

Related resources
This scenario used Azure Container Registry and Azure Kubernetes Service to store and run a container-based
application. Azure Container Instances can also be used to run container-based applications, without having to
provision any orchestration components. For more information, see Azure Container Instances overview.
Azure Data Architecture Guide
11/2/2020 • 2 minutes to read

This guide presents a structured approach for designing data-centric solutions on Microsoft Azure. It is based on
proven practices derived from customer engagements.

Introduction
The cloud is changing the way applications are designed, including how data is processed and stored. Instead of a
single general-purpose database that handles all of a solution's data, polyglot persistence solutions use multiple,
specialized data stores, each optimized to provide specific capabilities. The perspective on data in the solution
changes as a result. There are no longer multiple layers of business logic that read and write to a single data layer.
Instead, solutions are designed around a data pipeline that describes how data flows through a solution, where it
is processed, where it is stored, and how it is consumed by the next component in the pipeline.

How this guide is structured


This guide is structured around two general categories of data solution, traditional RDBMS workloads and big data
solutions.
Traditional RDBMS workloads. These workloads include online transaction processing (OLTP) and online
analytical processing (OLAP). Data in OLTP systems is typically relational data with a predefined schema and a set
of constraints to maintain referential integrity. Often, data from multiple sources in the organization may be
consolidated into a data warehouse, using an ETL process to move and transform the source data.
(Diagram: OLTP systems feed a data warehouse through an ETL process, which in turn supports analysis and reporting.)

Big data solutions. A big data architecture is designed to handle the ingestion, processing, and analysis of data
that is too large or complex for traditional database systems. The data may be processed in batch or in real time.
Big data solutions typically involve a large amount of non-relational data, such as key-value data, JSON
documents, or time series data. Often traditional RDBMS systems are not well-suited to store this type of data. The
term NoSQL refers to a family of databases designed to hold non-relational data. The term isn't quite accurate,
because many non-relational data stores support SQL-compatible queries. The term NoSQL stands for "Not only
SQL".
(Diagram: A big data pipeline moves data through ingestion, processing, and storage stages, which feed machine learning and reporting.)

These two categories are not mutually exclusive, and there is overlap between them, but we feel that it's a useful
way to frame the discussion. Within each category, the guide discusses common scenarios, including relevant
Azure services and the appropriate architecture for the scenario. In addition, the guide compares technology
choices for data solutions in Azure, including open source options. Within each category, we describe the key
selection criteria and a capability matrix, to help you choose the right technology for your scenario.
This guide is not intended to teach you data science or database theory — you can find entire books on those
subjects. Instead, the goal is to help you select the right data architecture or data pipeline for your scenario, and
then select the Azure services and technologies that best fit your requirements. If you already have an architecture
in mind, you can skip directly to the technology choices.
Extract, transform, and load (ETL)
11/2/2020 • 5 minutes to read

A common problem that organizations face is how to gather data from multiple sources, in multiple formats, and
move it to one or more data stores. The destination may not be the same type of data store as the source, and often
the format is different, or the data needs to be shaped or cleaned before loading it into its final destination.
Various tools, services, and processes have been developed over the years to help address these challenges. No
matter the process used, there is a common need to coordinate the work and apply some level of data
transformation within the data pipeline. The following sections highlight the common methods used to perform
these tasks.

Extract, transform, and load (ETL) process


Extract, transform, and load (ETL) is a data pipeline used to collect data from various sources, transform the data
according to business rules, and load it into a destination data store. The transformation work in ETL takes place in
a specialized engine, and often involves using staging tables to temporarily hold data as it is being transformed and
ultimately loaded to its destination.
The data transformation that takes place usually involves various operations, such as filtering, sorting, aggregating,
joining data, cleaning data, deduplicating, and validating data.

Often, the three ETL phases are run in parallel to save time. For example, while data is being extracted, a
transformation process could be working on data already received and prepare it for loading, and a loading process
can begin working on the prepared data, rather than waiting for the entire extraction process to complete.
Relevant Azure service:
Azure Data Factory v2
Other tools:
SQL Server Integration Services (SSIS)

Extract, load, and transform (ELT)


Extract, load, and transform (ELT) differs from ETL solely in where the transformation takes place. In the ELT pipeline,
the transformation occurs in the target data store. Instead of using a separate transformation engine, the
processing capabilities of the target data store are used to transform data. This simplifies the architecture by
removing the transformation engine from the pipeline. Another benefit to this approach is that scaling the target
data store also scales the ELT pipeline performance. However, ELT only works well when the target system is
powerful enough to transform the data efficiently.

Typical use cases for ELT fall within the big data realm. For example, you might start by extracting all of the source
data to flat files in scalable storage such as Hadoop distributed file system (HDFS) or Azure Data Lake Store.
Technologies such as Spark, Hive, or PolyBase can then be used to query the source data. The key point with ELT is
that the data store used to perform the transformation is the same data store where the data is ultimately
consumed. This data store reads directly from the scalable storage, instead of loading the data into its own
proprietary storage. This approach skips the data copy step present in ETL, which can be a time consuming
operation for large data sets.
In practice, the target data store is a data warehouse using either a Hadoop cluster (using Hive or Spark) or Azure
Synapse Analytics. In general, a schema is overlaid on the flat file data at query time and stored as a table, enabling
the data to be queried like any other table in the data store. These are referred to as external tables because the
data does not reside in storage managed by the data store itself, but on some external scalable storage.
The data store only manages the schema of the data and applies the schema on read. For example, a Hadoop
cluster using Hive would describe a Hive table where the data source is effectively a path to a set of files in HDFS. In
Azure Synapse, PolyBase can achieve the same result — creating a table against data stored externally to the
database itself. Once the source data is loaded, the data present in the external tables can be processed using the
capabilities of the data store. In big data scenarios, this means the data store must be capable of massively parallel
processing (MPP), which breaks the data into smaller chunks and distributes processing of the chunks across
multiple machines in parallel.
The final phase of the ELT pipeline is typically to transform the source data into a final format that is more efficient
for the types of queries that need to be supported. For example, the data may be partitioned. Also, ELT might use
optimized storage formats like Parquet, which stores row-oriented data in a columnar fashion and provides
optimized indexing.
Relevant Azure service:
Azure Synapse
HDInsight with Hive
Azure Data Factory v2
Oozie on HDInsight
Other tools:
SQL Server Integration Services (SSIS)

Data flow and control flow


In the context of data pipelines, the control flow ensures orderly processing of a set of tasks. To enforce the correct
processing order of these tasks, precedence constraints are used. You can think of these constraints as connectors
in a workflow diagram, as shown in the image below. Each task has an outcome, such as success, failure, or
completion. Any subsequent task does not initiate processing until its predecessor has completed with one of these
outcomes.
Control flows execute data flows as a task. In a data flow task, data is extracted from a source, transformed, or
loaded into a data store. The output of one data flow task can be the input to the next data flow task, and data flows
can run in parallel. Unlike control flows, you cannot add constraints between tasks in a data flow. You can, however,
add a data viewer to observe the data as it is processed by each task.

In the diagram above, there are several tasks within the control flow, one of which is a data flow task. One of the
tasks is nested within a container. Containers can be used to provide structure to tasks, providing a unit of work.
One such example is for repeating elements within a collection, such as files in a folder or database statements.
Relevant Azure service:
Azure Data Factory v2
Other tools:
SQL Server Integration Services (SSIS)

Technology choices
Online Transaction Processing (OLTP) data stores
Online Analytical Processing (OLAP) data stores
Data warehouses
Pipeline orchestration

Next steps
The following reference architectures show end-to-end ELT pipelines on Azure:
Enterprise BI in Azure with Azure Synapse
Automated enterprise BI with Azure Synapse and Azure Data Factory
Online analytical processing (OLAP)
11/2/2020 • 8 minutes to read

Online analytical processing (OLAP) is a technology that organizes large business databases and supports complex
analysis. It can be used to perform complex analytical queries without negatively affecting transactional systems.
The databases that a business uses to store all its transactions and records are called online transaction processing
(OLTP) databases. These databases usually have records that are entered one at a time. Often they contain a great
deal of information that is valuable to the organization. The databases that are used for OLTP, however, were not
designed for analysis. Therefore, retrieving answers from these databases is costly in terms of time and effort.
OLAP systems were designed to help extract this business intelligence information from the data in a highly
performant way. This is because OLAP databases are optimized for heavy read, low write workloads.

Semantic modeling
A semantic data model is a conceptual model that describes the meaning of the data elements it contains.
Organizations often have their own terms for things, sometimes with synonyms, or even different meanings for the
same term. For example, an inventory database might track a piece of equipment with an asset ID and a serial
number, but a sales database might refer to the serial number as the asset ID. There is no simple way to relate these
values without a model that describes the relationship.
Semantic modeling provides a level of abstraction over the database schema, so that users don't need to know the
underlying data structures. This makes it easier for end users to query data without performing aggregates and
joins over the underlying schema. Also, usually columns are renamed to more user-friendly names, so that the
context and meaning of the data are more obvious.
Semantic modeling is predominately used for read-heavy scenarios, such as analytics and business intelligence
(OLAP), as opposed to more write-heavy transactional data processing (OLTP). This is mostly due to the nature of a
typical semantic layer:
Aggregation behaviors are set so that reporting tools display them properly.
Business logic and calculations are defined.
Time-oriented calculations are included.
Data is often integrated from multiple sources.
Traditionally, the semantic layer is placed over a data warehouse for these reasons.
There are two primary types of semantic models:
Tabular. Uses relational modeling constructs (model, tables, columns). Internally, metadata is inherited from
OLAP modeling constructs (cubes, dimensions, measures). Code and script use OLAP metadata.
Multidimensional. Uses traditional OLAP modeling constructs (cubes, dimensions, measures).
Relevant Azure service:
Azure Analysis Services

Example use case


An organization has data stored in a large database. It wants to make this data available to business users and
customers to create their own reports and do some analysis. One option is just to give those users direct access to
the database. However, there are several drawbacks to doing this, including managing security and controlling
access. Also, the design of the database, including the names of tables and columns, may be hard for a user to
understand. Users would need to know which tables to query, how those tables should be joined, and other
business logic that must be applied to get the correct results. Users would also need to know a query language like
SQL even to get started. Typically this leads to multiple users reporting the same metrics but with different results.
Another option is to encapsulate all of the information that users need into a semantic model. The semantic model
can be more easily queried by users with a reporting tool of their choice. The data provided by the semantic model
is pulled from a data warehouse, ensuring that all users see a single version of the truth. The semantic model also
provides friendly table and column names, relationships between tables, descriptions, calculations, and row-level
security.

Typical traits of semantic modeling


Semantic modeling and analytical processing tends to have the following traits:

Schema: Schema on write, strongly enforced
Uses transactions: No
Locking strategy: None
Updateable: No (typically requires recomputing cube)
Appendable: No (typically requires recomputing cube)
Workload: Heavy reads, read-only
Indexing: Multidimensional indexing
Datum size: Small to medium sized
Model: Multidimensional
Data shape: Cube or star/snowflake schema
Query flexibility: Highly flexible
Scale: Large (10s-100s GBs)

When to use this solution


Consider OLAP in the following scenarios:
You need to execute complex analytical and ad hoc queries rapidly, without negatively affecting your OLTP
systems.
You want to provide business users with a simple way to generate reports from your data.
You want to provide a number of aggregations that will allow users to get fast, consistent results.
OLAP is especially useful for applying aggregate calculations over large amounts of data. OLAP systems are
optimized for read-heavy scenarios, such as analytics and business intelligence. OLAP allows users to segment
multi-dimensional data into slices that can be viewed in two dimensions (such as a pivot table) or filter the data by
specific values. This process is sometimes called "slicing and dicing" the data, and can be done regardless of
whether the data is partitioned across several data sources. This helps users to find trends, spot patterns, and
explore the data without having to know the details of traditional data analysis.
Semantic models can help business users abstract relationship complexities and make it easier to analyze data
quickly.

Challenges
For all the benefits OLAP systems provide, they do produce a few challenges:
Whereas data in OLTP systems is constantly updated through transactions flowing in from various sources,
OLAP data stores are typically refreshed at much slower intervals, depending on business needs. This means
OLAP systems are better suited for strategic business decisions, rather than immediate responses to changes.
Also, some level of data cleansing and orchestration needs to be planned to keep the OLAP data stores up-to-
date.
Unlike traditional, normalized, relational tables found in OLTP systems, OLAP data models tend to be
multidimensional. This makes it difficult or impossible to directly map to entity-relationship or object-oriented
models, where each attribute is mapped to one column. Instead, OLAP systems typically use a star or snowflake
schema in place of traditional normalization.

OLAP in Azure
In Azure, data held in OLTP systems such as Azure SQL Database is copied into the OLAP system, such as Azure
Analysis Services. Data exploration and visualization tools like Power BI, Excel, and third-party options connect to
Analysis Services servers and provide users with highly interactive and visually rich insights into the modeled data.
The flow of data from OLTP data to OLAP is typically orchestrated using SQL Server Integration Services, which can
be executed using Azure Data Factory.
In Azure, all of the following data stores will meet the core requirements for OLAP:
SQL Server with Columnstore indexes
Azure Analysis Services
SQL Server Analysis Services (SSAS)
SQL Server Analysis Services (SSAS) offers OLAP and data mining functionality for business intelligence
applications. You can either install SSAS on local servers, or host within a virtual machine in Azure. Azure Analysis
Services is a fully managed service that provides the same major features as SSAS. Azure Analysis Services
supports connecting to various data sources in the cloud and on-premises in your organization.
Clustered Columnstore indexes are available in SQL Server 2014 and above, as well as Azure SQL Database, and
are ideal for OLAP workloads. However, beginning with SQL Server 2016 (including Azure SQL Database), you can
take advantage of hybrid transactional/analytics processing (HTAP) through the use of updateable nonclustered
columnstore indexes. HTAP enables you to perform OLTP and OLAP processing on the same platform, which
removes the need to store multiple copies of your data, and eliminates the need for distinct OLTP and OLAP
systems. For more information, see Get started with Columnstore for real-time operational analytics.

Key selection criteria


To narrow the choices, start by answering these questions:
Do you want a managed service rather than managing your own servers?
Do you require secure authentication using Azure Active Directory (Azure AD)?
Do you want to conduct real-time analytics? If so, narrow your options to those that support real-time
analytics.
Real-time analytics in this context applies to a single data source, such as an enterprise resource planning
(ERP) application, that will run both an operational and an analytics workload. If you need to integrate data
from multiple sources, or require extreme analytics performance by using pre-aggregated data such as
cubes, you might still require a separate data warehouse.
Do you need to use pre-aggregated data, for example to provide semantic models that make analytics more
business user friendly? If yes, choose an option that supports multidimensional cubes or tabular semantic
models.
Providing aggregates can help users consistently calculate data aggregates. Pre-aggregated data can also
provide a large performance boost when dealing with several columns across many rows. Data can be pre-
aggregated in multidimensional cubes or tabular semantic models.
Do you need to integrate data from several sources, beyond your OLTP data store? If so, consider options
that easily integrate multiple data sources.

Capability matrix
The following tables summarize the key differences in capabilities.
General capabilities
| Capability | Azure Analysis Services | SQL Server Analysis Services | SQL Server with Columnstore Indexes | Azure SQL Database with Columnstore Indexes |
| --- | --- | --- | --- | --- |
| Is managed service | Yes | No | No | Yes |
| Supports multidimensional cubes | No | Yes | No | No |
| Supports tabular semantic models | Yes | Yes | No | No |
| Easily integrate multiple data sources | Yes | Yes | No [1] | No [1] |
| Supports real-time analytics | No | No | Yes | Yes |
| Requires process to copy data from source(s) | Yes | Yes | No | No |
| Azure AD integration | Yes | No | No [2] | Yes |

[1] Although SQL Server and Azure SQL Database cannot be used to query from and integrate multiple external
data sources, you can still build a pipeline that does this for you using SSIS or Azure Data Factory. SQL Server
hosted in an Azure VM has additional options, such as linked servers and PolyBase. For more information, see
Pipeline orchestration, control flow, and data movement.
[2] Connecting to SQL Server running on an Azure Virtual Machine is not supported using an Azure AD account.
Use a domain Active Directory account instead.
Scalability capabilities

| Capability | Azure Analysis Services | SQL Server Analysis Services | SQL Server with Columnstore Indexes | Azure SQL Database with Columnstore Indexes |
| --- | --- | --- | --- | --- |
| Redundant regional servers for high availability | Yes | No | Yes | Yes |
| Supports query scale out | Yes | No | Yes | Yes |
| Dynamic scalability (scale up) | Yes | No | Yes | Yes |
Online transaction processing (OLTP)

The management of transactional data using computer systems is referred to as online transaction processing
(OLTP). OLTP systems record business interactions as they occur in the day-to-day operation of the organization,
and support querying of this data to make inferences.

Transactional data
Transactional data is information that tracks the interactions related to an organization's activities. These
interactions are typically business transactions, such as payments received from customers, payments made to
suppliers, products moving through inventory, orders taken, or services delivered. Transactional events, which
represent the transactions themselves, typically contain a time dimension, some numerical values, and references
to other data.
Transactions typically need to be atomic and consistent. Atomicity means that an entire transaction always
succeeds or fails as one unit of work, and is never left in a half-completed state. If a transaction cannot be
completed, the database system must roll back any steps that were already done as part of that transaction. In a
traditional RDBMS, this rollback happens automatically if a transaction cannot be completed. Consistency means
that transactions always leave the data in a valid state. (These are very informal descriptions of atomicity and
consistency. There are more formal definitions of these properties, such as ACID.)
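
To make the atomicity guarantee concrete, here is a minimal, self-contained Python sketch using the standard library's sqlite3 module (chosen only because it needs no server; the same pattern applies to any RDBMS driver, including those for the Azure databases discussed later). A funds transfer either commits both updates or rolls back entirely.

```python
# Minimal sketch of an atomic transaction: either both account updates
# succeed and are committed, or the whole unit of work is rolled back.
# Uses sqlite3 from the standard library purely for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL NOT NULL CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts (id, balance) VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

def transfer(conn, from_id, to_id, amount):
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, from_id))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, to_id))
        conn.commit()          # both updates become visible together
    except sqlite3.Error:
        conn.rollback()        # neither update persists; data stays consistent
        raise

transfer(conn, 1, 2, 30.0)                     # succeeds
try:
    transfer(conn, 1, 2, 500.0)                # violates the CHECK constraint
except sqlite3.IntegrityError:
    pass
print(conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())
# -> [(1, 70.0), (2, 80.0)]; the failed transfer left no partial changes
```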
Transactional databases can support strong consistency for transactions using various locking strategies, such as
pessimistic locking, to ensure that all data is strongly consistent within the context of the enterprise, for all users
and processes.
The most common deployment architecture that uses transactional data is the data store tier in a 3-tier
architecture. A 3-tier architecture typically consists of a presentation tier, business logic tier, and data store tier. A
related deployment architecture is the N-tier architecture, which may have multiple middle-tiers handling business
logic.

Typical traits of transactional data


Transactional data tends to have the following traits:

| Requirement | Description |
| --- | --- |
| Normalization | Highly normalized |
| Schema | Schema on write, strongly enforced |
| Consistency | Strong consistency, ACID guarantees |
| Integrity | High integrity |
| Uses transactions | Yes |
| Locking strategy | Pessimistic or optimistic |
| Updateable | Yes |
| Appendable | Yes |
| Workload | Heavy writes, moderate reads |
| Indexing | Primary and secondary indexes |
| Datum size | Small to medium sized |
| Model | Relational |
| Data shape | Tabular |
| Query flexibility | Highly flexible |
| Scale | Small (MBs) to Large (a few TBs) |

When to use this solution


Choose OLTP when you need to efficiently process and store business transactions and immediately make them
available to client applications in a consistent way. Use this architecture when any tangible delay in processing
would have a negative impact on the day-to-day operations of the business.
OLTP systems are designed to efficiently process and store transactions, as well as query transactional data. The
goal of efficiently processing and storing individual transactions by an OLTP system is partly accomplished by data
normalization — that is, breaking the data up into smaller chunks that are less redundant. This supports efficiency
because it enables the OLTP system to process large numbers of transactions independently, and avoids extra
processing needed to maintain data integrity in the presence of redundant data.
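
The following sketch shows what that normalization looks like in practice: a hypothetical order is split into customers, orders, and order_lines tables so that customer details are stored once and referenced by key rather than repeated on every order row. It again uses the standard library's sqlite3 module only for illustration; the schema and names are assumptions for the example.

```python
# Minimal sketch of a normalized OLTP schema: repeated facts (such as customer
# details) live in their own tables and are referenced by key, rather than
# being duplicated on every order row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    email       TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    order_date  TEXT NOT NULL
);
CREATE TABLE order_lines (
    order_id    INTEGER NOT NULL REFERENCES orders(order_id),
    line_number INTEGER NOT NULL,
    product     TEXT NOT NULL,
    quantity    INTEGER NOT NULL,
    unit_price  REAL NOT NULL,
    PRIMARY KEY (order_id, line_number)
);
INSERT INTO customers VALUES (1, 'Contoso Ltd.', 'orders@contoso.example');
INSERT INTO orders VALUES (1001, 1, '2020-11-02');
INSERT INTO order_lines VALUES (1001, 1, 'Widget', 3, 9.99), (1001, 2, 'Gadget', 1, 24.50);
""")

# Reassembling the order for a client application requires joins, which is the
# trade-off of normalization described in the Challenges section below.
rows = conn.execute("""
SELECT c.name, o.order_id, l.product, l.quantity, l.unit_price
FROM orders o
JOIN customers c   ON c.customer_id = o.customer_id
JOIN order_lines l ON l.order_id = o.order_id
""").fetchall()
print(rows)
```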

Challenges
Implementing and using an OLTP system can create a few challenges:
OLTP systems are not always good for handling aggregates over large amounts of data, although there are
exceptions, such as a well-planned SQL Server-based solution. Analytics against the data that rely on aggregate
calculations over millions of individual transactions are very resource intensive for an OLTP system. They can
be slow to execute and can cause a slow-down by blocking other transactions in the database.
When conducting analytics and reporting on data that is highly normalized, the queries tend to be complex,
because most queries need to de-normalize the data by using joins. Also, naming conventions for database
objects in OLTP systems tend to be terse and succinct. The increased normalization coupled with terse naming
conventions makes OLTP systems difficult for business users to query, without the help of a DBA or data
developer.
Storing the history of transactions indefinitely and storing too much data in any one table can lead to slow
query performance, depending on the number of transactions stored. The common solution is to maintain a
relevant window of time (such as the current fiscal year) in the OLTP system and offload historical data to other
systems, such as a data mart or data warehouse.

OLTP in Azure
Applications such as websites hosted in App Service Web Apps, REST APIs running in App Service, or mobile or
desktop applications communicate with the OLTP system, typically via a REST API intermediary.
In practice, most workloads are not purely OLTP. There tends to be an analytical component as well. In addition,
there is an increasing demand for real-time reporting, such as running reports against the operational system. This
is also referred to as HTAP (Hybrid Transactional and Analytical Processing). For more information, see Online
Analytical Processing (OLAP).
In Azure, all of the following data stores will meet the core requirements for OLTP and the management of
transaction data:
Azure SQL Database
SQL Server in an Azure virtual machine
Azure Database for MySQL
Azure Database for PostgreSQL

Key selection criteria


To narrow the choices, start by answering these questions:
Do you want a managed service rather than managing your own servers?
Does your solution have specific dependencies for Microsoft SQL Server, MySQL or PostgreSQL
compatibility? Your application may limit the data stores you can choose based on the drivers it supports for
communicating with the data store, or the assumptions it makes about which database is used.
Are your write throughput requirements particularly high? If yes, choose an option that provides in-
memory tables.
Is your solution multitenant? If so, consider options that support capacity pools, where multiple database
instances draw from an elastic pool of resources, instead of fixed resources per database. This can help you
better distribute capacity across all database instances, and can make your solution more cost effective.
Does your data need to be readable with low latency in multiple regions? If yes, choose an option that
supports readable secondary replicas.
Does your database need to be highly available across geographic regions? If yes, choose an option that
supports geographic replication. Also consider the options that support automatic failover from the primary
replica to a secondary replica.
Does your database have specific security needs? If yes, examine the options that provide capabilities like
row level security, data masking, and transparent data encryption.

Capability matrix
The following tables summarize the key differences in capabilities.
General capabilities
| Capability | Azure SQL Database | SQL Server in an Azure Virtual Machine | Azure Database for MySQL | Azure Database for PostgreSQL |
| --- | --- | --- | --- | --- |
| Is managed service | Yes | No | Yes | Yes |
| Runs on platform | N/A | Windows, Linux, Docker | N/A | N/A |
| Programmability [1] | T-SQL, .NET, R | T-SQL, .NET, R, Python | SQL | SQL, PL/pgSQL |
[1] Not including client driver support, which allows many programming languages to connect to and use the
OLTP data store.
Scalability capabilities
| Capability | Azure SQL Database | SQL Server in an Azure Virtual Machine | Azure Database for MySQL | Azure Database for PostgreSQL |
| --- | --- | --- | --- | --- |
| Maximum database instance size | 4 TB | 256 TB | 16 TB | 16 TB |
| Supports capacity pools | Yes | Yes | No | No |
| Supports cluster scale out | No | Yes | No | No |
| Dynamic scalability (scale up) | Yes | No | Yes | Yes |

Analytic workload capabilities


| Capability | Azure SQL Database | SQL Server in an Azure Virtual Machine | Azure Database for MySQL | Azure Database for PostgreSQL |
| --- | --- | --- | --- | --- |
| Temporal tables | Yes | Yes | No | No |
| In-memory (memory-optimized) tables | Yes | Yes | No | No |
| Columnstore support | Yes | Yes | No | No |
| Adaptive query processing | Yes | Yes | No | No |

Availability capabilities
| Capability | Azure SQL Database | SQL Server in an Azure Virtual Machine | Azure Database for MySQL | Azure Database for PostgreSQL |
| --- | --- | --- | --- | --- |
| Readable secondaries | Yes | Yes | Yes | Yes |
| Geographic replication | Yes | Yes | Yes | Yes |
| Automatic failover to secondary | Yes | No | No | No |
| Point-in-time restore | Yes | Yes | Yes | Yes |

Security capabilities
| Capability | Azure SQL Database | SQL Server in an Azure Virtual Machine | Azure Database for MySQL | Azure Database for PostgreSQL |
| --- | --- | --- | --- | --- |
| Row level security | Yes | Yes | Yes | Yes |
| Data masking | Yes | Yes | No | No |
| Transparent data encryption | Yes | Yes | Yes | Yes |
| Restrict access to specific IP addresses | Yes | Yes | Yes | Yes |
| Restrict access to allow VNet access only | Yes | Yes | Yes | Yes |
| Azure Active Directory authentication | Yes | No | Yes | Yes |
| Active Directory authentication | No | Yes | No | No |
| Multi-factor authentication | Yes | No | Yes | Yes |
| Supports Always Encrypted | Yes | Yes | No | No |
| Private IP | No | Yes | No | No |
Data warehousing

A data warehouse is a centralized repository of integrated data from one or more disparate sources. Data
warehouses store current and historical data and are used for reporting and analysis of the data.

To move data into a data warehouse, data is periodically extracted from various sources that contain important
business information. As the data is moved, it can be formatted, cleaned, validated, summarized, and reorganized.
Alternatively, the data can be stored in the lowest level of detail, with aggregated views provided in the warehouse
for reporting. In either case, the data warehouse becomes a permanent data store for reporting, analysis, and
business intelligence (BI).

Data warehouse architectures


The following reference architectures show end-to-end data warehouse architectures on Azure:
Enterprise BI in Azure with Azure Synapse Analytics. This reference architecture implements an extract, load,
and transform (ELT) pipeline that moves data from an on-premises SQL Server database into Azure Synapse.
Automated enterprise BI with Azure Synapse and Azure Data Factory. This reference architecture shows an ELT
pipeline with incremental loading, automated using Azure Data Factory.

When to use this solution


Choose a data warehouse when you need to turn massive amounts of data from operational systems into a
format that is easy to understand. Data warehouses don't need to follow the same terse data structure you may be
using in your OLTP databases. You can use column names that make sense to business users and analysts,
restructure the schema to simplify relationships, and consolidate several tables into one. These steps help guide
users who need to create reports and analyze the data in BI systems, without the help of a database administrator
(DBA) or data developer.
Consider using a data warehouse when you need to keep historical data separate from the source transaction
systems for performance reasons. Data warehouses make it easy to access historical data from multiple locations,
by providing a centralized location using common formats, keys, and data models.
Because data warehouses are optimized for read access, generating reports is faster than using the source
transaction system for reporting.
Other benefits include:
The data warehouse can store historical data from multiple sources, representing a single source of truth.
You can improve data quality by cleaning up data as it is imported into the data warehouse.
Reporting tools don't compete with the transactional systems for query processing cycles. A data warehouse
allows the transactional system to focus on handling writes, while the data warehouse satisfies the majority of
read requests.
A data warehouse can consolidate data from different software.
Data mining tools can find hidden patterns in the data using automatic methodologies.
Data warehouses make it easier to provide secure access to authorized users, while restricting access to others.
Business users don't need access to the source data, removing a potential attack vector.
Data warehouses make it easier to create business intelligence solutions, such as OLAP cubes.

Challenges
Properly configuring a data warehouse to fit the needs of your business can bring some of the following
challenges:
Committing the time required to properly model your business concepts. Data warehouses are information
driven. You must standardize business-related terms and common formats, such as currency and dates. You
also need to restructure the schema in a way that makes sense to business users but still ensures accuracy
of data aggregates and relationships.
Planning and setting up your data orchestration. Consider how to copy data from the source transactional
system to the data warehouse, and when to move historical data from operational data stores into the
warehouse.
Maintaining or improving data quality by cleaning the data as it is imported into the warehouse.

Data warehousing in Azure


You may have one or more sources of data, whether from customer transactions or business applications. This
data is traditionally stored in one or more OLTP databases. The data could be persisted in other storage mediums
such as network shares, Azure Storage Blobs, or a data lake. The data could also be stored by the data warehouse
itself or in a relational database such as Azure SQL Database. The purpose of the analytical data store layer is to
satisfy queries issued by analytics and reporting tools against the data warehouse. In Azure, this analytical store
capability can be met with Azure Synapse, or with Azure HDInsight using Hive or Interactive Query. In addition,
you will need some level of orchestration to move or copy data from data storage to the data warehouse, which
can be done using Azure Data Factory or Oozie on Azure HDInsight.
There are several options for implementing a data warehouse in Azure, depending on your needs. The following
lists are broken into two categories, symmetric multiprocessing (SMP) and massively parallel processing (MPP).
SMP:
Azure SQL Database
SQL Server in a virtual machine
MPP:
Azure Synapse Analytics (formerly Azure SQL Data Warehouse)
Apache Hive on HDInsight
Interactive Query (Hive LLAP) on HDInsight
As a general rule, SMP-based warehouses are best suited for small to medium data sets (up to 4-100 TB), while
MPP is often used for big data. The delineation between small/medium and big data partly has to do with your
organization's definition and supporting infrastructure. (See Choosing an OLTP data store.)
Beyond data sizes, the type of workload pattern is likely to be a greater determining factor. For example, complex
queries may be too slow for an SMP solution, and require an MPP solution instead. MPP-based systems usually
have a performance penalty with small data sizes, because of how jobs are distributed and consolidated across
nodes. If your data sizes already exceed 1 TB and are expected to continually grow, consider selecting an MPP
solution. However, if your data sizes are smaller, but your workloads are exceeding the available resources of your
SMP solution, then MPP may be your best option as well.
The data accessed or stored by your data warehouse could come from a number of data sources, including a data
lake, such as Azure Data Lake Storage. For a video session that compares the different strengths of MPP services
that can use Azure Data Lake, see Azure Data Lake and Azure Data Warehouse: Applying Modern Practices to Your
App.
SMP systems are characterized by a single instance of a relational database management system sharing all
resources (CPU/Memory/Disk). You can scale up an SMP system. For SQL Server running on a VM, you can scale
up the VM size. For Azure SQL Database, you can scale up by selecting a different service tier.
MPP systems can be scaled out by adding more compute nodes (which have their own CPU, memory, and I/O
subsystems). There are physical limitations to scaling up a server, at which point scaling out is more desirable,
depending on the workload. However, the differences in querying, modeling, and data partitioning mean that MPP
solutions require a different skill set.
When deciding which SMP solution to use, see A closer look at Azure SQL Database and SQL Server on Azure
VMs.
Azure Synapse (formerly Azure SQL Data Warehouse) can also be used for small and medium datasets, where the
workload is compute and memory intensive. Read more about Azure Synapse patterns and common scenarios:
Azure SQL Data Warehouse Workload Patterns and Anti-Patterns
Azure SQL Data Warehouse loading patterns and strategies
Migrating data to Azure SQL Data Warehouse in practice
Common ISV application patterns using Azure SQL Data Warehouse

Key selection criteria


To narrow the choices, start by answering these questions:
Do you want a managed service rather than managing your own servers?
Are you working with extremely large data sets or highly complex, long-running queries? If yes, consider an
MPP option.
For a large data set, is the data source structured or unstructured? Unstructured data may need to be
processed in a big data environment such as Spark on HDInsight, Azure Databricks, Hive LLAP on
HDInsight, or Azure Data Lake Analytics. All of these can serve as ELT (Extract, Load, Transform) and ETL
(Extract, Transform, Load) engines. They can output the processed data into structured data, making it easier
to load into Azure Synapse or one of the other options. For structured data, Azure Synapse has a
performance tier called Optimized for Compute, for compute-intensive workloads requiring ultra-high
performance.
Do you want to separate your historical data from your current, operational data? If so, select one of the
options where orchestration is required. These are standalone warehouses optimized for heavy read access,
and are best suited as a separate historical data store.
Do you need to integrate data from several sources, beyond your OLTP data store? If so, consider options
that easily integrate multiple data sources.
Do you have a multitenancy requirement? If so, Azure Synapse is not ideal for this requirement. For more
information, see Azure Synapse Patterns and Anti-Patterns.
Do you prefer a relational data store? If so, choose an option with a relational data store, but also note that
you can use a tool like PolyBase to query non-relational data stores if needed. If you decide to use PolyBase,
however, run performance tests against your unstructured data sets for your workload.
Do you have real-time reporting requirements? If you require rapid query response times on high volumes
of singleton inserts, choose an option that supports real-time reporting.
Do you need to support a large number of concurrent users and connections? The ability to support a
number of concurrent users/connections depends on several factors.
For Azure SQL Database, refer to the documented resource limits based on your service tier.
SQL Server allows a maximum of 32,767 user connections. When running on a VM, performance
will depend on the VM size and other factors.
Azure Synapse has limits on concurrent queries and concurrent connections. For more information,
see Concurrency and workload management in Azure Synapse. Consider using complementary
services, such as Azure Analysis Services, to overcome limits in Azure Synapse.
What sort of workload do you have? In general, MPP-based warehouse solutions are best suited for
analytical, batch-oriented workloads. If your workloads are transactional by nature, with many small
read/write operations or multiple row-by-row operations, consider using one of the SMP options. One
exception to this guideline is when using stream processing on an HDInsight cluster, such as Spark
Streaming, and storing the data within a Hive table.

Capability matrix
The following tables summarize the key differences in capabilities.
General capabilities
| Capability | Azure SQL Database | SQL Server (VM) | Azure Synapse | Apache Hive on HDInsight | Hive LLAP on HDInsight |
| --- | --- | --- | --- | --- | --- |
| Is managed service | Yes | No | Yes | Yes [1] | Yes [1] |
| Requires data orchestration (holds copy of data/historical data) | No | No | Yes | Yes | Yes |
| Easily integrate multiple data sources | No | No | Yes | Yes | Yes |
| Supports pausing compute | No | No | Yes | No [2] | No [2] |
| Relational data store | Yes | Yes | Yes | No | No |
| Real-time reporting | Yes | Yes | No | No | Yes |
| Flexible backup restore points | Yes | Yes | No [3] | Yes [4] | Yes [4] |
| SMP/MPP | SMP | SMP | MPP | MPP | MPP |

[1] Manual configuration and scaling.


[2] HDInsight clusters can be deleted when not needed, and then re-created. Attach an external data store to your
cluster so your data is retained when you delete your cluster. You can use Azure Data Factory to automate your
cluster's lifecycle by creating an on-demand HDInsight cluster to process your workload, then delete it once the
processing is complete.
[3] With Azure Synapse, you can restore a database to any available restore point within the last seven days.
Snapshots start every four to eight hours and are available for seven days. When a snapshot is older than seven
days, it expires and its restore point is no longer available.
[4] Consider using an external Hive metastore that can be backed up and restored as needed. Standard backup and
restore options that apply to Blob Storage or Data Lake Storage can be used for the data, or third-party HDInsight
backup and restore solutions, such as Imanis Data can be used for greater flexibility and ease of use.
Scalability capabilities
| Capability | Azure SQL Database | SQL Server (VM) | Azure Synapse | Apache Hive on HDInsight | Hive LLAP on HDInsight |
| --- | --- | --- | --- | --- | --- |
| Redundant regional servers for high availability | Yes | Yes | Yes | No | No |
| Supports query scale out (distributed queries) | No | No | Yes | Yes | Yes |
| Dynamic scalability | Yes | No | Yes [1] | No | No |
| Supports in-memory caching of data | Yes | Yes | Yes | Yes | Yes |

[1] Azure Synapse allows you to scale up or down by adjusting the number of data warehouse units (DWUs). See
Manage compute power in Azure Synapse.
Security capabilities
| Capability | Azure SQL Database | SQL Server in a Virtual Machine | Azure Synapse | Apache Hive on HDInsight | Hive LLAP on HDInsight |
| --- | --- | --- | --- | --- | --- |
| Authentication | SQL / Azure Active Directory (Azure AD) | SQL / Azure AD / Active Directory | SQL / Azure AD | local / Azure AD [1] | local / Azure AD [1] |
| Authorization | Yes | Yes | Yes | Yes | Yes [1] |
| Auditing | Yes | Yes | Yes | Yes | Yes [1] |
| Data encryption at rest | Yes [2] | Yes [2] | Yes [2] | Yes [2] | Yes [1] |
| Row-level security | Yes | Yes | Yes | No | Yes [1] |
| Supports firewalls | Yes | Yes | Yes | Yes | Yes [3] |
| Dynamic data masking | Yes | Yes | Yes | No | Yes [1] |

[1] Requires using a domain-joined HDInsight cluster.


[2] Requires using Transparent Data Encryption (TDE) to encrypt and decrypt your data at rest.
[3] Supported when used within an Azure Virtual Network.
Read more about securing your data warehouse:
Securing your SQL Database
Secure a database in Azure Synapse
Extend Azure HDInsight using an Azure Virtual Network
Enterprise-level Hadoop security with domain-joined HDInsight clusters
Non-relational data and NoSQL

A non-relational database is a database that does not use the tabular schema of rows and columns found in most
traditional database systems. Instead, non-relational databases use a storage model that is optimized for the
specific requirements of the type of data being stored. For example, data may be stored as simple key/value pairs,
as JSON documents, or as a graph consisting of edges and vertices.
What all of these data stores have in common is that they don't use a relational model. Also, they tend to be more
specific in the type of data they support and how data can be queried. For example, time series data stores are
optimized for queries over time-based sequences of data, while graph data stores are optimized for exploring
weighted relationships between entities. Neither format would generalize well to the task of managing
transactional data.
The term NoSQL refers to data stores that do not use SQL for queries, and instead use other programming
languages and constructs to query the data. In practice, "NoSQL" means "non-relational database," even though
many of these databases do support SQL-compatible queries. However, the underlying query execution strategy is
usually very different from the way a traditional RDBMS would execute the same SQL query.
The following sections describe the major categories of non-relational or NoSQL database.

Document data stores


A document data store manages a set of named string fields and object data values in an entity referred to as a
document. These data stores typically store data in the form of JSON documents. Each field value could be a scalar
item, such as a number, or a compound element, such as a list or a parent-child collection. The data in the fields of a
document can be encoded in a variety of ways, including XML, YAML, JSON, BSON, or even stored as plain text. The
fields within documents are exposed to the storage management system, enabling an application to query and
filter data by using the values in these fields.
Typically, a document contains the entire data for an entity. What items constitute an entity are application-specific.
For example, an entity could contain the details of a customer, an order, or a combination of both. A single
document might contain information that would be spread across several relational tables in a relational database
management system (RDBMS). A document store does not require that all documents have the same structure. This
free-form approach provides a great deal of flexibility. For example, applications can store different data in
documents in response to a change in business requirements.
The application can retrieve documents by using the document key. This is a unique identifier for the document,
which is often hashed, to help distribute data evenly. Some document databases create the document key
automatically. Others enable you to specify an attribute of the document to use as the key. The application can also
query documents based on the value of one or more fields. Some document databases support indexing to
facilitate fast lookup of documents based on one or more indexed fields.
Many document databases support in-place updates, enabling an application to modify the values of specific fields
in a document without rewriting the entire document. Read and write operations over multiple fields in a single
document are typically atomic.
Relevant Azure service:
Azure Cosmos DB
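
As a brief illustration of the document model, the sketch below uses the azure-cosmos Python SDK (v4) to store and query a free-form order document in Azure Cosmos DB. The endpoint, key, database, container, and field names are assumptions for the example, not values required by the service.

```python
# Minimal sketch of a document data store: an entire order is stored as one
# JSON document and queried by field values. Assumes the azure-cosmos package
# and a Cosmos DB account (endpoint URL and key supplied by you).
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<your-account>.documents.azure.com:443/", credential="<your-key>")
database = client.create_database_if_not_exists("retail")
container = database.create_container_if_not_exists(
    id="orders", partition_key=PartitionKey(path="/customerId"))

# One document holds data that an RDBMS would spread across several tables.
container.upsert_item({
    "id": "order-1001",
    "customerId": "c-42",
    "orderDate": "2020-11-02",
    "lines": [
        {"product": "Widget", "quantity": 3, "unitPrice": 9.99},
        {"product": "Gadget", "quantity": 1, "unitPrice": 24.50},
    ],
})

# Query by a field value rather than by key.
for doc in container.query_items(
        query="SELECT c.id, c.orderDate FROM c WHERE c.customerId = @cid",
        parameters=[{"name": "@cid", "value": "c-42"}],
        enable_cross_partition_query=True):
    print(doc)
```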

Columnar data stores


A columnar or column-family data store organizes data into columns and rows. In its simplest form, a column-
family data store can appear very similar to a relational database, at least conceptually. The real power of a column-
family database lies in its denormalized approach to structuring sparse data, which stems from the column-
oriented approach to storing data.
You can think of a column-family data store as holding tabular data with rows and columns, but the columns are
divided into groups known as column families. Each column family holds a set of columns that are logically related
and are typically retrieved or manipulated as a unit. Other data that is accessed separately can be stored in separate
column families. Within a column family, new columns can be added dynamically, and rows can be sparse (that is, a
row doesn't need to have a value for every column).
The following diagram shows an example with two column families, Identity and Contact Info. The data for a
single entity has the same row key in each column family. This structure, where the rows for any given object in a
column family can vary dynamically, is an important benefit of the column-family approach, making this form of
data store highly suited for storing data with varying schemas.

Unlike a key/value store or a document database, most column-family databases physically store data in key order,
rather than by computing a hash. The row key is considered the primary index and enables key-based access via a
specific key or a range of keys. Some implementations allow you to create secondary indexes over specific columns
in a column family. Secondary indexes let you retrieve data by columns value, rather than row key.
On disk, all of the columns within a column family are stored together in the same file, with a certain number of
rows in each file. With large data sets, this approach creates a performance benefit by reducing the amount of data
that needs to be read from disk when only a few columns are queried together at a time.
Read and write operations for a row are typically atomic within a single column family, although some
implementations provide atomicity across the entire row, spanning multiple column families.
Relevant Azure service:
Cosmos DB Cassandra API
HBase in HDInsight

Key/value data stores


A key/value store is essentially a large hash table. You associate each data value with a unique key, and the
key/value store uses this key to store the data by using an appropriate hashing function. The hashing function is
selected to provide an even distribution of hashed keys across the data storage.
Most key/value stores only support simple query, insert, and delete operations. To modify a value (either partially
or completely), an application must overwrite the existing data for the entire value. In most implementations,
reading or writing a single value is an atomic operation. If the value is large, writing may take some time.
An application can store arbitrary data as a set of values, although some key/value stores impose limits on the
maximum size of values. The stored values are opaque to the storage system software. Any schema information
must be provided and interpreted by the application. Essentially, values are blobs and the key/value store simply
retrieves or stores the value by key.

Key/value stores are highly optimized for applications performing simple lookups using the value of the key, or by
a range of keys, but are less suitable for systems that need to query data across different tables of keys/values, such
as joining data across multiple tables.
Key/value stores are also not optimized for scenarios where querying or filtering by non-key values is important,
rather than performing lookups based only on keys. For example, with a relational database, you can find a record
by using a WHERE clause to filter the non-key columns, but key/value stores usually do not have this type of
lookup capability for values, or if they do, it requires a slow scan of all values.
A single key/value store can be extremely scalable, as the data store can easily distribute data across multiple nodes
on separate machines.
Relevant Azure services:
Azure Cosmos DB Table API
Azure Cache for Redis
Azure Table Storage
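
A minimal sketch of the key/value pattern follows, here against Azure Cache for Redis using the redis-py client. The host name, access key, and key-naming scheme are illustrative assumptions; note that the value is opaque to the store, and the application decides how to serialize it.

```python
# Minimal sketch of a key/value store: values are opaque blobs addressed only
# by key. Assumes the redis package and an Azure Cache for Redis instance
# (SSL on port 6380 with the access key as the password).
import json
import redis

cache = redis.StrictRedis(
    host="<your-cache>.redis.cache.windows.net",
    port=6380,
    password="<access-key>",
    ssl=True,
)

# The application chooses the serialization; the store just sees bytes.
cache.set("session:12345", json.dumps({"user": "c-42", "cart": ["Widget", "Gadget"]}), ex=3600)

raw = cache.get("session:12345")
session = json.loads(raw) if raw else None
print(session)

# Lookups other than "by key" (for example, "all sessions with Widget in the
# cart") are not something a key/value store supports efficiently.
```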

Graph data stores


A graph data store manages two types of information, nodes and edges. Nodes represent entities, and edges
specify the relationships between these entities. Both nodes and edges can have properties that provide
information about that node or edge, similar to columns in a table. Edges can also have a direction indicating the
nature of the relationship.
The purpose of a graph data store is to allow an application to efficiently perform queries that traverse the network
of nodes and edges, and to analyze the relationships between entities. The following diagram shows an
organization's personnel data structured as a graph. The entities are employees and departments, and the edges
indicate reporting relationships and the department in which employees work. In this graph, the arrows on the
edges show the direction of the relationships.

This structure makes it straightforward to perform queries such as "Find all employees who report directly or
indirectly to Sarah" or "Who works in the same department as John?" For large graphs with lots of entities and
relationships, you can perform complex analyses quickly. Many graph databases provide a query language that you
can use to traverse a network of relationships efficiently.
Relevant Azure service:
Azure Cosmos DB Graph API

Time series data stores


Time series data is a set of values organized by time, and a time series data store is optimized for this type of data.
Time series data stores must support a very high number of writes, as they typically collect large amounts of data
in real time from a large number of sources. Time series data stores are optimized for storing telemetry data.
Scenarios include IoT sensors or application/system counters. Updates are rare, and deletes are often done as bulk
operations.

Although the records written to a time series database are generally small, there are often a large number of
records, and total data size can grow rapidly. Time series data stores also handle out-of-order and late-arriving data,
automatic indexing of data points, and optimizations for queries described in terms of windows of time. This last
feature enables queries to run across millions of data points and multiple data streams quickly, in order to support
time series visualizations, which is a common way that time series data is consumed.
For more information, see Time series solutions
Relevant Azure services:
Azure Time Series Insights
OpenTSDB with HBase on HDInsight

Object data stores


Object data stores are optimized for storing and retrieving large binary objects or blobs such as images, text files,
video and audio streams, large application data objects and documents, and virtual machine disk images. An object
consists of the stored data, some metadata, and a unique ID for accessing the object. Object stores are designed to
support files that are individually very large, as well as provide large amounts of total storage to manage all files.

Some object data stores replicate a given blob across multiple server nodes, which enables fast parallel reads. This
in turn enables the scale-out querying of data contained in large files, because multiple processes, typically running
on different servers, can each query the large data file simultaneously.
One special case of object data stores is the network file share. Using file shares enables files to be accessed across
a network using standard networking protocols like server message block (SMB). Given appropriate security and
concurrent access control mechanisms, sharing data in this way can enable distributed services to provide highly
scalable data access for basic, low-level operations such as simple read and write requests.
Relevant Azure services:
Azure Blob Storage
Azure Data Lake Store
Azure File Storage
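
For completeness, here is a small sketch of treating Azure Blob Storage as an object store with the azure-storage-blob (v12) Python SDK: the object is the data plus metadata, addressed by a unique name. The connection string, container, blob, and metadata names are placeholders.

```python
# Minimal sketch of an object data store: upload a large binary object plus
# metadata, then read its properties back by its unique name. Assumes the
# azure-storage-blob package (v12) and a storage account connection string.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<your-connection-string>")
container = service.get_container_client("media")

with open("training-video.mp4", "rb") as data:
    container.upload_blob(
        name="videos/training-video.mp4",
        data=data,
        metadata={"department": "hr", "retention": "3y"},
        overwrite=True,
    )

# Retrieval is by the object's name; the store does not interpret the content.
blob = container.get_blob_client("videos/training-video.mp4")
properties = blob.get_blob_properties()
print(properties.size, properties.metadata)
```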

External index data stores


External index data stores provide the ability to search for information held in other data stores and services. An
external index acts as a secondary index for any data store, and can be used to index massive volumes of data and
provide near real-time access to these indexes.
For example, you might have text files stored in a file system. Finding a file by its file path is quick, but searching
based on the contents of the file would require a scan of all of the files, which is slow. An external index lets you
create secondary search indexes and then quickly find the path to the files that match your criteria. Another
example application of an external index is with key/value stores that only index by the key. You can build a
secondary index based on the values in the data, and quickly look up the key that uniquely identifies each matched
item.
The indexes are created by running an indexing process. This can be performed using a pull model, triggered by the
data store, or using a push model, initiated by application code. Indexes can be multidimensional and may support
free-text searches across large volumes of text data.
External index data stores are often used to support full text and web-based search. In these cases, searching can be
exact or fuzzy. A fuzzy search finds documents that match a set of terms and calculates how closely they match.
Some external indexes also support linguistic analysis that can return matches based on synonyms, genre
expansions (for example, matching "dogs" to "pets"), and stemming (for example, searching for "run" also matches
"ran" and "running").
Relevant Azure service:
Azure Search

Typical requirements
Non-relational data stores often use a different storage architecture from that used by relational databases.
Specifically, they tend toward having no fixed schema. Also, they tend not to support transactions, or else restrict
the scope of transactions, and they generally don't include secondary indexes for scalability reasons.
The following compares the requirements for each of the non-relational data stores:

| Requirement | Document data | Column-family data | Key/value data | Graph data |
| --- | --- | --- | --- | --- |
| Normalization | Denormalized | Denormalized | Denormalized | Normalized |
| Schema | Schema on read | Column families defined on write, column schema on read | Schema on read | Schema on read |
| Consistency (across concurrent transactions) | Tunable consistency, document-level guarantees | Column-family–level guarantees | Key-level guarantees | Graph-level guarantees |
| Atomicity (transaction scope) | Collection | Table | Table | Graph |
| Locking Strategy | Optimistic (lock free) | Pessimistic (row locks) | Optimistic (ETag) | |
| Access pattern | Random access | Aggregates on tall/wide data | Random access | Random access |
| Indexing | Primary and secondary indexes | Primary and secondary indexes | Primary index only | Primary and secondary indexes |
| Data shape | Document | Tabular with column families containing columns | Key and value | Graph containing edges and vertices |
| Sparse | Yes | Yes | Yes | No |
| Wide (lots of columns/attributes) | Yes | Yes | No | No |
| Datum size | Small (KBs) to medium (low MBs) | Medium (MBs) to Large (low GBs) | Small (KBs) | Small (KBs) |
| Overall Maximum Scale | Very Large (PBs) | Very Large (PBs) | Very Large (PBs) | Large (TBs) |

| Requirement | Time series data | Object data | External index data |
| --- | --- | --- | --- |
| Normalization | Normalized | Denormalized | Denormalized |
| Schema | Schema on read | Schema on read | Schema on write |
| Consistency (across concurrent transactions) | N/A | N/A | N/A |
| Atomicity (transaction scope) | N/A | Object | N/A |
| Locking Strategy | N/A | Pessimistic (blob locks) | N/A |
| Access pattern | Random access and aggregation | Sequential access | Random access |
| Indexing | Primary and secondary indexes | Primary index only | N/A |
| Data shape | Tabular | Blob and metadata | Document |
| Sparse | No | N/A | No |
| Wide (lots of columns/attributes) | No | Yes | Yes |
| Datum size | Small (KBs) | Large (GBs) to Very Large (TBs) | Small (KBs) |
| Overall Maximum Scale | Large (low TBs) | Very Large (PBs) | Large (low TBs) |
Processing free-form text for search

To support search, free-form text processing can be performed against documents containing paragraphs of text.
Text search works by constructing a specialized index that is precomputed against a collection of documents. A
client application submits a query that contains the search terms. The query returns a result set, consisting of a list
of documents sorted by how well each document matches the search criteria. The result set may also include the
context in which the document matches the criteria, which enables the application to highlight the matching phrase
in the document.

Free-form text processing can produce useful, actionable data from large amounts of noisy text data. The results
can give unstructured documents a well-defined and queryable structure.

Challenges
Processing a collection of free-form text documents is typically computationally intensive, as well as time
intensive.
In order to search free-form text effectively, the search index should support fuzzy search based on terms that
have a similar construction. For example, search indexes are built with lemmatization and linguistic stemming,
so that queries for "run" will match documents that contain "ran" and "running."

Architecture
In most scenarios, the source text documents are loaded into object storage such as Azure Storage or Azure Data
Lake Store. An exception is using full text search within SQL Server or Azure SQL Database. In this case, the
document data is loaded into tables managed by the database. Once stored, the documents are processed in a
batch to create the index.

Technology choices
Options for creating a search index include Azure Search, Elasticsearch, and HDInsight with Solr. Each of these
technologies can populate a search index from a collection of documents. Azure Search provides indexers that can
automatically populate the index for documents ranging from plain text to Excel and PDF formats. On HDInsight,
Apache Solr can index binary files of many types, including plain text, Word, and PDF. Once the index is constructed,
clients can access the search interface by means of a REST API.
If your text data is stored in SQL Server or Azure SQL Database, you can use the full-text search that is built into the
database. The database populates the index from text, binary, or XML data stored within the same database. Clients
search by using T-SQL queries.
For more information, see Search data stores.
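
As a hedged sketch of the query side, the snippet below uses the azure-search-documents Python client to submit search terms against an existing Azure Search index and read back the ranked results. The service endpoint, index name, query key, and field names are assumptions; building and populating the index is a separate step not shown here.

```python
# Minimal sketch of querying a prebuilt full-text search index in Azure Search.
# Assumes the azure-search-documents package, an existing index named "docs"
# with "title" and "content" fields, and a query key for the service.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",
    index_name="docs",
    credential=AzureKeyCredential("<query-key>"),
)

# Linguistic analysis in the index means "run" can also match "ran"/"running".
results = client.search(search_text="run", top=5)
for result in results:
    print(result["@search.score"], result["title"])
```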
Time series solutions

Time series data is a set of values organized by time. Examples of time series data include sensor data, stock prices,
click stream data, and application telemetry. Time series data can be analyzed for historical trends, real-time alerts,
or predictive modeling.

Time series data represents how an asset or process changes over time. The data has a timestamp, but more
importantly, time is the most meaningful axis for viewing or analyzing the data. Time series data typically arrives in
order of time and is usually treated as an insert rather than an update to your database. Because of this, change is
measured over time, enabling you to look backward and to predict future change. As such, time series data is best
visualized with scatter or line charts.

Some examples of time series data are:


Stock prices captured over time to detect trends.
Server performance, such as CPU usage, I/O load, memory usage, and network bandwidth consumption.
Telemetry from sensors on industrial equipment, which can be used to detect pending equipment failure and
trigger alert notifications.
Real-time car telemetry data including speed, braking, and acceleration over a time window to produce an
aggregate risk score for the driver.
In each of these cases, you can see how time is most meaningful as an axis. Displaying the events in the order in
which they arrived is a key characteristic of time series data, as there is a natural temporal ordering. This differs
from data captured for standard OLTP data pipelines where data can be entered in any order, and updated at any
time.

When to use this solution


Choose a time series solution when you need to ingest data whose strategic value is centered around changes over
a period of time, and you are primarily inserting new data and rarely updating, if at all. You can use this information
to detect anomalies, visualize trends, and compare current data to historical data, among other things. This type of
architecture is also best suited for predictive modeling and forecasting results, because you have the historical
record of changes over time, which can be applied to any number of forecasting models.
Using time series offers the following benefits:
Clearly represents how an asset or process changes over time.
Helps you quickly detect changes to a number of related sources, making anomalies and emerging trends
clearly stand out.
Best suited for predictive modeling and forecasting.
Internet of Things (IoT )
Data collected by IoT devices is a natural fit for time series storage and analysis. The incoming data is inserted and
rarely, if ever, updated. The data is timestamped and inserted in the order it was received, and this data is typically
displayed in chronological order, enabling users to discover trends, spot anomalies, and use the information for
predictive analysis.
For more information, see Internet of Things.
Real-time analytics
Time series data is often time sensitive — that is, it must be acted on quickly, to spot trends in real time or generate
alerts. In these scenarios, any delay in insights can cause downtime and business impact. In addition, there is often
a need to correlate data from a variety of different sources, such as sensors.
Ideally, you would have a stream processing layer that can handle the incoming data in real time and process all of
it with high precision and high granularity. This isn't always possible, depending on your streaming architecture
and the components of your stream buffering and stream processing layers. You may need to sacrifice some
precision of the time series data by reducing it. This is done by processing sliding time windows (several seconds,
for example), allowing the processing layer to perform calculations in a timely manner. You may also need to
downsample and aggregate your data when displaying longer periods of time, such as zooming to display data
captured over several months.
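
The downsampling idea above can be sketched with pandas: raw telemetry arriving every second is reduced to one-minute aggregates for display over longer windows. The column name, sampling rates, and simulated values are arbitrary choices for the example.

```python
# Minimal sketch of downsampling time series telemetry: per-second readings are
# aggregated into one-minute windows for display over longer time ranges.
import numpy as np
import pandas as pd

# Simulated raw telemetry: one temperature reading per second for an hour.
index = pd.date_range("2020-11-02 00:00:00", periods=3600, freq="S")
raw = pd.DataFrame({"temperature": 20 + np.random.randn(3600).cumsum() * 0.01}, index=index)

# Downsample: 3,600 raw points become 60 aggregated points per statistic.
per_minute = raw["temperature"].resample("1T").agg(["mean", "min", "max"])
print(per_minute.head())
```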

Challenges
Time series data is often very high volume, especially in IoT scenarios. Storing, indexing, querying, analyzing,
and visualizing time series data can be challenging.
It can be challenging to find the right combination of high-speed storage and powerful compute operations
for handling real-time analytics, while minimizing time to market and overall cost investment.

Architecture
In many scenarios that involve time series data, such as IoT, the data is captured in real time. As such, a real-time
processing architecture is appropriate.
Data from one or more data sources is ingested into the stream buffering layer by IoT Hub, Event Hubs, or Kafka on
HDInsight. Next, the data is processed in the stream processing layer that can optionally hand off the processed
data to a machine learning service for predictive analytics. The processed data is stored in an analytical data store,
such as HBase, Azure Cosmos DB, Azure Data Lake, or Blob Storage. An analytics and reporting application or
service, like Power BI, Azure Data Explorer, or OpenTSDB (if stored in HBase) can be used to display the time series
data for analysis.
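
To make the ingestion step concrete, the sketch below sends simulated sensor readings into the stream buffering layer through Azure Event Hubs using the azure-eventhub (v5) Python SDK. The connection string, hub name, and payload shape are assumptions, and IoT Hub or Kafka could play the same role.

```python
# Minimal sketch of the ingestion step: send timestamped sensor readings into
# Event Hubs, which acts as the stream buffering layer. Assumes the
# azure-eventhub package (v5) and an Event Hubs namespace connection string.
import json
import time
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    "<your-event-hubs-connection-string>", eventhub_name="telemetry")

batch = producer.create_batch()
for reading in range(10):
    event = {
        "deviceId": "sensor-001",
        "timestamp": time.time(),
        "temperature": 21.5 + reading * 0.1,
    }
    batch.add(EventData(json.dumps(event)))

producer.send_batch(batch)   # downstream stream processing picks these up
producer.close()
```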
Another option is to use Azure Time Series Insights. Time Series Insights is a fully managed service for time series
data. In this architecture, Time Series Insights performs the roles of stream processing, data store, and analytics and
reporting. It accepts streaming data from either IoT Hub or Event Hubs and stores, processes, analyzes, and
displays the data in near real time. It does not pre-aggregate the data, but stores the raw events.
Time Series Insights is schema adaptive, which means that you do not have to do any data preparation to start
deriving insights. This enables you to explore, compare, and correlate a variety of data sources seamlessly. It also
provides SQL-like filters and aggregates, ability to construct, visualize, compare, and overlay various time series
patterns, heat maps, and the ability to save and share queries.

Technology choices
Data Storage
Analysis, visualizations, and reporting
Analytical Data Stores
Stream processing
Working with CSV and JSON files for data solutions

CSV and JSON are likely the most common formats used for ingesting, exchanging, and storing unstructured or
semi-structured data.

About CSV format


CSV (comma-separated values) files are commonly used to exchange tabular data between systems in plain text.
They typically contain a header row that provides column names for the data, but are otherwise considered semi-
structured. This is due to the fact that CSVs cannot naturally represent hierarchical or relational data. Data
relationships are typically handled with multiple CSV files, where foreign keys are stored in columns of one or more
files, but the relationships between those files are not expressed by the format itself. Files in CSV format may use
other delimiters besides commas, such as tabs or spaces.
Despite their limitations, CSV files are a popular choice for data exchange, because they are supported by a wide
range of business, consumer, and scientific applications. For example, database and spreadsheet programs can
import and export CSV files. Similarly, most batch and stream data processing engines, such as Spark and Hadoop,
natively support serializing and deserializing CSV-formatted files and offer ways to apply a schema on read. This
makes it easier to work with the data, by offering options to query against it and store the information in a more
efficient data format for faster processing.
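
The "schema on read" idea can be sketched with pandas: column types are applied as the CSV is read, and the result is written to a more efficient columnar format such as Parquet. The file names and columns are placeholders, and writing Parquet assumes the pyarrow package is installed.

```python
# Minimal sketch of applying a schema on read to a CSV file and persisting the
# result in a more query-efficient columnar format (Parquet via pyarrow).
import pandas as pd

csv_text = (
    "order_id,customer_id,order_date,amount\n"
    "1001,c-42,2020-11-02,54.47\n"
    "1002,c-17,2020-11-03,19.99\n"
)
with open("orders.csv", "w") as f:
    f.write(csv_text)

# The CSV itself carries no types; the schema is applied as the file is read.
orders = pd.read_csv(
    "orders.csv",
    dtype={"order_id": "int64", "customer_id": "string", "amount": "float64"},
    parse_dates=["order_date"],
)

orders.to_parquet("orders.parquet", index=False)  # read-optimized columnar copy
print(orders.dtypes)
```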

About JSON format


JSON (JavaScript Object Notation) data is represented as key-value pairs in a semi-structured format. JSON is often
compared to XML, as both are capable of storing data in hierarchical format, with child data represented inline with
its parent. Both are self-describing and human readable, but JSON documents tend to be much smaller, leading to
their popular use in online data exchange, especially with the advent of REST-based web services.
JSON-formatted files have several benefits over CSV:
JSON maintains hierarchical structures, making it easier to hold related data in a single document and represent
complex relationships.
Most programming languages provide native support for deserializing JSON into objects, or provide
lightweight JSON serialization libraries.
JSON supports lists of objects, helping to avoid messy translations of lists into a relational data model.
JSON is a commonly used file format for NoSQL databases, such as MongoDB, Couchbase, and Azure Cosmos
DB.
Since a lot of data coming across the wire is already in JSON format, most web-based programming languages
support working with JSON natively, or through the use of external libraries to serialize and deserialize JSON data.
This universal support for JSON has led to its use in logical formats through data structure representation,
exchange formats for hot data, and data storage for cold data.
Many batch and stream data processing engines natively support JSON serialization and deserialization. Though
the data contained within JSON documents may ultimately be stored in more performance-optimized formats,
such as Parquet or Avro, it serves as the raw source of truth, which is critical for reprocessing the data as
needed.

When to use CSV or JSON formats


CSVs are more commonly used for exporting and importing data, or processing it for analytics and machine
learning. JSON-formatted files have the same benefits, but are more common in hot data exchange solutions. JSON
documents are often sent by web and mobile devices performing online transactions, by IoT (internet of things)
devices for one-way or bidirectional communication, or by client applications communicating with SaaS and PaaS
services or serverless architectures.
CSV and JSON file formats both make it easy to exchange data between dissimilar systems or devices. Their semi-structured nature allows flexibility in transferring almost any type of data, and universal support for these formats makes them simple to work with. Both can be used as the raw source of truth in cases where the processed data is stored in binary formats for more efficient querying.

Working with CSV and JSON data in Azure


Azure provides several solutions for working with CSV and JSON files, depending on your needs. The primary
landing place for these files is either Azure Storage or Azure Data Lake Store. Most Azure services that work with
these and other text-based files integrate with either object storage service. In some situations, however, you may
opt to directly import the data into Azure SQL or some other data store. SQL Server has native support for storing
and working with JSON documents, which makes it easy to import and process those types of files. You can use a
utility like SQL Bulk Import to easily import CSV files.
You can also query JSON files directly from Azure Blob Storage without importing them into Azure SQL. For a
complete example of this approach, see Work with JSON files with Azure SQL. Currently this option isn't available
for CSV files.
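Client code can also pull a JSON document straight out of Blob Storage and work with it locally. The following hedged sketch uses the azure-storage-blob (v12) Python SDK; the connection string, container, blob path, and payload shape are placeholders.

```python
import json
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="raw-data", blob="orders/2020/11/orders.json")

# Download the blob contents and deserialize them as JSON.
document = json.loads(blob.download_blob().readall())
print(len(document["orders"]))  # assumes the file holds an {"orders": [...]} payload
```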
Depending on the scenario, you may perform batch processing or real-time processing of the data.

Challenges
There are some challenges to consider when working with these formats:
Without any constraints on the data model, CSV and JSON files are prone to data corruption ("garbage in,
garbage out"). For instance, neither format has a notion of a date/time type, so nothing in the file format
prevents you from inserting "ABC123" in a date field.
Using CSV and JSON files as your cold storage solution does not scale well when working with big data. In
most cases, they cannot be split into partitions for parallel processing, and cannot be compressed as well as
binary formats. This often leads to processing and storing the data in read-optimized formats such as
Parquet and ORC (optimized row columnar), which also provide indexes and inline statistics about the data
they contain.
You may need to apply a schema on the semi-structured data to make it easier to query and analyze.
Typically, this requires storing the data in another form that complies with your environment's data storage
needs, such as within a database.
Big data architectures

A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or
complex for traditional database systems. The threshold at which organizations enter into the big data realm
differs, depending on the capabilities of the users and their tools. For some, it can mean hundreds of gigabytes of
data, while for others it means hundreds of terabytes. As tools for working with big data sets advance, so does the
meaning of big data. More and more, this term relates to the value you can extract from your data sets through
advanced analytics, rather than strictly the size of the data, although in these cases they tend to be quite large.
Over the years, the data landscape has changed. What you can do, or are expected to do, with data has changed.
The cost of storage has fallen dramatically, while the means by which data is collected keeps growing. Some data
arrives at a rapid pace, constantly demanding to be collected and observed. Other data arrives more slowly, but in
very large chunks, often in the form of decades of historical data. You might be facing an advanced analytics
problem, or one that requires machine learning. These are challenges that big data architectures seek to solve.
Big data solutions typically involve one or more of the following types of workload:
Batch processing of big data sources at rest.
Real-time processing of big data in motion.
Interactive exploration of big data.
Predictive analytics and machine learning.
Consider big data architectures when you need to:
Store and process data in volumes too large for a traditional database.
Transform unstructured data for analysis and reporting.
Capture, process, and analyze unbounded streams of data in real time, or with low latency.

Components of a big data architecture


The following diagram shows the logical components that fit into a big data architecture. Individual solutions may
not contain every item in this diagram.

Most big data architectures include some or all of the following components:
Data sources . All big data solutions start with one or more data sources. Examples include:
Application data stores, such as relational databases.
Static files produced by applications, such as web server log files.
Real-time data sources, such as IoT devices.
Data storage . Data for batch processing operations is typically stored in a distributed file store that can
hold high volumes of large files in various formats. This kind of store is often called a data lake. Options for
implementing this storage include Azure Data Lake Store or blob containers in Azure Storage.
Batch processing . Because the data sets are so large, often a big data solution must process data files
using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually these
jobs involve reading source files, processing them, and writing the output to new files. Options include
running U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom Map/Reduce jobs in an
HDInsight Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight Spark cluster.
Real-time message ingestion . If the solution includes real-time sources, the architecture must include a
way to capture and store real-time messages for stream processing. This might be a simple data store,
where incoming messages are dropped into a folder for processing. However, many solutions need a
message ingestion store to act as a buffer for messages, and to support scale-out processing, reliable
delivery, and other message queuing semantics. This portion of a streaming architecture is often referred to
as stream buffering. Options include Azure Event Hubs, Azure IoT Hub, and Kafka.
Stream processing . After capturing real-time messages, the solution must process them by filtering,
aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an
output sink. Azure Stream Analytics provides a managed stream processing service based on perpetually
running SQL queries that operate on unbounded streams. You can also use open source Apache streaming
technologies like Storm and Spark Streaming in an HDInsight cluster.
Analytical data store . Many big data solutions prepare data for analysis and then serve the processed
data in a structured format that can be queried using analytical tools. The analytical data store used to serve
these queries can be a Kimball-style relational data warehouse, as seen in most traditional business
intelligence (BI) solutions. Alternatively, the data could be presented through a low-latency NoSQL
technology such as HBase, or an interactive Hive database that provides a metadata abstraction over data
files in the distributed data store. Azure Synapse Analytics provides a managed service for large-scale,
cloud-based data warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL, which can also
be used to serve data for analysis.
Analysis and reporting. The goal of most big data solutions is to provide insights into the data through
analysis and reporting. To empower users to analyze the data, the architecture may include a data modeling
layer, such as a multidimensional OLAP cube or tabular data model in Azure Analysis Services. It might also
support self-service BI, using the modeling and visualization technologies in Microsoft Power BI or Microsoft
Excel. Analysis and reporting can also take the form of interactive data exploration by data scientists or data
analysts. For these scenarios, many Azure services support analytical notebooks, such as Jupyter, enabling
these users to leverage their existing skills with Python or R. For large-scale data exploration, you can use
Microsoft R Server, either standalone or with Spark.
Orchestration . Most big data solutions consist of repeated data processing operations, encapsulated in
workflows, that transform source data, move data between multiple sources and sinks, load the processed
data into an analytical data store, or push the results straight to a report or dashboard. To automate these
workflows, you can use an orchestration technology such as Azure Data Factory, or Apache Oozie and Sqoop.

Lambda architecture
When working with very large data sets, it can take a long time to run the sort of queries that clients need. These
queries can't be performed in real time, and often require algorithms such as MapReduce that operate in parallel
across the entire data set. The results are then stored separately from the raw data and used for querying.
One drawback to this approach is that it introduces latency — if processing takes a few hours, a query may return
results that are several hours old. Ideally, you would like to get some results in real time (perhaps with some loss of
accuracy), and combine these results with the results from the batch analytics.
The lambda architecture , first proposed by Nathan Marz, addresses this problem by creating two paths for data
flow. All data coming into the system goes through these two paths:
A batch layer (cold path) stores all of the incoming data in its raw form and performs batch processing on
the data. The result of this processing is stored as a batch view .
A speed layer (hot path) analyzes data in real time. This layer is designed for low latency, at the expense of
accuracy.
The batch layer feeds into a serving layer that indexes the batch view for efficient querying. The speed layer
updates the serving layer with incremental updates based on the most recent data.

Data that flows into the hot path is constrained by latency requirements imposed by the speed layer, so that it can
be processed as quickly as possible. Often, this requires a tradeoff of some level of accuracy in favor of data that is
ready as quickly as possible. For example, consider an IoT scenario where a large number of temperature sensors
are sending telemetry data. The speed layer may be used to process a sliding time window of the incoming data.
Data flowing into the cold path, on the other hand, is not subject to the same low latency requirements. This allows
for high accuracy computation across large data sets, which can be very time intensive.
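Continuing the temperature-sensor example, a hot-path sliding-window aggregation might look like the following Spark Structured Streaming sketch. It is only illustrative: the Kafka-compatible endpoint (for example, an Event Hubs Kafka endpoint), topic name, and schema are placeholders, and the required connector package and security options are omitted.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("speed-layer").getOrCreate()

schema = StructType([
    StructField("sensorId", StringType()),
    StructField("temperature", DoubleType()),
    StructField("eventTime", TimestampType()),
])

# Read the incoming event stream (requires the Spark Kafka connector; endpoint is a placeholder).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "<event-hubs-kafka-endpoint>:9093")
       .option("subscribe", "telemetry")
       .load())

readings = raw.select(from_json(col("value").cast("string"), schema).alias("r")).select("r.*")

# Average temperature per sensor over a 5-minute window that slides every minute.
hot_view = (readings
            .withWatermark("eventTime", "10 minutes")
            .groupBy(window(col("eventTime"), "5 minutes", "1 minute"), col("sensorId"))
            .agg(avg("temperature").alias("avgTemperature")))

query = (hot_view.writeStream
         .outputMode("append")       # results emitted once the watermark passes each window
         .format("console")          # stand-in for a real sink such as storage or a database
         .start())
```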
Eventually, the hot and cold paths converge at the analytics client application. If the client needs to display timely,
yet potentially less accurate data in real time, it will acquire its result from the hot path. Otherwise, it will select
results from the cold path to display less timely but more accurate data. In other words, the hot path has data for a
relatively small window of time, after which the results can be updated with more accurate data from the cold path.
The raw data stored at the batch layer is immutable. Incoming data is always appended to the existing data, and the
previous data is never overwritten. Any changes to the value of a particular datum are stored as a new
timestamped event record. This allows for recomputation at any point in time across the history of the data
collected. The ability to recompute the batch view from the original raw data is important, because it allows for
new views to be created as the system evolves.

Kappa architecture
A drawback to the lambda architecture is its complexity. Processing logic appears in two different places — the cold
and hot paths — using different frameworks. This leads to duplicate computation logic and the complexity of
managing the architecture for both paths.
The kappa architecture was proposed by Jay Kreps as an alternative to the lambda architecture. It has the same
basic goals as the lambda architecture, but with an important distinction: All data flows through a single path,
using a stream processing system.

There are some similarities to the lambda architecture's batch layer, in that the event data is immutable and all of it
is collected, instead of a subset. The data is ingested as a stream of events into a distributed and fault tolerant
unified log. These events are ordered, and the current state of an event is changed only by a new event being
appended. Similar to a lambda architecture's speed layer, all event processing is performed on the input stream and
persisted as a real-time view.
If you need to recompute the entire data set (equivalent to what the batch layer does in lambda), you simply replay
the stream, typically using parallelism to complete the computation in a timely fashion.
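As a hedged sketch of that replay step, the same streaming job can simply be pointed at the beginning of the retained log rather than the latest offsets; the broker address and topic below are placeholders, and the Spark Kafka connector is assumed to be available.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kappa-replay").getOrCreate()

# startingOffsets="earliest" replays the full retained event history, letting the
# downstream logic recompute its views from scratch with the same code path.
replayed = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "<broker-endpoint>:9093")
            .option("subscribe", "telemetry")
            .option("startingOffsets", "earliest")
            .load())
```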

Internet of Things (IoT)


From a practical viewpoint, Internet of Things (IoT) represents any device that is connected to the Internet. This
includes your PC, mobile phone, smart watch, smart thermostat, smart refrigerator, connected automobile, heart
monitoring implants, and anything else that connects to the Internet and sends or receives data. The number of
connected devices grows every day, as does the amount of data collected from them. Often this data is being
collected in highly constrained, sometimes high-latency environments. In other cases, data is sent from low-latency
environments by thousands or millions of devices, requiring the ability to rapidly ingest the data and process
accordingly. Therefore, proper planning is required to handle these constraints and unique requirements.
Event-driven architectures are central to IoT solutions. The following diagram shows a possible logical architecture
for IoT. The diagram emphasizes the event-streaming components of the architecture.

The cloud gateway ingests device events at the cloud boundary, using a reliable, low latency messaging system.
Devices might send events directly to the cloud gateway, or through a field gateway . A field gateway is a
specialized device or software, usually collocated with the devices, that receives events and forwards them to the
cloud gateway. The field gateway might also preprocess the raw device events, performing functions such as
filtering, aggregation, or protocol transformation.
After ingestion, events go through one or more stream processors that can route the data (for example, to
storage) or perform analytics and other processing.
The following are some common types of processing. (This list is certainly not exhaustive.)
Writing event data to cold storage, for archiving or batch analytics.
Hot path analytics, analyzing the event stream in (near) real time, to detect anomalies, recognize patterns
over rolling time windows, or trigger alerts when a specific condition occurs in the stream.
Handling special types of nontelemetry messages from devices, such as notifications and alarms.
Machine learning.
The boxes that are shaded gray show components of an IoT system that are not directly related to event streaming,
but are included here for completeness.
The device registry is a database of the provisioned devices, including the device IDs and usually device
metadata, such as location.
The provisioning API is a common external interface for provisioning and registering new devices.
Some IoT solutions allow command and control messages to be sent to devices.
Relevant Azure services:
Azure IoT Hub
Azure Event Hubs
Azure Stream Analytics
Learn more about IoT on Azure by reading the Azure IoT reference architecture.
Batch processing

A common big data scenario is batch processing of data at rest. In this scenario, the source data is loaded into data
storage, either by the source application itself or by an orchestration workflow. The data is then processed in-place
by a parallelized job, which can also be initiated by the orchestration workflow. The processing may include
multiple iterative steps before the transformed results are loaded into an analytical data store, which can be
queried by analytics and reporting components.
For example, the logs from a web server might be copied to a folder and then processed overnight to generate
daily reports of web activity.

When to use this solution


Batch processing is used in a variety of scenarios, from simple data transformations to a more complete ETL
(extract-transform-load) pipeline. In a big data context, batch processing may operate over very large data sets,
where the computation takes significant time. (For example, see Lambda architecture.) Batch processing typically
leads to further interactive exploration, provides the modeling-ready data for machine learning, or writes the data
to a data store that is optimized for analytics and visualization.
One example of batch processing is transforming a large set of flat, semi-structured CSV or JSON files into a
schematized and structured format that is ready for further querying. Typically the data is converted from the raw
formats used for ingestion (such as CSV) into binary formats that are more performant for querying because they
store data in a columnar format, and often provide indexes and inline statistics about the data.

Challenges
Data format and encoding . Some of the most difficult issues to debug happen when files use an
unexpected format or encoding. For example, source files might use a mix of UTF-16 and UTF-8 encoding,
or contain unexpected delimiters (space versus tab), or include unexpected characters. Another common
example is text fields that contain tabs, spaces, or commas that are interpreted as delimiters. Data loading
and parsing logic must be flexible enough to detect and handle these issues, as shown in the sketch after this list.
Orchestrating time slices . Often source data is placed in a folder hierarchy that reflects processing
windows, organized by year, month, day, hour, and so on. In some cases, data may arrive late. For example,
suppose that a web server fails, and the logs for March 7th don't end up in the folder for processing until
March 9th. Are they just ignored because they're too late? Can the downstream processing logic handle out-
of-order records?
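The following minimal sketch (using pandas) shows one way to make loading logic tolerant of unknown delimiters and encodings; the file name is a placeholder and the list of encodings is only an example.

```python
import pandas as pd

def load_flexible_csv(path):
    """Try common encodings and let pandas sniff the delimiter."""
    for encoding in ("utf-8", "utf-16", "latin-1"):
        try:
            # sep=None with the Python engine sniffs commas, tabs, and other delimiters.
            return pd.read_csv(path, sep=None, engine="python", encoding=encoding)
        except (UnicodeDecodeError, UnicodeError):
            continue
    raise ValueError(f"Could not decode {path} with any of the attempted encodings")

df = load_flexible_csv("incoming/webserver-2020-03-07.log")
```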

Architecture
A batch processing architecture has the following logical components, shown in the diagram above.
Data storage . Typically a distributed file store that can serve as a repository for high volumes of large files
in various formats. Generically, this kind of store is often referred to as a data lake.
Batch processing . The high-volume nature of big data often means that solutions must process data files
using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually these
jobs involve reading source files, processing them, and writing the output to new files.
Analytical data store . Many big data solutions are designed to prepare data for analysis and then serve
the processed data in a structured format that can be queried using analytical tools.
Analysis and reporting. The goal of most big data solutions is to provide insights into the data through
analysis and reporting.
Orchestration . With batch processing, typically some orchestration is required to migrate or copy the data
into your data storage, batch processing, analytical data store, and reporting layers.

Technology choices
The following technologies are recommended choices for batch processing solutions in Azure.
Data storage
Azure Storage Blob Containers . Many existing Azure business processes already use Azure blob storage,
making this a good choice for a big data store.
Azure Data Lake Store . Azure Data Lake Store offers virtually unlimited storage for any size of file, and
extensive security options, making it a good choice for extremely large-scale big data solutions that require a
centralized store for data in heterogeneous formats.
For more information, see Data storage.
Batch processing
U-SQL . U-SQL is the query processing language used by Azure Data Lake Analytics. It combines the declarative
nature of SQL with the procedural extensibility of C#, and takes advantage of parallelism to enable efficient
processing of data at massive scale.
Hive . Hive is a SQL-like language that is supported in most Hadoop distributions, including HDInsight. It can be
used to process data from any HDFS-compatible store, including Azure blob storage and Azure Data Lake Store.
Pig . Pig is a declarative big data processing language used in many Hadoop distributions, including HDInsight.
It is particularly useful for processing data that is unstructured or semi-structured.
Spark . The Spark engine supports batch processing programs written in a range of languages, including Java,
Scala, and Python. Spark uses a distributed architecture to process data in parallel across multiple worker
nodes.
For more information, see Batch processing.
Analytical data store
Azure Synapse Analytics . Azure Synapse is a managed service based on SQL Server database technologies
and optimized to support large-scale data warehousing workloads.
Spark SQL . Spark SQL is an API built on Spark that supports the creation of dataframes and tables that can be
queried using SQL syntax.
HBase . HBase is a low-latency NoSQL store that offers a high-performance, flexible option for querying
structured and semi-structured data.
Hive . In addition to being useful for batch processing, Hive offers a database architecture that is conceptually
similar to that of a typical relational database management system. Improvements in Hive query performance
through innovations like the Tez engine and Stinger initiative mean that Hive tables can be used effectively as
sources for analytical queries in some scenarios.
For more information, see Analytical data stores.
Analytics and reporting
Azure Analysis Services. Many big data solutions emulate traditional enterprise business intelligence
architectures by including a centralized online analytical processing (OLAP) data model (often referred to as a
cube) on which reports, dashboards, and interactive "slice and dice" analysis can be based. Azure Analysis
Services supports the creation of tabular models to meet this need.
Power BI . Power BI enables data analysts to create interactive data visualizations based on data models in an
OLAP model or directly from an analytical data store.
Microsoft Excel . Microsoft Excel is one of the most widely used software applications in the world, and offers
a wealth of data analysis and visualization capabilities. Data analysts can use Excel to build document data
models from analytical data stores, or to retrieve data from OLAP data models into interactive PivotTables and
charts.
For more information, see Analytics and reporting.
Orchestration
Azure Data Factory. Azure Data Factory pipelines can be used to define a sequence of activities, scheduled for
recurring temporal windows. These activities can initiate data copy operations as well as Hive, Pig, MapReduce,
or Spark jobs in on-demand HDInsight clusters; U-SQL jobs in Azure Data Lake Analytics; and stored
procedures in Azure Synapse or Azure SQL Database.
Oozie and Sqoop . Oozie is a job automation engine for the Apache Hadoop ecosystem and can be used to
initiate data copy operations as well as Hive, Pig, and MapReduce jobs to process data and Sqoop jobs to copy
data between HDFS and SQL databases.
For more information, see Pipeline orchestration
Real time processing

Real time processing deals with streams of data that are captured in real-time and processed with minimal latency
to generate real-time (or near-real-time) reports or automated responses. For example, a real-time traffic
monitoring solution might use sensor data to detect high traffic volumes. This data could be used to dynamically
update a map to show congestion, or automatically initiate high-occupancy lanes or other traffic management
systems.

Real-time processing is defined as the processing of unbounded streams of input data, with very short latency
requirements for processing — measured in milliseconds or seconds. This incoming data typically arrives in an
unstructured or semi-structured format, such as JSON, and has the same processing requirements as batch
processing, but with shorter turnaround times to support real-time consumption.
Processed data is often written to an analytical data store, which is optimized for analytics and visualization. The
processed data can also be ingested directly into the analytics and reporting layer for analysis, business
intelligence, and real-time dashboard visualization.

Challenges
One of the big challenges of real-time processing solutions is to ingest, process, and store messages in real time,
especially at high volumes. Processing must be done in such a way that it does not block the ingestion pipeline.
The data store must support high-volume writes. Another challenge is being able to act on the data quickly, such
as generating alerts in real time or presenting the data in a real-time (or near-real-time) dashboard.

Architecture
A real-time processing architecture has the following logical components.
Real-time message ingestion. The architecture must include a way to capture and store real-time
messages to be consumed by a stream processing consumer. In simple cases, this service could be
implemented as a simple data store in which new messages are deposited in a folder. But often the solution
requires a message broker, such as Azure Event Hubs, that acts as a buffer for the messages. The message
broker should support scale-out processing and reliable delivery.
Stream processing. After capturing real-time messages, the solution must process them by filtering,
aggregating, and otherwise preparing the data for analysis.
Analytical data store. Many big data solutions are designed to prepare data for analysis and then serve
the processed data in a structured format that can be queried using analytical tools.
Analysis and reporting. The goal of most big data solutions is to provide insights into the data through
analysis and reporting.

Technology choices
The following technologies are recommended choices for real-time processing solutions in Azure.
Real-time message ingestion
Azure Event Hubs . Azure Event Hubs is a messaging solution for ingesting millions of event messages per
second. The captured event data can be processed by multiple consumers in parallel. While Event Hubs
natively supports AMQP (Advanced Message Queuing Protocol 1.0), it also provides a binary compatibility
layer that allows applications using the Kafka protocol (Kafka 1.0 and above) to process events using Event
Hubs with no application changes. A short publishing sketch follows this list of options.
Azure IoT Hub . Azure IoT Hub provides bi-directional communication between Internet-connected devices,
and a scalable message queue that can handle millions of simultaneously connected devices.
Apache Kafka . Kafka is an open source message queuing and stream processing application that can scale to
handle millions of messages per second from multiple message producers, and route them to multiple
consumers. Kafka is available in Azure as an HDInsight cluster type.
For more information, see Real-time message ingestion.
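As a hedged sketch of the ingestion side, the following snippet publishes a telemetry event to Event Hubs with the azure-eventhub (v5) Python SDK; the connection string, hub name, and payload are placeholders.

```python
import json
from azure.eventhub import EventData, EventHubProducerClient

producer = EventHubProducerClient.from_connection_string(
    "<event-hubs-connection-string>", eventhub_name="telemetry")

# Batch events to respect the hub's maximum message size, then send.
with producer:
    batch = producer.create_batch()
    batch.add(EventData(json.dumps({"deviceId": "sensor-42", "temperature": 21.7})))
    producer.send_batch(batch)
```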
Data storage
Azure Storage Blob Containers or Azure Data Lake Store . Incoming real-time data is usually captured in
a message broker (see above), but in some scenarios, it can make sense to monitor a folder for new files and
process them as they are created or updated. Additionally, many real-time processing solutions combine
streaming data with static reference data, which can be stored in a file store. Finally, file storage may be used as
an output destination for captured real-time data for archiving, or for further batch processing in a lambda
architecture.
For more information, see Data storage.
Stream processing
Azure Stream Analytics . Azure Stream Analytics can run perpetual queries against an unbounded stream of
data. These queries consume streams of data from storage or message brokers, filter and aggregate the data
based on temporal windows, and write the results to sinks such as storage, databases, or directly to reports in
Power BI. Stream Analytics uses a SQL-based query language that supports temporal and geospatial
constructs, and can be extended using JavaScript.
Storm . Apache Storm is an open source framework for stream processing that uses a topology of spouts and
bolts to consume, process, and output the results from real-time streaming data sources. You can provision
Storm in an Azure HDInsight cluster, and implement a topology in Java or C#.
Spark Streaming . Apache Spark is an open source distributed platform for general data processing. Spark
provides the Spark Streaming API, in which you can write code in any supported Spark language, including
Java, Scala, and Python. Spark 2.0 introduced the Spark Structured Streaming API, which provides a simpler
and more consistent programming model. Spark 2.0 is available in an Azure HDInsight cluster.
For more information, see Stream processing.
Analytical data store
Azure Synapse Analytics, HBase, Spark, or Hive. Processed real-time data can be stored in a relational
database such as Synapse Analytics, a NoSQL store such as HBase, or as files in distributed storage over which
Spark or Hive tables can be defined and queried.
For more information, see Analytical data stores.
Analytics and reporting
Azure Analysis Services, Power BI, and Microsoft Excel. Processed real-time data that is stored in an
analytical data store can be used for historical reporting and analysis in the same way as batch processed data.
Additionally, Power BI can be used to publish real-time (or near-real-time) reports and visualizations from
analytical data sources where latency is sufficiently low, or in some cases directly from the stream processing
output.
For more information, see Analytics and reporting.
In a purely real-time solution, most of the processing orchestration is managed by the message ingestion and
stream processing components. However, in a lambda architecture that combines batch processing and real-time
processing, you may need to use an orchestration framework such as Azure Data Factory or Apache Oozie and
Sqoop to manage batch workflows for captured real-time data.

Next steps
The following reference architecture shows an end-to-end stream processing pipeline:
Stream processing with Azure Stream Analytics
Choosing an analytical data store in Azure

In a big data architecture, there is often a need for an analytical data store that serves processed data in a
structured format that can be queried using analytical tools. Analytical data stores that support querying of both
hot-path and cold-path data are collectively referred to as the serving layer, or data serving storage.
The serving layer deals with processed data from both the hot path and cold path. In the lambda architecture, the
serving layer is subdivided into a speed serving layer, which stores data that has been processed incrementally,
and a batch serving layer, which contains the batch-processed output. The serving layer requires strong support
for random reads with low latency. Data storage for the speed layer should also support random writes, because
batch loading data into this store would introduce undesired delays. On the other hand, data storage for the batch
layer does not need to support random writes, but batch writes instead.
There is no single best data management choice for all data storage tasks. Different data management solutions
are optimized for different tasks. Most real-world cloud apps and big data processes have a variety of data storage
requirements and often use a combination of data storage solutions.

What are your options when choosing an analytical data store?


There are several options for data serving storage in Azure, depending on your needs:
Azure Synapse Analytics
Azure SQL Database
SQL Server in Azure VM
HBase/Phoenix on HDInsight
Hive LLAP on HDInsight
Azure Analysis Services
Azure Cosmos DB
These options provide various database models that are optimized for different types of tasks:
Key/value databases hold a single serialized object for each key value. They're good for storing large volumes
of data where you want to get one item for a given key value and you don't have to query based on other
properties of the item.
Document databases are key/value databases in which the values are documents. A "document" in this context
is a collection of named fields and values. The database typically stores the data in a format such as XML, YAML,
JSON, or BSON, but may use plain text. Document databases can query on non-key fields and define
secondary indexes to make querying more efficient. This makes a document database more suitable for
applications that need to retrieve data based on criteria more complex than the value of the document key. For
example, you could query on fields such as product ID, customer ID, or customer name.
Column-family databases are key/value data stores that structure data storage into collections of related
columns called column families. For example, a census database might have one group of columns for a
person's name (first, middle, last), one group for the person's address, and one group for the person's profile
information (date of birth, gender). The database can store each column family in a separate partition, while
keeping all of the data for one person related to the same key. An application can read a single column family
without reading through all of the data for an entity.
Graph databases store information as a collection of objects and relationships. A graph database can efficiently
perform queries that traverse the network of objects and the relationships between them. For example, the
objects might be employees in a human resources database, and you might want to facilitate queries such as
"find all employees who directly or indirectly work for Scott."

Key selection criteria


To narrow the choices, start by answering these questions:
Do you need serving storage that can serve as a hot path for your data? If yes, narrow your options to
those that are optimized for a speed serving layer.
Do you need massively parallel processing (MPP) support, where queries are automatically distributed
across several processes or nodes? If yes, select an option that supports query scale out.
Do you prefer to use a relational data store? If so, narrow your options to those with a relational database
model. However, note that some non-relational stores support SQL syntax for querying, and tools such as
PolyBase can be used to query non-relational data stores.

Capability matrix
The following tables summarize the key differences in capabilities.
General capabilities
| Capability | SQL Database | Azure Synapse | HBase/Phoenix on HDInsight | Hive LLAP on HDInsight | Azure Analysis Services | Cosmos DB |
| --- | --- | --- | --- | --- | --- | --- |
| Is managed service | Yes | Yes | Yes 1 | Yes 1 | Yes | Yes |
| Primary database model | Relational (columnar format when using columnstore indexes) | Relational tables with columnar storage | Wide column store | Hive/In-Memory | Tabular/MOLAP semantic models | Document store, graph, key-value store, wide column store |
| SQL language support | Yes | Yes | Yes (using Phoenix JDBC driver) | Yes | No | Yes |
| Optimized for speed serving layer | Yes 2 | No | Yes | Yes | No | Yes |

[1] With manual configuration and scaling.
[2] Using memory-optimized tables and hash or nonclustered indexes.
Scalability capabilities
| Capability | SQL Database | Azure Synapse | HBase/Phoenix on HDInsight | Hive LLAP on HDInsight | Azure Analysis Services | Cosmos DB |
| --- | --- | --- | --- | --- | --- | --- |
| Redundant regional servers for high availability | Yes | Yes | Yes | No | No | Yes |
| Supports query scale out | No | Yes | Yes | Yes | Yes | Yes |
| Dynamic scalability (scale up) | Yes | Yes | No | No | Yes | Yes |
| Supports in-memory caching of data | Yes | Yes | No | Yes | Yes | No |

Security capabilities
| Capability | SQL Database | Azure Synapse | HBase/Phoenix on HDInsight | Hive LLAP on HDInsight | Azure Analysis Services | Cosmos DB |
| --- | --- | --- | --- | --- | --- | --- |
| Authentication | SQL / Azure Active Directory (Azure AD) | SQL / Azure AD | local / Azure AD 1 | local / Azure AD 1 | Azure AD | database users / Azure AD via access control (IAM) |
| Data encryption at rest | Yes 2 | Yes 2 | Yes 1 | Yes 1 | Yes | Yes |
| Row-level security | Yes | Yes 3 | Yes 1 | Yes 1 | Yes (through object-level security in model) | No |
| Supports firewalls | Yes | Yes | Yes 4 | Yes 4 | Yes | Yes |
| Dynamic data masking | Yes | Yes | Yes 1 | Yes | No | No |

[1] Requires using a domain-joined HDInsight cluster.
[2] Requires using transparent data encryption (TDE) to encrypt and decrypt your data at rest.
[3] Filter predicates only. See Row-Level Security.
[4] When used within an Azure Virtual Network. See Extend Azure HDInsight using an Azure Virtual Network.
Choosing a data analytics technology in Azure

The goal of most big data solutions is to provide insights into the data through analysis and reporting. This can
include preconfigured reports and visualizations, or interactive data exploration.

What are your options when choosing a data analytics technology?


There are several options for analysis, visualizations, and reporting in Azure, depending on your needs:
Power BI
Jupyter Notebooks
Zeppelin Notebooks
Microsoft Azure Notebooks
Power BI
Power BI is a suite of business analytics tools. It can connect to hundreds of data sources, and can be used for ad
hoc analysis. See this list of the currently available data sources. Use Power BI Embedded to integrate Power BI
within your own applications without requiring any additional licensing.
Organizations can use Power BI to produce reports and publish them to the organization. Everyone can create
personalized dashboards, with governance and security built in. Power BI uses Azure Active Directory (Azure AD)
to authenticate users who log in to the Power BI service, and uses the Power BI login credentials whenever a user
attempts to access resources that require authentication.
Jupyter Notebooks
Jupyter Notebooks provide a browser-based shell that lets data scientists create notebook files that contain
Python, Scala, or R code and markdown text, making it an effective way to collaborate by sharing and
documenting code and results in a single document.
Most varieties of HDInsight clusters, such as Spark or Hadoop, come preconfigured with Jupyter notebooks for
interacting with data and submitting jobs for processing. Depending on the type of HDInsight cluster you are
using, one or more kernels will be provided for interpreting and running your code. For example, Spark clusters
on HDInsight provide Spark-related kernels that you can select from to execute Python or Scala code using the
Spark engine.
Jupyter notebooks provide a great environment for analyzing, visualizing, and processing your data prior to
building more advanced visualizations with a BI/reporting tool like Power BI.
Zeppelin Notebooks
Zeppelin Notebooks are another option for a browser-based shell, similar to Jupyter in functionality. Some
HDInsight clusters come preconfigured with Zeppelin notebooks. However, if you are using an HDInsight
Interactive Query (Hive LLAP) cluster, Zeppelin is currently your only choice of notebook that you can use to run
interactive Hive queries. Also, if you are using a domain-joined HDInsight cluster, Zeppelin notebooks are the only
type that enables you to assign different user logins to control access to notebooks and the underlying Hive tables.
Microsoft Azure Notebooks
Azure Notebooks is an online Jupyter Notebooks-based service that enables data scientists to create, run, and
share Jupyter Notebooks in cloud-based libraries. Azure Notebooks provides execution environments for Python
2, Python 3, F#, and R, and provides several charting libraries for visualizing your data, such as ggplot, matplotlib,
bokeh, and seaborn.
Unlike Jupyter notebooks running on an HDInsight cluster, which are connected to the cluster's default storage
account, Azure Notebooks does not provide any data. You must load data in a variety of ways, such as downloading
data from an online source, interacting with Azure Blobs or Table Storage, connecting to a SQL database, or
loading data with the Copy Wizard for Azure Data Factory.
Key benefits:
Free service—no Azure subscription required.
No need to install Jupyter and the supporting R or Python distributions locally—just use a browser.
Manage your own online libraries and access them from any device.
Share your notebooks with collaborators.
Considerations:
You will be unable to access your notebooks when offline.
Limited processing capabilities of the free notebook service may not be enough to train large or complex
models.

Key selection criteria


To narrow the choices, start by answering these questions:
Do you need to connect to numerous data sources, providing a centralized place to create reports for data
spread throughout your domain? If so, choose an option that allows you to connect to 100s of data sources.
Do you want to embed dynamic visualizations in an external website or application? If so, choose an option
that provides embedding capabilities.
Do you want to design your visualizations and reports while offline? If yes, choose an option with offline
capabilities.
Do you need heavy processing power to train large or complex AI models or work with very large data
sets? If yes, choose an option that can connect to a big data cluster.

Capability matrix
The following tables summarize the key differences in capabilities.
General capabilities
| Capability | Power BI | Jupyter Notebooks | Zeppelin Notebooks | Microsoft Azure Notebooks |
| --- | --- | --- | --- | --- |
| Connect to big data cluster for advanced processing | Yes | Yes | Yes | No |
| Managed service | Yes | Yes 1 | Yes 1 | Yes |
| Connect to 100s of data sources | Yes | No | No | No |
| Offline capabilities | Yes 2 | No | No | No |
| Embedding capabilities | Yes | No | No | No |
| Automatic data refresh | Yes | No | No | No |
| Access to numerous open source packages | No | Yes 3 | Yes 3 | Yes 4 |
| Data transformation/cleansing options | Power Query, R | 40 languages, including Python, R, Julia, and Scala | 20+ interpreters, including Python, JDBC, and R | Python, F#, R |
| Pricing | Free for Power BI Desktop (authoring), see pricing for hosting options | Free | Free | Free |
| Multiuser collaboration | Yes | Yes (through sharing or with a multiuser server like JupyterHub) | Yes | Yes (through sharing) |

[1] When used as part of a managed HDInsight cluster.
[2] With the use of Power BI Desktop.
[3] You can search the Maven repository for community-contributed packages.
[4] Python packages can be installed using either pip or conda. R packages can be installed from CRAN or GitHub. Packages in F# can be installed via nuget.org using the Paket dependency manager.
Choosing a batch processing technology in Azure

Big data solutions often use long-running batch jobs to filter, aggregate, and otherwise prepare the data for
analysis. Usually these jobs involve reading source files from scalable storage (like HDFS, Azure Data Lake Store,
and Azure Storage), processing them, and writing the output to new files in scalable storage.
The key requirement of such batch processing engines is the ability to scale out computations, in order to handle a
large volume of data. Unlike real-time processing, however, batch processing is expected to have latencies (the
time between data ingestion and computing a result) that measure in minutes to hours.

Technology choices for batch processing


Azure Synapse Analytics
Azure Synapse is a distributed system designed to perform analytics on large data. It supports massively parallel
processing (MPP), which makes it suitable for running high-performance analytics. Consider Azure Synapse when
you have large amounts of data (more than 1 TB) and are running an analytics workload that will benefit from
parallelism.
Azure Data Lake Analytics
Data Lake Analytics is an on-demand analytics job service. It is optimized for distributed processing of very large
data sets stored in Azure Data Lake Store.
Languages: U-SQL (including Python, R, and C# extensions).
Integrates with Azure Data Lake Store, Azure Storage blobs, Azure SQL Database, and Azure Synapse.
Pricing model is per-job.
HDInsight
HDInsight is a managed Hadoop service. Use it to deploy and manage Hadoop clusters in Azure. For batch
processing, you can use Spark, Hive, Hive LLAP, or MapReduce.
Languages: R, Python, Java, Scala, SQL
Kerberos authentication with Active Directory, Apache Ranger based access control
Gives you full control of the Hadoop cluster
Azure Databricks
Azure Databricks is an Apache Spark-based analytics platform. You can think of it as "Spark as a service." It's the
easiest way to use Spark on the Azure platform.
Languages: R, Python, Java, Scala, Spark SQL
Fast cluster start times, autotermination, autoscaling.
Manages the Spark cluster for you.
Built-in integration with Azure Blob Storage, Azure Data Lake Storage (ADLS), Azure Synapse, and other
services. See Data Sources.
User authentication with Azure Active Directory.
Web-based notebooks for collaboration and data exploration.
Supports GPU-enabled clusters
Azure Distributed Data Engineering Toolkit
The Distributed Data Engineering Toolkit (AZTK) is a tool for provisioning on-demand Spark on Docker clusters in
Azure.
AZTK is not an Azure service. Rather, it's a client-side tool with a CLI and Python SDK interface, that's built on Azure
Batch. This option gives you the most control over the infrastructure when deploying a Spark cluster.
Bring your own Docker image.
Use low-priority VMs for an 80% discount.
Mixed mode clusters that use both low-priority and dedicated VMs.
Built in support for Azure Blob Storage and Azure Data Lake connection.

Key selection criteria


To narrow the choices, start by answering these questions:
Do you want a managed service rather than managing your own servers?
Do you want to author batch processing logic declaratively or imperatively?
Will you perform batch processing in bursts? If yes, consider options that let you auto-terminate the cluster
or whose pricing model is per batch job.
Do you need to query relational data stores along with your batch processing, for example to look up
reference data? If yes, consider the options that enable querying of external relational stores.

Capability matrix
The following tables summarize the key differences in capabilities.
General capabilities
| Capability | Azure Data Lake Analytics | Azure Synapse | HDInsight | Azure Databricks |
| --- | --- | --- | --- | --- |
| Is managed service | Yes | Yes | Yes 1 | Yes |
| Relational data store | Yes | Yes | No | No |
| Pricing model | Per batch job | By cluster hour | By cluster hour | Databricks Unit 2 + cluster hour |

[1] With manual configuration and scaling.
[2] A Databricks Unit (DBU) is a unit of processing capability per hour.
Capabilities
| Capability | Azure Data Lake Analytics | Azure Synapse | HDInsight with Spark | HDInsight with Hive | HDInsight with Hive LLAP | Azure Databricks |
| --- | --- | --- | --- | --- | --- | --- |
| Autoscaling | No | No | No | No | No | Yes |
| Scale-out granularity | Per job | Per cluster | Per cluster | Per cluster | Per cluster | Per cluster |
| In-memory caching of data | No | Yes | Yes | No | Yes | Yes |
| Query from external relational stores | Yes | No | Yes | No | No | Yes |
| Authentication | Azure AD | SQL / Azure AD | No | Azure AD 1 | Azure AD 1 | Azure AD |
| Auditing | Yes | Yes | No | Yes 1 | Yes 1 | Yes |
| Row-level security | No | Yes 2 | No | Yes 1 | Yes 1 | No |
| Supports firewalls | Yes | Yes | Yes | Yes 3 | Yes 3 | No |
| Dynamic data masking | No | Yes | No | Yes 1 | Yes 1 | No |

[1] Requires using a domain-joined HDInsight cluster.
[2] Filter predicates only. See Row-Level Security.
[3] Supported when used within an Azure Virtual Network.
Data lakes

A data lake is a storage repository that holds a large amount of data in its native, raw format. Data lake stores are
optimized for scaling to terabytes and petabytes of data. The data typically comes from multiple heterogeneous
sources, and may be structured, semi-structured, or unstructured. The idea with a data lake is to store everything in
its original, untransformed state. This approach differs from a traditional data warehouse, which transforms and
processes the data at the time of ingestion.
Advantages of a data lake:
Data is never thrown away, because the data is stored in its raw format. This is especially useful in a big data
environment, when you may not know in advance what insights are available from the data.
Users can explore the data and create their own queries.
May be faster than traditional ETL tools.
More flexible than a data warehouse, because it can store unstructured and semi-structured data.
A complete data lake solution consists of both storage and processing. Data lake storage is designed for fault-
tolerance, infinite scalability, and high-throughput ingestion of data with varying shapes and sizes. Data lake
processing involves one or more processing engines built with these goals in mind, and can operate on data stored
in a data lake at scale.

When to use a data lake


Typical uses for a data lake include data exploration, data analytics, and machine learning.
A data lake can also act as the data source for a data warehouse. With this approach, the raw data is ingested into
the data lake and then transformed into a structured queryable format. Typically this transformation uses an ELT
(extract-load-transform) pipeline, where the data is ingested and transformed in place. Source data that is already
relational may go directly into the data warehouse, using an ETL process, skipping the data lake.
Data lake stores are often used in event streaming or IoT scenarios, because they can persist large amounts of
relational and nonrelational data without transformation or schema definition. They are built to handle high
volumes of small writes at low latency, and are optimized for massive throughput.

Challenges
Lack of a schema or descriptive metadata can make the data hard to consume or query.
Lack of semantic consistency across the data can make it challenging to perform analysis on the data, unless
users are highly skilled at data analytics.
It can be hard to guarantee the quality of the data going into the data lake.
Without proper governance, access control and privacy issues can be problems. What information is going into
the data lake, who can access that data, and for what uses?
A data lake may not be the best way to integrate data that is already relational.
By itself, a data lake does not provide integrated or holistic views across the organization.
A data lake may become a dumping ground for data that is never actually analyzed or mined for insights.

Relevant Azure services


Data Lake Store is a hyperscale, Hadoop-compatible repository.
Data Lake Analytics is an on-demand analytics job service to simplify big data analytics.
Choosing a big data storage technology in Azure

This topic compares options for data storage for big data solutions — specifically, data storage for bulk data
ingestion and batch processing, as opposed to analytical data stores or real-time streaming ingestion.

What are your options when choosing data storage in Azure?


There are several options for ingesting data into Azure, depending on your needs.
File storage:
Azure Storage blobs
Azure Data Lake Store
NoSQL databases:
Azure Cosmos DB
HBase on HDInsight

Azure Storage blobs


Azure Storage is a managed storage service that is highly available, secure, durable, scalable, and redundant.
Microsoft takes care of maintenance and handles critical problems for you. Azure Storage is the most ubiquitous
storage solution Azure provides, due to the number of services and tools that can be used with it.
There are various Azure Storage services you can use to store data. The most flexible option for storing blobs from
a number of data sources is Blob storage. Blobs are basically files. They store pictures, documents, HTML files,
virtual hard disks (VHDs), big data such as logs, database backups — pretty much anything. Blobs are stored in
containers, which are similar to folders. A container provides a grouping of a set of blobs. A storage account can
contain an unlimited number of containers, and a container can store an unlimited number of blobs.
Azure Storage is a good choice for big data and analytics solutions, because of its flexibility, high availability, and
low cost. It provides hot, cool, and archive storage tiers for different use cases. For more information, see Azure
Blob Storage: Hot, cool, and archive storage tiers.
Azure Blob storage can be accessed from Hadoop (available through HDInsight). HDInsight can use a blob
container in Azure Storage as the default file system for the cluster. Through a Hadoop distributed file system
(HDFS) interface provided by a WASB driver, the full set of components in HDInsight can operate directly on
structured or unstructured data stored as blobs. Azure Blob storage can also be accessed via Azure Synapse
Analytics using its PolyBase feature.
Other features that make Azure Storage a good choice are:
Multiple concurrency strategies.
Disaster recovery and high availability options.
Encryption at rest.
Role-based access control (RBAC) to control access using Azure Active Directory users and groups.
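As a minimal sketch of landing raw files in a blob container, the following snippet uses the azure-storage-blob (v12) Python SDK; the connection string, container, local file, and blob path are placeholders.

```python
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("raw-data")

# Upload a local log file into a date-partitioned folder layout within the container.
with open("webserver-2020-11-02.log", "rb") as data:
    container.upload_blob(name="logs/2020/11/02/webserver.log", data=data, overwrite=True)
```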

Azure Data Lake Store


Azure Data Lake Store is an enterprise-wide hyperscale repository for big data analytic workloads. Data Lake
enables you to capture data of any size, type, and ingestion speed in one single secure location for operational and
exploratory analytics.
Data Lake Store does not impose any limits on account sizes, file sizes, or the amount of data that can be stored in
a data lake. Data is stored durably by making multiple copies and there is no limit on the duration of time that the
data can be stored in the Data Lake. In addition to making multiple copies of files to guard against any unexpected
failures, Data Lake Store spreads parts of a file over a number of individual storage servers. This improves the read
throughput when reading the file in parallel for performing data analytics.
Data Lake Store can be accessed from Hadoop (available through HDInsight) using the WebHDFS-compatible
REST APIs. You may consider using this as an alternative to Azure Storage when your individual or combined file
sizes exceed that which is supported by Azure Storage. However, there are performance tuning guidelines you
should follow when using Data Lake Store as your primary storage for an HDInsight cluster, with specific
guidelines for Spark, Hive, MapReduce, and Storm. Also, be sure to check Data Lake Store's regional availability,
because it is not available in as many regions as Azure Storage, and it needs to be located in the same region as
your HDInsight cluster.
Coupled with Azure Data Lake Analytics, Data Lake Store is specifically designed to enable analytics on the stored
data and is tuned for performance for data analytics scenarios. Data Lake Store can also be accessed via Azure
Synapse using its PolyBase feature.

Azure Cosmos DB
Azure Cosmos DB is Microsoft's globally distributed multi-model database. Cosmos DB guarantees single-digit-
millisecond latencies at the 99th percentile anywhere in the world, offers multiple well-defined consistency models
to fine-tune performance, and guarantees high availability with multi-homing capabilities.
Azure Cosmos DB is schema-agnostic. It automatically indexes all the data without requiring you to deal with
schema and index management. It's also multi-model, natively supporting document, key-value, graph, and
column-family data models.
Azure Cosmos DB features:
Geo-replication
Elastic scaling of throughput and storage worldwide
Five well-defined consistency levels
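The following hedged sketch shows the schema-agnostic model described above with the azure-cosmos (v4) Python SDK: documents are inserted without any upfront schema or index definition and then queried with SQL-style syntax. The endpoint, key, database, container, and documents are placeholders.

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<account-key>")
container = client.get_database_client("telemetry").get_container_client("readings")

# No schema or secondary-index definition is required before writing a document.
container.upsert_item({
    "id": "reading-001",
    "deviceId": "sensor-42",
    "temperature": 21.7,
    "tags": ["lab", "floor-3"],
})

# Query on any property; all properties are indexed automatically.
for item in container.query_items(
        query="SELECT c.deviceId, c.temperature FROM c WHERE c.temperature > 20",
        enable_cross_partition_query=True):
    print(item)
```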

HBase on HDInsight
Apache HBase is an open-source, NoSQL database that is built on Hadoop and modeled after Google BigTable.
HBase provides random access and strong consistency for large amounts of unstructured and semi-structured
data in a schemaless database organized by column families.
Data is stored in the rows of a table, and data within a row is grouped by column family. HBase is schemaless in
the sense that neither the columns nor the type of data stored in them need to be defined before using them. The
open-source code scales linearly to handle petabytes of data on thousands of nodes. It can rely on data
redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop
ecosystem.
The HDInsight implementation leverages the scale-out architecture of HBase to provide automatic sharding of
tables, strong consistency for reads and writes, and automatic failover. Performance is enhanced by in-memory
caching for reads and high-throughput streaming for writes. In most cases, you'll want to create the HBase cluster
inside a virtual network so other HDInsight clusters and applications can directly access the tables.

Key selection criteria


To narrow the choices, start by answering these questions:
Do you need managed, high-speed, cloud-based storage for any type of text or binary data? If yes, then
select one of the file storage options.
Do you need file storage that is optimized for parallel analytics workloads and high throughput/IOPS? If
yes, then choose an option that is tuned to analytics workload performance.
Do you need to store unstructured or semi-structured data in a schemaless database? If so, select one of the
non-relational options. Compare options for indexing and database models. Depending on the type of data
you need to store, the primary database models may be the largest factor.
Can you use the service in your region? Check the regional availability for each Azure service. See Products
available by region.

Capability matrix
The following tables summarize the key differences in capabilities.
File storage capabilities
| Capability | Azure Data Lake Store | Azure Blob Storage containers |
| --- | --- | --- |
| Purpose | Optimized storage for big data analytics workloads | General purpose object store for a wide variety of storage scenarios |
| Use cases | Batch, streaming analytics, and machine learning data such as log files, IoT data, click streams, large datasets | Any type of text or binary data, such as application back end, backup data, media storage for streaming, and general purpose data |
| Structure | Hierarchical file system | Object store with flat namespace |
| Authentication | Based on Azure Active Directory identities | Based on shared secrets: Account Access Keys and Shared Access Signature Keys, and role-based access control (RBAC) |
| Authentication protocol | OAuth 2.0. Calls must contain a valid JWT (JSON web token) issued by Azure Active Directory | Hash-based message authentication code (HMAC). Calls must contain a Base64-encoded SHA-256 hash over a part of the HTTP request. |
| Authorization | POSIX access control lists (ACLs). ACLs based on Azure Active Directory identities can be set at the file and folder level. | For account-level authorization, use Account Access Keys. For account, container, or blob authorization, use Shared Access Signature Keys. |
| Auditing | Available | Available |
| Encryption at rest | Transparent, server side | Transparent, server side; client-side encryption |
| Developer SDKs | .NET, Java, Python, Node.js | .NET, Java, Python, Node.js, C++, Ruby |
| Analytics workload performance | Optimized performance for parallel analytics workloads; high throughput and IOPS | Not optimized for analytics workloads |
| Size limits | No limits on account sizes, file sizes, or number of files | Specific limits documented here |
| Geo-redundancy | Locally-redundant (LRS), globally redundant (GRS), read-access globally redundant (RA-GRS), zone-redundant (ZRS) | Locally redundant (LRS), globally redundant (GRS), read-access globally redundant (RA-GRS), zone-redundant (ZRS). See here for more information |

NoSQL database capabilities


| Capability | Azure Cosmos DB | HBase on HDInsight |
| --- | --- | --- |
| Primary database model | Document store, graph, key-value store, wide column store | Wide column store |
| Secondary indexes | Yes | No |
| SQL language support | Yes | Yes (using the Phoenix JDBC driver) |
| Consistency | Strong, bounded-staleness, session, consistent prefix, eventual | Strong |
| Native Azure Functions integration | Yes | No |
| Automatic global distribution | Yes | No; HBase cluster replication can be configured across regions with eventual consistency |
| Pricing model | Elastically scalable request units (RUs) charged per-second as needed, elastically scalable storage | Per-minute pricing for HDInsight cluster (horizontal scaling of nodes), storage |
Understand data store models
11/2/2020 • 12 minutes to read • Edit Online

Modern business systems manage increasingly large volumes of heterogeneous data. This heterogeneity means
that a single data store is usually not the best approach. Instead, it's often better to store different types of data in
different data stores, each focused toward a specific workload or usage pattern. The term polyglot persistence is
used to describe solutions that use a mix of data store technologies. Therefore, it's important to understand the
main storage models and their tradeoffs.
Selecting the right data store for your requirements is a key design decision. There are literally hundreds of
implementations to choose from among SQL and NoSQL databases. Data stores are often categorized by how
they structure data and the types of operations they support. This article describes several of the most common
storage models. Note that a particular data store technology may support multiple storage models. For example,
a relational database management system (RDBMS) may also support key/value or graph storage. In fact, there
is a general trend for so-called multi-model support, where a single database system supports several models.
But it's still useful to understand the different models at a high level.
Not all data stores in a given category provide the same feature-set. Most data stores provide server-side
functionality to query and process data. Sometimes this functionality is built into the data storage engine. In
other cases, the data storage and processing capabilities are separated, and there may be several options for
processing and analysis. Data stores also support different programmatic and management interfaces.
Generally, you should start by considering which storage model is best suited for your requirements. Then
consider a particular data store within that category, based on factors such as feature set, cost, and ease of
management.

Relational database management systems


Relational databases organize data as a series of two-dimensional tables with rows and columns. Most vendors
provide a dialect of the Structured Query Language (SQL) for retrieving and managing data. An RDBMS typically
implements a transactionally consistent mechanism that conforms to the ACID (Atomic, Consistent, Isolated,
Durable) model for updating information.
An RDBMS typically supports a schema-on-write model, where the data structure is defined ahead of time, and
all read or write operations must use the schema.
This model is very useful when strong consistency guarantees are important — where all changes are atomic,
and transactions always leave the data in a consistent state. However, an RDBMS generally can't scale out
horizontally without sharding the data in some way. Also, the data in an RDBMS must be normalized, which isn't
appropriate for every data set.
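For illustration, the following Scala sketch wraps two inserts in a single ACID transaction against Azure SQL Database over JDBC. The server, database, credentials, and the orders and order_lines tables are hypothetical, and the mssql-jdbc driver is assumed to be on the classpath.

```scala
// Hedged sketch: two related writes committed atomically, as an RDBMS transaction.
import java.sql.DriverManager

object RelationalTransactionExample {
  def main(args: Array[String]): Unit = {
    // Placeholder connection string for an Azure SQL Database.
    val url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>;" +
      "user=<user>;password=<password>;encrypt=true"
    val conn = DriverManager.getConnection(url)
    conn.setAutoCommit(false) // group the statements into one transaction

    try {
      val insertOrder = conn.prepareStatement("INSERT INTO orders (id, customer_id) VALUES (?, ?)")
      insertOrder.setInt(1, 1001)
      insertOrder.setInt(2, 7)
      insertOrder.executeUpdate()

      val insertLine = conn.prepareStatement("INSERT INTO order_lines (order_id, sku, qty) VALUES (?, ?, ?)")
      insertLine.setInt(1, 1001)
      insertLine.setString(2, "sku-1")
      insertLine.setInt(3, 2)
      insertLine.executeUpdate()

      conn.commit() // both rows become visible atomically
    } catch {
      case e: Exception =>
        conn.rollback() // leave the data in a consistent state on failure
        throw e
    } finally {
      conn.close()
    }
  }
}
```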
Azure services
Azure SQL Database | (Security Baseline)
Azure Database for MySQL | (Security Baseline)
Azure Database for PostgreSQL | (Security Baseline)
Azure Database for MariaDB | (Security Baseline)
Workload
Records are frequently created and updated.
Multiple operations have to be completed in a single transaction.
Relationships are enforced using database constraints.
Indexes are used to optimize query performance.
Data type
Data is highly normalized.
Database schemas are required and enforced.
Many-to-many relationships between data entities in the database.
Constraints are defined in the schema and imposed on any data in the database.
Data requires high integrity. Indexes and relationships need to be maintained accurately.
Data requires strong consistency. Transactions operate in a way that ensures all data are 100% consistent for
all users and processes.
Size of individual data entries is small to medium-sized.
Examples
Inventory management
Order management
Reporting database
Accounting

Key/value stores
A key/value store associates each data value with a unique key. Most key/value stores only support simple
query, insert, and delete operations. To modify a value (either partially or completely), an application must
overwrite the existing data for the entire value. In most implementations, reading or writing a single value is an
atomic operation.
An application can store arbitrary data as a set of values. Any schema information must be provided by the
application. The key/value store simply retrieves or stores the value by key.

Key/value stores are highly optimized for applications performing simple lookups, but are less suitable if you
need to query data across different key/value stores. Key/value stores are also not optimized for querying by
value.
A single key/value store can be extremely scalable, as the data store can easily distribute data across multiple
nodes on separate machines.
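As a rough illustration of these access patterns, the following Scala sketch uses the open-source Jedis client against an Azure Cache for Redis instance. The host name and access key are placeholders, and the value is an opaque JSON string whose schema only the application knows.

```scala
// Hedged sketch of key/value access with the Jedis client; connection details are placeholders.
import redis.clients.jedis.Jedis

object KeyValueExample {
  def main(args: Array[String]): Unit = {
    // Azure Cache for Redis accepts TLS connections on port 6380.
    val jedis = new Jedis("mycache.redis.cache.windows.net", 6380, true)
    jedis.auth("<access-key>")

    // The store treats the value as opaque; any schema lives in the application.
    jedis.set("session:42", """{"user":"alice","cart":["sku-1","sku-2"]}""")
    val session = jedis.get("session:42")
    println(session)

    jedis.close()
  }
}
```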
Azure services
Azure Cosmos DB Table API, etcd API (preview), and SQL API | (Cosmos DB Security Baseline)
Azure Cache for Redis | (Security Baseline)
Azure Table Storage | (Security Baseline)
Workload
Data is accessed using a single key, like a dictionary.
No joins, lock, or unions are required.
No aggregation mechanisms are used.
Secondary indexes are generally not used.
Data type
Each key is associated with a single value.
There is no schema enforcement.
No relationships between entities.
Examples
Data caching
Session management
User preference and profile management
Product recommendation and ad serving

Document databases
A document database stores a collection of documents, where each document consists of named fields and data.
The data can be simple values or complex elements such as lists and child collections. Documents are retrieved
by unique keys.
Typically, a document contains the data for a single entity, such as a customer or an order. A document may
contain information that would be spread across several relational tables in an RDBMS. Documents don't need to
have the same structure. Applications can store different data in documents as business requirements change.
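The sketch below stores one self-contained document using the azure-cosmos Java SDK (v4) from Scala. The endpoint, key, database, container, and the Order shape are illustrative assumptions, and the container is assumed to already exist with a partition key of /customerId.

```scala
// Hedged sketch of writing a document to Azure Cosmos DB (SQL API); names are placeholders.
import com.azure.cosmos.CosmosClientBuilder
import scala.beans.BeanProperty

// Plain bean so the SDK's JSON serializer can map it to a document.
class Order(@BeanProperty var id: String,
            @BeanProperty var customerId: String,
            @BeanProperty var total: Double) {
  def this() = this(null, null, 0.0)
}

object DocumentStoreExample {
  def main(args: Array[String]): Unit = {
    val client = new CosmosClientBuilder()
      .endpoint("https://<account>.documents.azure.com:443/")
      .key("<account-key>")
      .buildClient()

    val container = client.getDatabase("sales").getContainer("orders")

    // The whole entity is written as one document; no relational decomposition is needed.
    container.createItem(new Order("order-1001", "customer-7", 42.50))

    client.close()
  }
}
```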

Azure service
Azure Cosmos DB SQL API | (Cosmos DB Security Baseline)
Workload
Insert and update operations are common.
No object-relational impedance mismatch. Documents can better match the object structures used in
application code.
Individual documents are retrieved and written as a single block.
Data requires index on multiple fields.
Data type
Data can be managed in a de-normalized way.
Size of individual document data is relatively small.
Each document type can use its own schema.
Documents can include optional fields.
Document data is semi-structured, meaning that data types of each field are not strictly defined.
Examples
Product catalog
Content management
Inventory management

Graph databases
A graph database stores two types of information, nodes and edges. Edges specify relationships between nodes.
Nodes and edges can have properties that provide information about that node or edge, similar to columns in a
table. Edges can also have a direction indicating the nature of the relationship.
Graph databases can efficiently perform queries across the network of nodes and edges and analyze the
relationships between entities. The following diagram shows an organization's personnel database structured as
a graph. The entities are employees and departments, and the edges indicate reporting relationships and the
departments in which employees work.

This structure makes it straightforward to perform queries such as "Find all employees who report directly or
indirectly to Sarah" or "Who works in the same department as John?" For large graphs with lots of entities and
relationships, you can perform very complex analyses very quickly. Many graph databases provide a query
language that you can use to traverse a network of relationships efficiently.
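As a hedged illustration, the following Scala sketch submits a Gremlin traversal through the Apache TinkerPop gremlin-driver, which is one way to query a Gremlin-compatible endpoint such as the Azure Cosmos DB Gremlin API. The host, credential format, vertex labels, and edge names are placeholders, and provider-specific settings (such as the serializer) are omitted.

```scala
// Hedged sketch: submitting a Gremlin traversal over the TinkerPop driver.
import org.apache.tinkerpop.gremlin.driver.Cluster

object GraphQueryExample {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.build()
      .addContactPoint("<account>.gremlin.cosmos.azure.com") // placeholder endpoint
      .port(443)
      .enableSsl(true)
      .credentials("/dbs/<database>/colls/<graph>", "<account-key>")
      .create()

    val client = cluster.connect()

    // "Who works in the same department as John?" expressed as a traversal.
    val results = client
      .submit("g.V().has('employee','name','John').out('worksIn').in('worksIn').values('name')")
      .all()
      .get()

    results.forEach(r => println(r.getString()))

    client.close()
    cluster.close()
  }
}
```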
Azure services
Azure Cosmos DB Gremlin API | (Security Baseline)
SQL Server | (Security Baseline)
Workload
Complex relationships between data items involving many hops between related data items.
The relationships between data items are dynamic and change over time.
Relationships between objects are first-class citizens, without requiring foreign-keys and joins to traverse.
Data type
Nodes and relationships.
Nodes are similar to table rows or JSON documents.
Relationships are just as important as nodes, and are exposed directly in the query language.
Composite objects, such as a person with multiple phone numbers, tend to be broken into separate, smaller
nodes, combined with traversable relationships
Examples
Organization charts
Social graphs
Fraud detection
Recommendation engines

Data analytics
Data analytics stores provide massively parallel solutions for ingesting, storing, and analyzing data. The data is
distributed across multiple servers to maximize scalability. Large data file formats such as delimited files (CSV),
Parquet, and ORC are widely used in data analytics. Historical data is typically stored in data stores such as blob
storage or Azure Data Lake Storage Gen2, which are then accessed by Azure Synapse, Databricks, or HDInsight
as external tables. A typical scenario that uses data stored as Parquet files for performance is described in the article
Use external tables with Synapse SQL.
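For example, a Spark job (sketched here in Scala) can read Parquet files directly from Azure Data Lake Storage Gen2 and run an aggregation over them. The container, account, and path are placeholders, and the cluster is assumed to have access to the storage account.

```scala
// Minimal sketch of an analytics-style read and aggregation over Parquet files in ADLS Gen2.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object AnalyticsReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("AnalyticsReadExample").getOrCreate()

    // ADLS Gen2 paths use the abfss:// scheme: abfss://<container>@<account>.dfs.core.windows.net/<path>
    val sales = spark.read.parquet("abfss://data@mydatalake.dfs.core.windows.net/curated/sales/")

    // Typical aggregation over historical, denormalized data.
    sales.groupBy("region").agg(sum("amount").alias("total_amount")).show()

    spark.stop()
  }
}
```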
Azure services
Azure Synapse Analytics | (Security Baseline)
Azure Data Lake | (Security Baseline)
Azure Data Explorer | (Security Baseline)
Azure Analysis Services
HDInsight | (Security Baseline)
Azure Databricks | (Security Baseline)
Workload
Data analytics
Enterprise BI
Data type
Historical data from multiple sources.
Usually denormalized in a "star" or "snowflake" schema, consisting of fact and dimension tables.
Usually loaded with new data on a scheduled basis.
Dimension tables often include multiple historic versions of an entity, referred to as a slowly changing
dimension.
Examples
Enterprise data warehouse

Column-family databases
A column-family database organizes data into rows and columns. In its simplest form, a column-family database
can appear very similar to a relational database, at least conceptually. The real power of a column-family
database lies in its denormalized approach to structuring sparse data.
You can think of a column-family database as holding tabular data with rows and columns, but the columns are
divided into groups known as column families. Each column family holds a set of columns that are logically
related together and are typically retrieved or manipulated as a unit. Other data that is accessed separately can
be stored in separate column families. Within a column family, new columns can be added dynamically, and rows
can be sparse (that is, a row doesn't need to have a value for every column).
The following diagram shows an example with two column families, Identity and Contact Info . The data for a
single entity has the same row key in each column-family. This structure, where the rows for any given object in
a column family can vary dynamically, is an important benefit of the column-family approach, making this form
of data store highly suited for storing structured, volatile data.

Unlike a key/value store or a document database, most column-family databases store data in key order, rather
than by computing a hash. Many implementations allow you to create indexes over specific columns in a
column-family. Indexes let you retrieve data by column value, rather than row key.
Read and write operations for a row are usually atomic within a single column-family, although some
implementations provide atomicity across the entire row, spanning multiple column-families.
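The following Scala sketch uses the standard HBase client API to write and read cells grouped by column family, for example from an application running inside the same virtual network as an HBase on HDInsight cluster. The table name ("profiles") and column families ("identity", "contact") are placeholders and the table is assumed to already exist with those families.

```scala
// Hedged sketch of column-family access with the HBase client API; table and families are placeholders.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object ColumnFamilyExample {
  def main(args: Array[String]): Unit = {
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("profiles"))

    // Individual columns don't need to be declared up front; only the column families do.
    val put = new Put(Bytes.toBytes("row-42"))
    put.addColumn(Bytes.toBytes("identity"), Bytes.toBytes("firstName"), Bytes.toBytes("Ava"))
    put.addColumn(Bytes.toBytes("contact"), Bytes.toBytes("email"), Bytes.toBytes("ava@example.com"))
    table.put(put)

    // Read back a single cell from the identity column family.
    val result = table.get(new Get(Bytes.toBytes("row-42")))
    val firstName = Bytes.toString(result.getValue(Bytes.toBytes("identity"), Bytes.toBytes("firstName")))
    println(firstName)

    table.close()
    connection.close()
  }
}
```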
Azure services
Azure Cosmos DB Cassandra API | (Security Baseline)
HBase in HDInsight | (Security Baseline)
Workload
Most column-family databases perform write operations extremely quickly.
Update and delete operations are rare.
Designed to provide high throughput and low-latency access.
Supports easy query access to a particular set of fields within a much larger record.
Massively scalable.
Data type
Data is stored in tables consisting of a key column and one or more column families.
Specific columns can vary by individual rows.
Individual cells are accessed via get and put commands
Multiple rows are returned using a scan command.
Examples
Recommendations
Personalization
Sensor data
Telemetry
Messaging
Social media analytics
Web analytics
Activity monitoring
Weather and other time-series data
Search Engine Databases
A search engine database allows applications to search for information held in external data stores. A search
engine database can index massive volumes of data and provide near real-time access to these indexes.
Indexes can be multi-dimensional and may support free-text searches across large volumes of text data. Indexing
can be performed using a pull model, triggered by the search engine database, or using a push model, initiated
by external application code.
Searching can be exact or fuzzy. A fuzzy search finds documents that match a set of terms and calculates how
closely they match. Some search engines also support linguistic analysis that can return matches based on
synonyms, genre expansions (for example, matching dogs to pets ), and stemming (matching words with the
same root).
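The sketch below issues a free-text query with the azure-search-documents Java client from Scala. The service endpoint, query key, and the "products" index are placeholders; the index itself is assumed to already exist and be populated.

```scala
// Hedged sketch of querying a search index; endpoint, key, and index name are placeholders.
import com.azure.core.credential.AzureKeyCredential
import com.azure.search.documents.{SearchClientBuilder, SearchDocument}

object SearchQueryExample {
  def main(args: Array[String]): Unit = {
    val client = new SearchClientBuilder()
      .endpoint("https://<service-name>.search.windows.net")
      .credential(new AzureKeyCredential("<query-key>"))
      .indexName("products")
      .buildClient()

    // Free-text search over the indexed documents; each result carries a relevance score.
    client.search("wireless headphones").forEach { result =>
      println(result.getDocument(classOf[SearchDocument]))
    }
  }
}
```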
Azure service
Azure Search | (Security Baseline)
Workload
Data indexes from multiple sources and services.
Queries are ad-hoc and can be complex.
Full text search is required.
Ad hoc self-service query is required.
Data type
Semi-structured or unstructured text
Text with reference to structured data
Examples
Product catalogs
Site search
Logging

Time series databases


Time series data is a set of values organized by time. Time series databases typically collect large amounts of
data in real time from a large number of sources. Updates are rare, and deletes are often done as bulk
operations. Although the records written to a time-series database are generally small, there are often a large
number of records, and total data size can grow rapidly.
Azure service
Azure Time Series Insights
Workload
Records are generally appended sequentially in time order.
An overwhelming proportion of operations (95-99%) are writes.
Updates are rare.
Deletes occur in bulk, and are made to contiguous blocks of records.
Data is read sequentially in either ascending or descending time order, often in parallel.
Data type
A timestamp is used as the primary key and sorting mechanism.
Tags may define additional information about the type, origin, and other information about the entry.
Examples
Monitoring and event telemetry.
Sensor or other IoT data.

Object storage
Object storage is optimized for storing and retrieving large binary objects (images, files, video and audio
streams, large application data objects and documents, virtual machine disk images). Large data files are also
popularly used in this model, for example, delimited files (CSV), Parquet, and ORC. Object stores can manage
extremely large amounts of unstructured data.
Azure service
Azure Blob Storage | (Security Baseline)
Azure Data Lake Storage Gen2 | (Security Baseline)
Workload
Identified by key.
Content is typically an asset such as a delimited file, image, or video file.
Content must be durable and external to any application tier.
Data type
Data size is large.
Value is opaque.
Examples
Images, videos, office documents, PDFs
Static HTML, JSON, CSS
Log and audit files
Database backups

Shared files
Sometimes, using simple flat files can be the most effective means of storing and retrieving information. Using
file shares enables files to be accessed across a network. Given appropriate security and concurrent access
control mechanisms, sharing data in this way can enable distributed services to provide highly scalable data
access for performing basic, low-level operations such as simple read and write requests.
Azure service
Azure Files | (Security Baseline)
Workload
Migration from existing apps that interact with the file system.
Requires SMB interface.
Data type
Files in a hierarchical set of folders.
Accessible with standard I/O libraries.
Examples
Legacy files
Shared content accessible among a number of VMs or app instances
With this understanding of the different data storage models, the next step is to evaluate your workload and
application, and decide which data store will meet your specific needs. Use the data storage decision tree to help
with this process.
Choosing a data pipeline orchestration technology in
Azure
11/2/2020 • 2 minutes to read • Edit Online

Most big data solutions consist of repeated data processing operations, encapsulated in workflows. A pipeline
orchestrator is a tool that helps to automate these workflows. An orchestrator can schedule jobs, execute
workflows, and coordinate dependencies among tasks.

What are your options for data pipeline orchestration?


In Azure, the following services and tools will meet the core requirements for pipeline orchestration, control flow,
and data movement:
Azure Data Factory
Oozie on HDInsight
SQL Server Integration Services (SSIS)
These services and tools can be used independently from one another, or used together to create a hybrid
solution. For example, the Integration Runtime (IR) in Azure Data Factory V2 can natively execute SSIS packages in
a managed Azure compute environment. While there is some overlap in functionality between these services,
there are a few key differences.

Key Selection Criteria


To narrow the choices, start by answering these questions:
Do you need big data capabilities for moving and transforming your data? Usually this means multi-
gigabytes to terabytes of data. If yes, then narrow your options to those best suited for big data.
Do you require a managed service that can operate at scale? If yes, select one of the cloud-based services
that aren't limited by your local processing power.
Are some of your data sources located on-premises? If yes, look for options that can work with both cloud
and on-premises data sources or destinations.
Is your source data stored in Blob storage or an HDFS filesystem? If so, choose an option that supports
Hive queries.

Capability matrix
The following tables summarize the key differences in capabilities.
General capabilities
| Capability | Azure Data Factory | SQL Server Integration Services (SSIS) | Oozie on HDInsight |
| --- | --- | --- | --- |
| Managed | Yes | No | Yes |
| Cloud-based | Yes | No (local) | Yes |
| Prerequisite | Azure Subscription | SQL Server | Azure Subscription, HDInsight cluster |
| Management tools | Azure Portal, PowerShell, CLI, .NET SDK | SSMS, PowerShell | Bash shell, Oozie REST API, Oozie web UI |
| Pricing | Pay per usage | Licensing / pay for features | No additional charge on top of running the HDInsight cluster |

Pipeline capabilities
| Capability | Azure Data Factory | SQL Server Integration Services (SSIS) | Oozie on HDInsight |
| --- | --- | --- | --- |
| Copy data | Yes | Yes | Yes |
| Custom transformations | Yes | Yes | Yes (MapReduce, Pig, and Hive jobs) |
| Azure Machine Learning scoring | Yes | Yes (with scripting) | No |
| HDInsight On-Demand | Yes | No | No |
| Azure Batch | Yes | No | No |
| Pig, Hive, MapReduce | Yes | No | Yes |
| Spark | Yes | No | No |
| Execute SSIS Package | Yes | Yes | No |
| Control flow | Yes | Yes | Yes |
| Access on-premises data | Yes | Yes | No |

Scalability capabilities
| Capability | Azure Data Factory | SQL Server Integration Services (SSIS) | Oozie on HDInsight |
| --- | --- | --- | --- |
| Scale up | Yes | No | No |
| Scale out | Yes | No | Yes (by adding worker nodes to cluster) |
| Optimized for big data | Yes | No | Yes |


Choosing a real-time message ingestion technology
in Azure
11/2/2020 • 2 minutes to read • Edit Online

Real time processing deals with streams of data that are captured in real-time and processed with minimal latency.
Many real-time processing solutions need a message ingestion store to act as a buffer for messages, and to
support scale-out processing, reliable delivery, and other message queuing semantics.

What are your options for real-time message ingestion?


Azure Event Hubs
Azure IoT Hub
Kafka on HDInsight

Azure Event Hubs


Azure Event Hubs is a highly scalable data streaming platform and event ingestion service, capable of receiving
and processing millions of events per second. Event Hubs can process and store events, data, or telemetry
produced by distributed software and devices. Data sent to an event hub can be transformed and stored using any
real-time analytics provider or batching/storage adapters. Event Hubs provides publish-subscribe capabilities with
low latency at massive scale, which makes it appropriate for big data scenarios.
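As a hedged sketch, the following Scala snippet publishes a small batch of JSON events with the azure-messaging-eventhubs Java client. The connection string and event hub name are placeholders.

```scala
// Hedged sketch of sending events to Azure Event Hubs; connection details are placeholders.
import com.azure.messaging.eventhubs.{EventData, EventHubClientBuilder}

object EventHubsProducerExample {
  def main(args: Array[String]): Unit = {
    val producer = new EventHubClientBuilder()
      .connectionString("<event-hubs-namespace-connection-string>", "<event-hub-name>")
      .buildProducerClient()

    // Events are opaque payloads; JSON is a common choice for downstream analytics.
    val events = java.util.List.of(
      new EventData("""{"deviceId":"sensor-1","temperature":21.7}"""),
      new EventData("""{"deviceId":"sensor-2","temperature":19.4}""")
    )
    producer.send(events)
    producer.close()
  }
}
```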

Azure IoT Hub


Azure IoT Hub is a managed service that enables reliable and secure bidirectional communications between
millions of IoT devices and a cloud-based back end.
Features of IoT Hub include:
Multiple options for device-to-cloud and cloud-to-device communication. These options include one-way
messaging, file transfer, and request-reply methods.
Message routing to other Azure services.
Queryable store for device metadata and synchronized state information.
Secure communications and access control using per-device security keys or X.509 certificates.
Monitoring of device connectivity and device identity management events.
In terms of message ingestion, IoT Hub is similar to Event Hubs. However, it was specifically designed for
managing IoT device connectivity, not just message ingestion. For more information, see Comparison of Azure IoT
Hub and Azure Event Hubs.

Kafka on HDInsight
Apache Kafka is an open-source distributed streaming platform that can be used to build real-time data pipelines
and streaming applications. Kafka also provides message broker functionality similar to a message queue, where
you can publish and subscribe to named data streams. It is horizontally scalable, fault-tolerant, and extremely fast.
Kafka on HDInsight provides Kafka as a managed, highly scalable, and highly available service in Azure.
Some common use cases for Kafka are:
Messaging . Because it supports the publish-subscribe message pattern, Kafka is often used as a message
broker.
Activity tracking . Because Kafka provides in-order logging of records, it can be used to track and re-create
activities, such as user actions on a web site.
Aggregation . Using stream processing, you can aggregate information from different streams to combine
and centralize the information into operational data.
Transformation . Using stream processing, you can combine and enrich data from multiple input topics into
one or more output topics.
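For example, a Spark Structured Streaming job (one of the stream processing options discussed later) can consume a Kafka topic with a few lines of Scala. The broker addresses and the "clickstream" topic name below are placeholders, and the spark-sql-kafka connector is assumed to be available to the cluster.

```scala
// Minimal sketch of consuming a Kafka topic with Spark Structured Streaming.
import org.apache.spark.sql.SparkSession

object KafkaStreamExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("KafkaStreamExample").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") // placeholder brokers
      .option("subscribe", "clickstream")                             // placeholder topic
      .load()

    // Kafka records arrive as binary key/value columns; cast them to strings for processing.
    val messages = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    val query = messages.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```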

Key selection criteria


To narrow the choices, start by answering these questions:
Do you need two-way communication between your IoT devices and Azure? If so, choose IoT Hub.
Do you need to manage access for individual devices and be able to revoke access to a specific device? If
yes, choose IoT Hub.

Capability matrix
The following tables summarize the key differences in capabilities.

| Capability | IoT Hub | Event Hubs | Kafka on HDInsight |
| --- | --- | --- | --- |
| Cloud-to-device communications | Yes | No | No |
| Device-initiated file upload | Yes | No | No |
| Device state information | Device twins | No | No |
| Protocol support | MQTT, AMQP, HTTPS [1] | AMQP, HTTPS, Kafka Protocol | Kafka Protocol |
| Security | Per-device identity; revocable access control | Shared access policies; limited revocation through publisher policies | Authentication using SASL; pluggable authorization; integration with external authentication services supported |

[1] You can also use Azure IoT protocol gateway as a custom gateway to enable protocol adaptation for IoT Hub.
For more information, see Comparison of Azure IoT Hub and Azure Event Hubs.
Choosing a search data store in Azure
11/13/2020 • 2 minutes to read • Edit Online

This article compares technology choices for search data stores in Azure. A search data store is used to create and
store specialized indexes for performing searches on free-form text. The text that is indexed may reside in a
separate data store, such as blob storage. An application submits a query to the search data store, and the result is
a list of matching documents. For more information about this scenario, see Processing free-form text for search.

What are your options when choosing a search data store?


In Azure, all of the following data stores will meet the core requirements for search against free-form text data by
providing a search index:
Azure Search
using Cosmos DB
using Azure SQL Database
Elasticsearch
HDInsight with Solr
Azure SQL Database with full text search

Key selection criteria


For search scenarios, begin choosing the appropriate search data store for your needs by answering these
questions:
Do you want a managed service rather than managing your own servers?
Can you specify your index schema at design time? If not, choose an option that supports updateable
schemas.
Do you need an index only for full-text search, or do you also need rapid aggregation of numeric data and
other analytics? If you need functionality beyond full-text search, consider options that support additional
analytics.
Do you need a search index for log analytics, with support for log collection, aggregation, and visualizations
on indexed data? If so, consider Elasticsearch, which is part of a log analytics stack.
Do you need to index data in common document formats such as PDF, Word, PowerPoint, and Excel? If yes,
choose an option that provides document indexers.
Does your database have specific security needs? If yes, consider the security features listed below.

Capability matrix
The following tables summarize the key differences in capabilities.
General capabilities
| Capability | Azure Search | Elasticsearch | HDInsight with Solr | SQL Database |
| --- | --- | --- | --- | --- |
| Is managed service | Yes | No | Yes | Yes |
| REST API | Yes | Yes | Yes | No |
| Programmability | .NET, Java, Python | Java | Java | T-SQL |
| Document indexers for common file types (PDF, DOCX, TXT, and so on) | Yes | No | Yes | No |

Manageability capabilities
| Capability | Azure Search | Elasticsearch | HDInsight with Solr | SQL Database |
| --- | --- | --- | --- | --- |
| Updateable schema | No | Yes | Yes | Yes |
| Supports scale out | Yes | Yes | Yes | No |

Analytic workload capabilities


| Capability | Azure Search | Elasticsearch | HDInsight with Solr | SQL Database |
| --- | --- | --- | --- | --- |
| Supports analytics beyond full text search | No | Yes | Yes | Yes |
| Part of a log analytics stack | No | Yes (ELK) | No | No |
| Supports semantic search | Yes (find similar documents only) | Yes | Yes | Yes |

Security capabilities
| Capability | Azure Search | Elasticsearch | HDInsight with Solr | SQL Database |
| --- | --- | --- | --- | --- |
| Row-level security | Partial (requires application query to filter by group id) | Partial (requires application query to filter by group id) | Yes | Yes |
| Transparent data encryption | Yes | No | No | Yes |
| Restrict access to specific IP addresses | Yes | Yes | Yes | Yes |
| Restrict access to allow virtual network access only | Yes | Yes | Yes | Yes |
| Active Directory authentication (integrated authentication) | No | No | No | Yes |

Performance Analysis
The file movies.json contains the data used in this section to compare the performance of the search technologies
previously mentioned. The dataset weighs around 5 MB and includes almost 27,000 documents.
Note that the following figures are only for guidance and were obtained using a default deployment, without any
manual tuning or scaling.
Performance for Elasticsearch and HDInsight with Solr, which are based on the Lucene engine, should be similar to
Azure Search.
Performance Comparison
| Capability | Azure Search & Cosmos DB | Azure Search & SQL Database | Azure SQL Database full text search |
| --- | --- | --- | --- |
| Indexation | ~1 min | ~45 secs | ~5 mins |
| Search | ~0.2 secs | ~0.2 secs | ~0.5 secs |

See also
Processing free-form text for search
Choosing a stream processing technology in Azure
11/2/2020 • 2 minutes to read • Edit Online

This article compares technology choices for real-time stream processing in Azure.
Real-time stream processing consumes messages from either queue or file-based storage, processes the messages,
and forwards the result to another message queue, file store, or database. Processing may include querying,
filtering, and aggregating messages. Stream processing engines must be able to consume endless streams of
data and produce results with minimal latency. For more information, see Real time processing.

What are your options when choosing a technology for real-time


processing?
In Azure, all of the following data stores will meet the core requirements supporting real-time processing:
Azure Stream Analytics
HDInsight with Spark Streaming
Apache Spark in Azure Databricks
HDInsight with Storm
Azure Functions
Azure App Service WebJobs
Apache Kafka streams API

Key Selection Criteria


For real-time processing scenarios, begin choosing the appropriate service for your needs by answering these
questions:
Do you prefer a declarative or imperative approach to authoring stream processing logic?
Do you need built-in support for temporal processing or windowing?
Does your data arrive in formats besides Avro, JSON, or CSV? If yes, consider options that support any format
using custom code.
Do you need to scale your processing beyond 1 GB/s? If yes, consider the options that scale with the cluster
size.

Capability matrix
The following tables summarize the key differences in capabilities.
General capabilities
| Capability | Azure Stream Analytics | HDInsight with Spark Streaming | Apache Spark in Azure Databricks | HDInsight with Storm | Azure Functions | Azure App Service WebJobs |
| --- | --- | --- | --- | --- | --- | --- |
| Programmability | Stream analytics query language, JavaScript | C#/F#, Java, Python, Scala | C#/F#, Java, Python, R, Scala | C#, Java | C#, F#, Java, Node.js, Python | C#, Java, Node.js, PHP, Python |
| Programming paradigm | Declarative | Mixture of declarative and imperative | Mixture of declarative and imperative | Imperative | Imperative | Imperative |
| Pricing model | Streaming units | Per cluster hour | Databricks units | Per cluster hour | Per function execution and resource consumption | Per app service plan hour |

Integration capabilities
| Capability | Azure Stream Analytics | HDInsight with Spark Streaming | Apache Spark in Azure Databricks | HDInsight with Storm | Azure Functions | Azure App Service WebJobs |
| --- | --- | --- | --- | --- | --- | --- |
| Inputs | Azure Event Hubs, Azure IoT Hub, Azure Blob storage | Event Hubs, IoT Hub, Kafka, HDFS, Storage Blobs, Azure Data Lake Store | Event Hubs, IoT Hub, Kafka, HDFS, Storage Blobs, Azure Data Lake Store | Event Hubs, IoT Hub, Storage Blobs, Azure Data Lake Store | Supported bindings | Service Bus, Storage Queues, Storage Blobs, Event Hubs, WebHooks, Cosmos DB, Files |
| Sinks | Azure Data Lake Store, Azure SQL Database, Storage Blobs, Event Hubs, Power BI, Table Storage, Service Bus Queues, Service Bus Topics, Cosmos DB, Azure Functions | HDFS, Kafka, Storage Blobs, Azure Data Lake Store, Cosmos DB | HDFS, Kafka, Storage Blobs, Azure Data Lake Store, Cosmos DB | Event Hubs, Service Bus, Kafka | Supported bindings | Service Bus, Storage Queues, Storage Blobs, Event Hubs, WebHooks, Cosmos DB, Files |

Processing capabilities
| Capability | Azure Stream Analytics | HDInsight with Spark Streaming | Apache Spark in Azure Databricks | HDInsight with Storm | Azure Functions | Azure App Service WebJobs |
| --- | --- | --- | --- | --- | --- | --- |
| Built-in temporal/windowing support | Yes | Yes | Yes | Yes | No | No |
| Input data formats | Avro, JSON or CSV, UTF-8 encoded | Any format using custom code | Any format using custom code | Any format using custom code | Any format using custom code | Any format using custom code |
| Scalability | Query partitions | Bounded by cluster size | Bounded by Databricks cluster scale configuration | Bounded by cluster size | Up to 200 function app instances processing in parallel | Bounded by app service plan capacity |
| Late arrival and out of order event handling support | Yes | Yes | Yes | Yes | No | No |

See also:
Choosing a real-time message ingestion technology
Real time processing
Tenancy models for SaaS applications
11/2/2020 • 3 minutes to read • Edit Online

In a Software as a Service (SaaS) model, each of your customers is a tenant of your application. Each tenant pays
for access to the SaaS application by paying a subscription fee. This article describes the application tenancy
models available to SaaS application builders.
When designing a SaaS application, you must choose the application tenancy model that best fits the needs of your
customers and your business. In general, the application tenancy model doesn't impact the functionality of an
application. But it likely impacts other aspects of the overall solution, including scale, tenant isolation, cost per
tenant, and operational complexity.

What are application tenancy models


The sections below explore the concepts of single, mixed, and multi-tenant application models.
Single tenant

In the single tenant model, a single dedicated instance of an application is deployed for each customer. For
example, with an N-tier architecture style application, all customers get a new dedicated instance of the web, middle,
and data tiers. These tiers are not shared between these customers.
Mixed tenant
In this model, one or more parts of an application are deployed as dedicated for each customer, and the rest is
shared between all customers. For example, with an N-tier architecture style application, the web and middle tiers are
shared between all customers. However, a dedicated data tier and database is provisioned for each customer.
Multi-tenant

In this model, a single instance of the application is deployed for all customers and shared amongst them. For
example, with an N-tier architecture style application, the web, middle, and data tiers are shared between all
customers.
A combination of these models can be provided for customers with different needs. For example, your basic tier of
service would run on a shared multi-tenant instance of your application. As a baseline, your customers can access
your app with lower performance or limited functionality for a lower cost. On top of this baseline, a dedicated
service tier could run on a single-tenant model. For customers that need higher performance or additional
functionality, you can provide an isolated instance of your application for a higher cost.

Compare application tenancy models


In general, the tenancy model doesn't impact the functionality of an application, but it likely impacts other aspects
of the overall solution. The following table summarizes the differences between the application tenancy models:

| Measurement | Single tenant | Mixed | Multi-tenant |
| --- | --- | --- | --- |
| Scale | Medium | High | Very High |
| Tenant isolation | Very High | High | Low |
| Cost per tenant | High | Medium | Low |
| Operational complexity | Low-High. Individually simple, complex at scale. | Low-Medium. Need to address complexity at scale. | Low-High. Individual tenant management is complex. |

The measurement terms are as detailed below:


Scale: The number of concurrent customers (or tenants) your application can service.
Tenant isolation: The degree to which customers' data and performance are isolated from other customers.
Cost per tenant: The relative amount it costs to run a single customer's tenant.
Operational complexity: How complex it is to achieve operational tasks like automation, security, monitoring,
maintenance, metering, application deployments, disaster recovery/BCP.
In general, the single-tenant application model with a small number of tenants is simpler to manage. With a
large number of tenants, it can become highly complex without strong operational excellence practices. Conversely,
in the multi-tenant application model, it is easier to manage the system as a whole, but highly complex to manage
an individual customer tenant by itself without strong operational excellence practices.

Additional considerations
Once you know the customer's needs and business goals, start asking these questions:
Do I operate in a highly regulated industry that requires customers' data to be isolated from other customers?
Am I looking to rapidly scale my application to many thousands of clients?
Am I concerned about how much it costs to run each tenant/customer instance?
The answers to these questions will help you refine your tenancy requirement.

Next steps
Read more about application tenancy patterns in Multi-tenant SaaS database tenancy patterns.
Monitoring Azure Databricks
11/2/2020 • 2 minutes to read • Edit Online

Azure Databricks is a fast, powerful Apache Spark–based analytics service that makes it easy to rapidly develop
and deploy big data analytics and artificial intelligence (AI) solutions. Many users take advantage of the simplicity
of notebooks in their Azure Databricks solutions. For users that require more robust computing options, Azure
Databricks supports the distributed execution of custom application code.
Monitoring is a critical part of any production-level solution, and Azure Databricks offers robust functionality for
monitoring custom application metrics, streaming query events, and application log messages. Azure Databricks
can send this monitoring data to different logging services.
The following articles show how to send monitoring data from Azure Databricks to Azure Monitor, the monitoring
data platform for Azure.
Send Azure Databricks application logs to Azure Monitor
Use dashboards to visualize Azure Databricks metrics
Troubleshoot performance bottlenecks
The code library that accompanies these articles extends the core monitoring functionality of Azure Databricks to
send Spark metrics, events, and logging information to Azure Monitor.
The audience for these articles and the accompanying code library are Apache Spark and Azure Databricks
solution developers. The code must be built into Java Archive (JAR) files and then deployed to an Azure Databricks
cluster. The code is a combination of Scala and Java, with a corresponding set of Maven project object model
(POM) files to build the output JAR files. Understanding of Java, Scala, and Maven are recommended as
prerequisites.

Next steps
Start by building the code library and deploying it to your Azure Databricks cluster.
Send Azure Databricks application logs to Azure Monitor
Send Azure Databricks application logs to Azure
Monitor
11/2/2020 • 2 minutes to read • Edit Online

This article shows how to send application logs and metrics from Azure Databricks to a Log Analytics workspace. It
uses the Azure Databricks Monitoring Library, which is available on GitHub.

Prerequisites
Configure your Azure Databricks cluster to use the monitoring library, as described in the GitHub readme.

NOTE
The monitoring library streams Apache Spark level events and Spark Structured Streaming metrics from your jobs to Azure
Monitor. You don't need to make any changes to your application code for these events and metrics.

Send application metrics using Dropwizard


Spark uses a configurable metrics system based on the Dropwizard Metrics Library. For more information, see
Metrics in the Spark documentation.
To send application metrics from Azure Databricks application code to Azure Monitor, follow these steps:
1. Build the spark-listeners-loganalytics-1.0-SNAPSHOT.jar JAR file as described in the GitHub readme.
2. Create Dropwizard gauges or counters in your application code. You can use the UserMetricsSystem class
defined in the monitoring library. The following example creates a counter named counter1 .

import org.apache.spark.metrics.UserMetricsSystems
import org.apache.spark.sql.SparkSession

object StreamingQueryListenerSampleJob {

  private final val METRICS_NAMESPACE = "samplejob"
  private final val COUNTER_NAME = "counter1"

  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder
      .getOrCreate

    val driverMetricsSystem = UserMetricsSystems
      .getMetricSystem(METRICS_NAMESPACE, builder => {
        builder.registerCounter(COUNTER_NAME)
      })

    driverMetricsSystem.counter(COUNTER_NAME).inc(5)
  }
}

The monitoring library includes a sample application that demonstrates how to use the UserMetricsSystem
class.
Send application logs using Log4j
To send your Azure Databricks application logs to Azure Log Analytics using the Log4j appender in the library,
follow these steps:
1. Build the spark-listeners-1.0-SNAPSHOT.jar and the spark-listeners-loganalytics-1.0-
SNAPSHOT.jar JAR file as described in the GitHub readme.
2. Create a log4j.properties configuration file for your application. Include the following configuration
properties. Substitute your application package name and log level where indicated:

log4j.appender.A1=com.microsoft.pnp.logging.loganalytics.LogAnalyticsAppender
log4j.appender.A1.layout=com.microsoft.pnp.logging.JSONLayout
log4j.appender.A1.layout.LocationInfo=false
log4j.additivity.<your application package name>=false
log4j.logger.<your application package name>=<log level>, A1

You can find a sample configuration file here.


3. In your application code, include the spark-listeners-loganalytics project, and import
com.microsoft.pnp.logging.Log4jConfiguration to your application code.

import com.microsoft.pnp.logging.Log4jConfiguration

4. Configure Log4j using the log4j.properties file you created in step 2:

getClass.getResourceAsStream("<path to file in your JAR file>/log4j.properties") {
  stream => {
    Log4jConfiguration.configure(stream)
  }
}

5. Add Apache Spark log messages at the appropriate level in your code as required. For example, use the
logDebug method to send a debug log message. For more information, see Logging in the Spark
documentation.

logTrace("Trace message")
logDebug("Debug message")
logInfo("Info message")
logWarning("Warning message")
logError("Error message")

Run the sample application


The monitoring library includes a sample application that demonstrates how to send both application metrics and
application logs to Azure Monitor. To run the sample:
1. Build the spark-jobs project in the monitoring library, as described in the GitHub readme.
2. Navigate to your Databricks workspace and create a new job, as described here.
3. In the job detail page, select Set JAR .
4. Upload the JAR file from /src/spark-jobs/target/spark-jobs-1.0-SNAPSHOT.jar .
5. For Main class , enter com.microsoft.pnp.samplejob.StreamingQueryListenerSampleJob .
6. Select a cluster that is already configured to use the monitoring library. See Configure Azure Databricks to
send metrics to Azure Monitor.
When the job runs, you can view the application logs and metrics in your Log Analytics workspace.
Application logs appear under SparkLoggingEvent_CL:

SparkLoggingEvent_CL | where logger_name_s contains "com.microsoft.pnp"

Application metrics appear under SparkMetric_CL:

SparkMetric_CL | where name_s contains "rowcounter" | limit 50

IMPORTANT
After you verify the metrics appear, stop the sample application job.

Next steps
Deploy the performance monitoring dashboard that accompanies this code library to troubleshoot performance
issues in your production Azure Databricks workloads.
Use dashboards to visualize Azure Databricks metrics
Use dashboards to visualize Azure Databricks metrics
11/2/2020 • 7 minutes to read • Edit Online

This article shows how to set up a Grafana dashboard to monitor Azure Databricks jobs for performance issues.
Azure Databricks is a fast, powerful, and collaborative Apache Spark–based analytics service that makes it easy to
rapidly develop and deploy big data analytics and artificial intelligence (AI) solutions. Monitoring is a critical
component of operating Azure Databricks workloads in production. The first step is to gather metrics into a
workspace for analysis. In Azure, the best solution for managing log data is Azure Monitor. Azure Databricks does
not natively support sending log data to Azure Monitor, but a library for this functionality is available in GitHub.
This library enables logging of Azure Databricks service metrics as well as Apache Spark Structured Streaming
query event metrics. Once you've successfully deployed this library to an Azure Databricks cluster, you can deploy
a set of Grafana dashboards as part of your production environment.

Prerequisites
Configure your Azure Databricks cluster to use the monitoring library, as described in the GitHub readme.

Deploy the Azure Log Analytics workspace


To deploy the Azure Log Analytics workspace, follow these steps:
1. Navigate to the /perftools/deployment/loganalytics directory.
2. Deploy the logAnalyticsDeploy.json Azure Resource Manager template. For more information about
deploying Resource Manager templates, see Deploy resources with Resource Manager templates and Azure
CLI. The template has the following parameters:
location : The region where the Log Analytics workspace and dashboards are deployed.
serviceTier: The workspace pricing tier. See here for a list of valid values.
dataRetention (optional): The number of days the log data is retained in the Log Analytics workspace.
The default value is 30 days. If the pricing tier is Free , the data retention must be seven days.
workspaceName (optional): A name for the workspace. If not specified, the template generates a name.

az group deployment create --resource-group <resource-group-name> --template-file logAnalyticsDeploy.json --parameters location='East US' serviceTier='Standalone'

This template creates the workspace and also creates a set of predefined queries that are used by dashboard.

Deploy Grafana in a virtual machine


Grafana is an open source project you can deploy to visualize the time series metrics stored in your Azure Log
Analytics workspace using the Grafana plugin for Azure Monitor. Grafana executes on a virtual machine (VM) and
requires a storage account, virtual network, and other resources. To deploy a virtual machine with the bitnami-
certified Grafana image and associated resources, follow these steps:
1. Use the Azure CLI to accept the Azure Marketplace image terms for Grafana.

az vm image accept-terms --publisher bitnami --offer grafana --plan default

2. Navigate to the /spark-monitoring/perftools/deployment/grafana directory in your local copy of the GitHub repo.
3. Deploy the grafanaDeploy.json Resource Manager template as follows:

export DATA_SOURCE="https://raw.githubusercontent.com/mspnp/spark-monitoring/master/perftools/deployment/grafana/AzureDataSource.sh"
az group deployment create \
--resource-group <resource-group-name> \
--template-file grafanaDeploy.json \
--parameters adminPass='<vm password>' dataSource=$DATA_SOURCE

Once the deployment is complete, the bitnami image of Grafana is installed on the virtual machine.

Update the Grafana password


As part of the setup process, the Grafana installation script outputs a temporary password for the admin user. You
need this temporary password to sign in. To obtain the temporary password, follow these steps:
1. Log in to the Azure portal.
2. Select the resource group where the resources were deployed.
3. Select the VM where Grafana was installed. If you used the default parameter name in the deployment
template, the VM name is prefaced with sparkmonitoring-vm-grafana .
4. In the Suppor t + troubleshooting section, click Boot diagnostics to open the boot diagnostics page.
5. Click Serial log on the boot diagnostics page.
6. Search for the following string: "Setting Bitnami application password to".
7. Copy the password to a safe location.
Next, change the Grafana administrator password by following these steps:
1. In the Azure portal, select the VM and click Over view .
2. Copy the public IP address.
3. Open a web browser and navigate to the following URL: http://<IP address>:3000 .
4. At the Grafana login screen, enter admin for the user name, and use the Grafana password from the previous
steps.
5. Once logged in, select Configuration (the gear icon).
6. Select Ser ver Admin .
7. On the Users tab, select the admin login.
8. Update the password.

Create an Azure Monitor data source


1. Create a service principal that allows Grafana to manage access to your Log Analytics workspace. For more
information, see Create an Azure service principal with Azure CLI

az ad sp create-for-rbac --name http://<service principal name> --role "Log Analytics Reader"

2. Note the values for appId, password, and tenant in the output from this command:

{
"appId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
"displayName": "azure-cli-2019-03-27-00-33-39",
"name": "http://<service principal name>",
"password": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
"tenant": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
}

3. Log into Grafana as described earlier. Select Configuration (the gear icon) and then Data Sources .
4. In the Data Sources tab, click Add data source .
5. Select Azure Monitor as the data source type.
6. In the Settings section, enter a name for the data source in the Name textbox.
7. In the Azure Monitor API Details section, enter the following information:
Subscription Id: Your Azure subscription ID.
Tenant Id: The tenant ID from earlier.
Client Id: The value of "appId" from earlier.
Client Secret: The value of "password" from earlier.
8. In the Azure Log Analytics API Details section, check the Same Details as Azure Monitor API
checkbox.
9. Click Save & Test . If the Log Analytics data source is correctly configured, a success message is displayed.

Create the dashboard


Create the dashboards in Grafana by following these steps:
1. Navigate to the /perftools/dashboards/grafana directory in your local copy of the GitHub repo.
2. Run the following script:
export WORKSPACE=<your Azure Log Analytics workspace ID>
export LOGTYPE=SparkListenerEvent_CL

sh DashGen.sh

The output from the script is a file named SparkMonitoringDash.json .


3. Return to the Grafana dashboard and select Create (the plus icon).
4. Select Impor t .
5. Click Upload .json File .
6. Select the SparkMonitoringDash.json file created in step 2.
7. In the Options section, under ALA, select the Azure Monitor data source created earlier.
8. Click Impor t .

Visualizations in the dashboards


Both the Azure Log Analytics and Grafana dashboards include a set of time-series visualizations. Each graph is
a time-series plot of metric data related to an Apache Spark job, stages of the job, and tasks that make up each stage.
The visualizations are:
Job latency
This visualization shows execution latency for a job, which is a coarse view on the overall performance of a job.
It displays the job execution duration from start to completion. Note that the job start time is not the same as the job
submission time. Latency is represented as percentiles (10%, 30%, 50%, 90%) of job execution indexed by cluster
ID and application ID.
Stage latency
The visualization shows the latency of each stage per cluster, per application, and per individual stage. This
visualization is useful for identifying a particular stage that is running slowly.
Task latency
This visualization shows task execution latency. Latency is represented as a percentile of task execution per cluster,
stage name, and application.
Sum Task Execution per host
This visualization shows the sum of task execution latency per host running on a cluster. Viewing task execution
latency per host identifies hosts that have much higher overall task latency than other hosts. This may mean that
tasks have been inefficiently or unevenly distributed to hosts.
Task metrics
This visualization shows a set of the execution metrics for a given task's execution. These metrics include the size
and duration of a data shuffle, duration of serialization and deserialization operations, and others. For the full set of
metrics, view the Log Analytics query for the panel. This visualization is useful for understanding the operations
that make up a task and identifying resource consumption of each operation. Spikes in the graph represent costly
operations that should be investigated.
Cluster throughput
This visualization is a high-level view of work items indexed by cluster and application to represent the amount of
work done per cluster and application. It shows the number of jobs, tasks, and stages completed per cluster,
application, and stage in one minute increments.
Streaming Throughput/Latency
This visualization is related to the metrics associated with a structured streaming query. The graph shows the
number of input rows per second and the number of rows processed per second. The streaming metrics are also
represented per application. These metrics are sent when the OnQueryProgress event is generated as the
structured streaming query is processed and the visualization represents streaming latency as the amount of time,
in milliseconds, taken to execute a query batch.
Resource consumption per executor
Next is a set of visualizations for the dashboard that show a particular type of resource and how it is consumed
per executor on each cluster. These visualizations help identify outliers in resource consumption per executor. For
example, if the work allocation for a particular executor is skewed, resource consumption will be elevated in
relation to other executors running on the cluster. This can be identified by spikes in the resource consumption for
an executor.
Executor compute time metrics
Next is a set of visualizations for the dashboard that show the ratio of executor serialize time, deserialize time, CPU
time, and Java virtual machine time to overall executor compute time. This demonstrates visually how much each
of these four metrics is contributing to overall executor processing.
Shuffle metrics
The final set of visualizations shows the data shuffle metrics associated with a structured streaming query across
all executors. These include shuffle bytes read, shuffle bytes written, shuffle memory, and disk usage in queries
where the file system is used.

Next steps
Troubleshoot performance bottlenecks
Troubleshoot performance bottlenecks in Azure
Databricks

This article describes how to use monitoring dashboards to find performance bottlenecks in Spark jobs on Azure
Databricks.
Azure Databricks is an Apache Spark–based analytics service that makes it easy to rapidly develop and deploy big
data analytics. Monitoring and troubleshooting performance issues is critical when operating production Azure
Databricks workloads. To identify common performance issues, it's helpful to use monitoring visualizations based
on telemetry data.

Prerequisites
To set up the Grafana dashboards shown in this article:
Configure your Databricks cluster to send telemetry to a Log Analytics workspace, using the Azure
Databricks Monitoring Library. For details, see the GitHub readme.
Deploy Grafana in a virtual machine. See Use dashboards to visualize Azure Databricks metrics.
The Grafana dashboard that is deployed includes a set of time-series visualizations. Each graph is a time-series
plot of metrics related to an Apache Spark job, the stages of the job, and the tasks that make up each stage.
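As a quick sanity check before relying on the dashboards, you can query the workspace directly to confirm that
Spark telemetry is arriving. The following sketch assumes the Azure CLI log-analytics extension and the
SparkListenerEvent_CL custom log type created by the monitoring library; substitute your own workspace ID.

# Verify that Spark events have landed in the workspace within the last hour.
az extension add --name log-analytics
az monitor log-analytics query \
  --workspace <your Log Analytics workspace ID> \
  --analytics-query "SparkListenerEvent_CL | where TimeGenerated > ago(1h) | count"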

Azure Databricks performance overview


Azure Databricks is based on Apache Spark, a general-purpose distributed computing system. Application code,
known as a job , executes on an Apache Spark cluster, coordinated by the cluster manager. In general, a job is the
highest-level unit of computation. A job represents the complete operation performed by the Spark application. A
typical operation includes reading data from a source, applying data transformations, and writing the results to
storage or another destination.
Jobs are broken down into stages . The job advances through the stages sequentially, which means that later
stages must wait for earlier stages to complete. Stages contain groups of identical tasks that can be executed in
parallel on multiple nodes of the Spark cluster. Tasks are the most granular unit of execution taking place on a
subset of the data.
The next sections describe some dashboard visualizations that are useful for performance troubleshooting.

Job and stage latency


Job latency is the duration of a job execution from when it starts until it completes. It is shown as percentiles of a
job execution per cluster and application ID, to allow the visualization of outliers. The following graph shows a job
history where the 90th percentile reached 50 seconds, even though the 50th percentile was consistently around 10
seconds.
Investigate job execution by cluster and application, looking for spikes in latency. Once clusters and applications
with high latency are identified, move on to investigate stage latency.
Stage latency is also shown as percentiles to allow the visualization of outliers. Stage latency is broken out by
cluster, application, and stage name. Identify spikes in task latency in the graph to determine which tasks are
holding back completion of the stage.

The cluster throughput graph shows the number of jobs, stages, and tasks completed per minute. This helps you to
understand the workload in terms of the relative number of stages and tasks per job. Here you can see that the
number of jobs per minute ranges between 2 and 6, while the number of stages is about 12 – 24 per minute.
Sum of task execution latency
This visualization shows the sum of task execution latency per host running on a cluster. Use this graph to detect
tasks that run slowly due to the host slowing down on a cluster, or a misallocation of tasks per executor. In the
following graph, most of the hosts have a sum of about 30 seconds. However, two of the hosts have sums that
hover around 10 minutes. Either the hosts are running slow or the number of tasks per executor is misallocated.

The number of tasks per executor shows that two executors are assigned a disproportionate number of tasks,
causing a bottleneck.
Task metrics per stage
The task metrics visualization gives the cost breakdown for a task execution. You can use it to see the relative
time spent on tasks such as serialization and deserialization. This data might show opportunities to optimize — for
example, by using broadcast variables to avoid shipping data. The task metrics also show the shuffle data size for a
task, and the shuffle read and write times. If these values are high, it means that a lot of data is moving across the
network.
Another task metric is the scheduler delay, which measures how long it takes to schedule a task. Ideally, this value
should be low compared to the executor compute time, which is the time spent actually executing the task.
The following graph shows a scheduler delay time (3.7 s) that exceeds the executor compute time (1.1 s). That
means more time is spent waiting for tasks to be scheduled than doing the actual work.

In this case, the problem was caused by having too many partitions, which caused a lot of overhead. Reducing the
number of partitions lowered the scheduler delay time. The next graph shows that most of the time is spent
executing the task.
Streaming throughput and latency
Streaming throughput is directly related to structured streaming. There are two important metrics associated with
streaming throughput: Input rows per second and processed rows per second. If input rows per second outpaces
processed rows per second, it means the stream processing system is falling behind. Also, if the input data comes
from Event Hubs or Kafka, then input rows per second should keep up with the data ingestion rate at the front end.
Two jobs can have similar cluster throughput but very different streaming metrics. The following screenshot shows
two different workloads. They are similar in terms of cluster throughput (jobs, stages, and tasks per minute). But
the second run processes 12,000 rows/sec versus 4,000 rows/sec.

Streaming throughput is often a better business metric than cluster throughput, because it measures the number
of data records that are processed.

Resource consumption per executor


These metrics help to understand the work that each executor performs.
Percentage metrics measure how much time an executor spends on various things, expressed as a ratio of time
spent versus the overall executor compute time. The metrics are:
% Serialize time
% Deserialize time
% CPU executor time
% JVM time
These visualizations show how much each of these metrics contributes to overall executor processing.

Shuffle metrics are metrics related to data shuffling across the executors.
Shuffle I/O
Shuffle memory
File system usage
Disk usage

Common performance bottlenecks


Two common performance bottlenecks in Spark are task stragglers and a non-optimal shuffle partition count.
Task stragglers
The stages in a job are executed sequentially, with earlier stages blocking later stages. If one task executes a shuffle
partition more slowly than other tasks, all tasks in the cluster must wait for the slow task to catch up before the
stage can end. This can happen for the following reasons:
1. A host or group of hosts is running slow. Symptoms: High task, stage, or job latency and low cluster
throughput. The summation of task latencies per host won't be evenly distributed. However, resource
consumption will be evenly distributed across executors.
2. Tasks have an expensive aggregation to execute (data skewing). Symptoms: High task latency, high stage
latency, high job latency, or low cluster throughput, but the summation of latencies per host is evenly
distributed. Resource consumption will be evenly distributed across executors.
3. If partitions are of unequal size, a larger partition may cause unbalanced task execution (partition skewing).
Symptoms: Executor resource consumption is high compared to other executors running on the cluster. All
tasks running on that executor will run slow and hold the stage execution in the pipeline. Those stages are
said to be stage barriers.
Non-optimal shuffle partition count
During a structured streaming query, the assignment of a task to an executor is a resource-intensive operation for
the cluster. If the shuffle data isn't the optimal size, the amount of delay for a task will negatively impact throughput
and latency. If there are too few partitions, the cores in the cluster will be underutilized which can result in
processing inefficiency. Conversely, if there are too many partitions, there's a great deal of management overhead
for a small number of tasks.
Use the resource consumption metrics to troubleshoot partition skewing and misallocation of executors on the
cluster. If a partition is skewed, executor resources will be elevated in comparison to other executors running on
the cluster.
For example, the following graph shows that the memory used by shuffling on the first two executors is 90X
bigger than the other executors:
Transferring data to and from Azure

There are several options for transferring data to and from Azure, depending on your needs.

Physical transfer
Using physical hardware to transfer data to Azure is a good option when:
Your network is slow or unreliable.
Getting additional network bandwidth is cost-prohibitive.
Security or organizational policies do not allow outbound connections when dealing with sensitive data.
If your primary concern is how long it will take to transfer your data, you may want to run a test to verify whether
network transfer is actually slower than physical transport.
There are two main options for physically transporting data to Azure:
Azure Import/Export. The Azure Import/Export service lets you securely transfer large amounts of data to
Azure Blob Storage or Azure Files by shipping internal SATA HDDs or SSDs to an Azure datacenter. You can
also use this service to transfer data from Azure Storage to hard disk drives and have these shipped to you
for loading on-premises.
Azure Data Box . Azure Data Box is a Microsoft-provided appliance that works much like the Azure
Import/Export service. Microsoft ships you a proprietary, secure, and tamper-resistant transfer appliance
and handles the end-to-end logistics, which you can track through the portal. One benefit of the Azure Data
Box service is ease of use. You don't need to purchase several hard drives, prepare them, and transfer files to
each one. Azure Data Box is supported by a number of industry-leading Azure partners to make it easier to
seamlessly use offline transport to the cloud from their products.

Command line tools and APIs


Consider these options when you want scripted and programmatic data transfer.
Azure CLI . The Azure CLI is a cross-platform tool that allows you to manage Azure services and upload data
to Azure Storage.
AzCopy. Use AzCopy from a Windows or Linux command line to easily copy data to and from Azure Blob,
File, and Table storage with optimal performance (a usage sketch follows this list). AzCopy supports
concurrency and parallelism, and the ability to resume copy operations when interrupted. You can also use
AzCopy to copy data from AWS to Azure. For programmatic access, the Microsoft Azure Storage Data
Movement Library is the core framework that powers AzCopy. It is provided as a .NET Core library.
PowerShell . The Start-AzureStorageBlobCopy PowerShell cmdlet is an option for Windows administrators
who are used to PowerShell.
AdlCopy . AdlCopy enables you to copy data from Azure Storage Blobs into Data Lake Store. It can also be
used to copy data between two Azure Data Lake Store accounts. However, it cannot be used to copy data
from Data Lake Store to Storage Blobs.
Distcp . If you have an HDInsight cluster with access to Data Lake Store, you can use Hadoop ecosystem
tools like Distcp to copy data to and from an HDInsight cluster storage (WASB) into a Data Lake Store
account.
Sqoop . Sqoop is an Apache project and part of the Hadoop ecosystem. It comes preinstalled on all
HDInsight clusters. It allows data transfer between an HDInsight cluster and relational databases such as
SQL, Oracle, MySQL, and so on. Sqoop is a collection of related tools, including import and export. Sqoop
works with HDInsight clusters using either Azure Storage blobs or Data Lake Store attached storage.
PolyBase . PolyBase is a technology that accesses data outside of the database through the T-SQL language.
In SQL Server 2016, it allows you to run queries on external data in Hadoop or to import/export data from
Azure Blob Storage. In Azure Synapse Analytics, you can import/export data from Azure Blob Storage and
Azure Data Lake Store. Currently, PolyBase is the fastest method of importing data into Azure Synapse.
Hadoop command line. When you have data that resides on an HDInsight cluster head node, you can use
the hadoop fs -copyFromLocal command to copy that data to your cluster's attached storage, such as Azure
Storage blob or Azure Data Lake Store. In order to use the Hadoop command, you must first connect to the
head node. Once connected, you can upload a file to storage (see the sketch after this list).
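As a rough illustration of two of the scripted options above, the following sketch copies a local directory to Blob
storage with AzCopy (v10 syntax assumed) and copies a local file into cluster-attached storage with the Hadoop
CLI. The account name, container, SAS token, and paths are placeholders.

# Copy a local directory into a Blob container (AzCopy v10).
azcopy copy "/data/export" \
  "https://<account>.blob.core.windows.net/<container>?<sas-token>" \
  --recursive

# From an HDInsight head node, copy a local file to the cluster's attached storage.
hadoop fs -copyFromLocal /tmp/sample.csv /example/data/sample.csv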

Graphical interface
Consider the following options if you are only transferring a few files or data objects and don't need to automate
the process.
Azure Storage Explorer . Azure Storage Explorer is a cross-platform tool that lets you manage the contents
of your Azure storage accounts. It allows you to upload, download, and manage blobs, files, queues, tables,
and Azure Cosmos DB entities. Use it with Blob storage to manage blobs and folders, as well as upload and
download blobs between your local file system and Blob storage, or between storage accounts.
Azure portal. Both Blob storage and Data Lake Store provide a web-based interface for exploring files and
uploading new files one at a time. This is a good option if you do not want to install any tools or issue
commands to quickly explore your files, or to simply upload a handful of new ones.

Data pipeline
Azure Data Factory. Azure Data Factory is a managed service best suited for regularly transferring files between
a number of Azure services, on-premises, or a combination of the two. Using Azure Data Factory, you can create
and schedule data-driven workflows (called pipelines) that ingest data from disparate data stores. It can process
and transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake
Analytics, and Azure Machine Learning. Create data-driven workflows for orchestrating and automating data
movement and data transformation.

Key Selection Criteria


For data transfer scenarios, choose the appropriate system for your needs by answering these questions:
Do you need to transfer very large amounts of data, where doing so over an Internet connection would take
too long, be unreliable, or be too expensive? If yes, consider physical transfer.
Do you prefer to script your data transfer tasks, so they are reusable? If so, select one of the command line
options or Azure Data Factory.
Do you need to transfer a very large amount of data over a network connection? If so, select an option that
is optimized for big data.
Do you need to transfer data to or from a relational database? If yes, choose an option that supports one or
more relational databases. Note that some of these options also require a Hadoop cluster.
Do you need an automated data pipeline or workflow orchestration? If yes, consider Azure Data Factory.
Capability matrix
The following tables summarize the key differences in capabilities.
Physical transfer
Capability | Azure Import/Export service | Azure Data Box
Form factor | Internal SATA HDDs or SSDs | Secure, tamper-proof, single hardware appliance
Microsoft manages shipping logistics | No | Yes
Integrates with partner products | No | Yes
Custom appliance | No | Yes

Command line tools


Hadoop/HDInsight:

Capability | Distcp | Sqoop | Hadoop CLI
Optimized for big data | Yes | Yes | Yes
Copy to relational database | No | Yes | No
Copy from relational database | No | Yes | No
Copy to Blob storage | Yes | Yes | Yes
Copy from Blob storage | Yes | Yes | No
Copy to Data Lake Store | Yes | Yes | Yes
Copy from Data Lake Store | Yes | Yes | No

Other:

Capability | Azure CLI | AzCopy | PowerShell | AdlCopy | PolyBase
Compatible platforms | Linux, OS X, Windows | Linux, Windows | Windows | Linux, OS X, Windows | SQL Server, Azure Synapse
Optimized for big data | No | Yes | No | Yes [1] | Yes [2]
Copy to relational database | No | No | No | No | Yes
Copy from relational database | No | No | No | No | Yes
Copy to Blob storage | Yes | Yes | Yes | No | Yes
Copy from Blob storage | Yes | Yes | Yes | Yes | Yes
Copy to Data Lake Store | No | Yes | Yes | Yes | Yes
Copy from Data Lake Store | No | No | Yes | Yes | Yes

[1] AdlCopy is optimized for transferring big data when used with a Data Lake Analytics account.
[2] PolyBase performance can be increased by pushing computation to Hadoop and using PolyBase scale-out
groups to enable parallel data transfer between SQL Server instances and Hadoop nodes.
Graphical interface and Azure Data Factory
Capability | Azure Storage Explorer | Azure portal * | Azure Data Factory
Optimized for big data | No | No | Yes
Copy to relational database | No | No | Yes
Copy from relational database | No | No | Yes
Copy to Blob storage | Yes | No | Yes
Copy from Blob storage | Yes | No | Yes
Copy to Data Lake Store | No | No | Yes
Copy from Data Lake Store | No | No | Yes
Upload to Blob storage | Yes | Yes | Yes
Upload to Data Lake Store | Yes | Yes | Yes
Orchestrate data transfers | No | No | Yes
Custom data transformations | No | No | Yes
Pricing model | Free | Free | Pay per usage

* Azure portal in this case means using the web-based exploration tools for Blob storage and Data Lake Store.
Extending on-premises data solutions to the cloud

When organizations move workloads and data to the cloud, their on-premises datacenters often continue to play
an important role. The term hybrid cloud refers to a combination of public cloud and on-premises datacenters, to
create an integrated IT environment that spans both. Some organizations use hybrid cloud as a path to migrate
their entire datacenter to the cloud over time. Other organizations use cloud services to extend their existing on-
premises infrastructure.
This article describes some considerations and best practices for managing data in a hybrid cloud solution.

When to use a hybrid solution


Consider using a hybrid solution in the following scenarios:
As a transition strategy during a longer-term migration to a fully cloud-native solution.
When regulations or policies do not permit moving specific data or workloads to the cloud.
For disaster recovery and fault tolerance, by replicating data and services between on-premises and cloud
environments.
To reduce latency between your on-premises datacenter and remote locations, by hosting part of your
architecture in Azure.

Challenges
Creating a consistent environment in terms of security, management, and development, and avoiding
duplication of work.
Creating a reliable, low latency and secure data connection between your on-premises and cloud
environments.
Replicating your data and modifying applications and tools to use the correct data stores within each
environment.
Securing and encrypting data that is hosted in the cloud but accessed from on-premises, or vice versa.

On-premises data stores


On-premises data stores include databases and files. There may be several reasons to keep these data stores local.
There may be regulations or policies that do not permit moving specific data or workloads to the cloud. Data
sovereignty, privacy, or security concerns may favor on-premises placement. During a migration, you may want to
keep some data local to an application that hasn't been migrated yet.
Considerations in placing application data in a public cloud include:
Cost . The cost of storage in Azure can be significantly lower than the cost of maintaining storage with
similar characteristics in an on-premises datacenter. Of course, many companies have existing investments
in high-end SANs, so these cost advantages may not reach full fruition until existing hardware ages out.
Elastic scale . Planning and managing data capacity growth in an on-premises environment can be
challenging, particularly when data growth is difficult to predict. These applications can take advantage of
the capacity-on-demand and virtually unlimited storage available in the cloud. This consideration is less
relevant for applications that consist of relatively static-sized datasets.
Disaster recovery. Data stored in Azure can be automatically replicated within an Azure region and across
geographic regions. In hybrid environments, these same technologies can be used to replicate between on-
premises and cloud-based data stores.

Extending data stores to the cloud


There are several options for extending on-premises data stores to the cloud. One option is to have on-premises
and cloud replicas. This can help achieve a high level of fault tolerance, but may require making changes to
applications to connect to the appropriate data store in the event of a failover.
Another option is to move a portion of the data to cloud storage, while keeping the more current or more highly
accessed data on-premises. This method can provide a more cost-effective option for long-term storage, as well as
improve data access response times by reducing your operational data set.
A third option is to keep all data on-premises, but use cloud computing to host applications. To do this, you would
host your application in the cloud and connect it to your on-premises data store over a secure connection.

Azure Stack
For a complete hybrid cloud solution, consider using Microsoft Azure Stack. Azure Stack is a hybrid cloud platform
that lets you provide Azure services from your datacenter. This helps maintain consistency between on-premises
and Azure, by using identical tools and requiring no code changes.
The following are some use cases for Azure and Azure Stack:
Edge and disconnected solutions . Address latency and connectivity requirements by processing data
locally in Azure Stack and then aggregating in Azure for further analytics, with common application logic
across both.
Cloud applications that meet varied regulations . Develop and deploy applications in Azure, with the
flexibility to deploy the same applications on-premises on Azure Stack to meet regulatory or policy
requirements.
Cloud application model on-premises . Use Azure to update and extend existing applications or build
new ones. Use consistent DevOps processes across Azure in the cloud and Azure Stack on-premises.

SQL Server data stores


If you are running SQL Server on-premises, you can use Microsoft Azure Blob Storage service for backup and
restore. For more information, see SQL Server Backup and Restore with Microsoft Azure Blob Storage Service. This
capability gives you limitless offsite storage, and the ability to share the same backups between SQL Server
running on-premises and SQL Server running in a virtual machine in Azure.
Azure SQL Database is a managed relational database-as-a-service. Because Azure SQL Database uses the
Microsoft SQL Server Engine, applications can access data in the same way with both technologies. Azure SQL
Database can also be combined with SQL Server in useful ways. For example, the SQL Server Stretch Database
feature lets an application access what looks like a single table in a SQL Server database while some or all rows of
that table might be stored in Azure SQL Database. This technology automatically moves data that's not accessed for
a defined period of time to the cloud. Applications reading this data are unaware that any data has been moved to
the cloud.
Maintaining data stores on-premises and in the cloud can be challenging when you desire to keep the data
synchronized. You can address this with SQL Data Sync, a service built on Azure SQL Database that lets you
synchronize the data you select, bi-directionally across multiple Azure SQL databases and SQL Server instances.
While Data Sync makes it easy to keep your data up-to-date across these various data stores, it should not be used
for disaster recovery or for migrating from on-premises SQL Server to Azure SQL Database.
For disaster recovery and business continuity, you can use AlwaysOn Availability Groups to replicate data across
two or more instances of SQL Server, some of which can be running on Azure virtual machines in another
geographic region.

Network shares and file-based data stores


In a hybrid cloud architecture, it is common for an organization to keep newer files on-premises while archiving
older files to the cloud. This is sometimes called file tiering, where there is seamless access to both sets of files, on-
premises and cloud-hosted. This approach helps to minimize network bandwidth usage and access times for newer
files, which are likely to be accessed the most often. At the same time, you get the benefits of cloud-based storage
for archived data.
Organizations may also wish to move their network shares entirely to the cloud. This would be desirable, for
example, if the applications that access them are also located in the cloud. This procedure can be done using data
orchestration tools.
Azure StorSimple offers the most complete integrated storage solution for managing storage tasks between your
on-premises devices and Azure cloud storage. StorSimple is an efficient, cost-effective, and easily manageable
storage area network (SAN) solution that eliminates many of the issues and expenses associated with enterprise
storage and data protection. It uses the proprietary StorSimple 8000 series device, integrates with cloud services,
and provides a set of integrated management tools.
Another way to use on-premises network shares alongside cloud-based file storage is with Azure Files. Azure Files
offers fully managed file shares that you can access with the standard Server Message Block (SMB) protocol
(sometimes referred to as CIFS). You can mount Azure Files as a file share on your local computer, or use them with
existing applications that access local or network share files.
To synchronize file shares in Azure Files with your on-premises Windows Servers, use Azure File Sync. One major
benefit of Azure File Sync is the ability to tier files between your on-premises file server and Azure Files. This lets
you keep only the newest and most recently accessed files locally.
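As a minimal sketch, the following mounts an Azure Files share over SMB on a Linux host; the storage account
name, share name, and account key are placeholders, and port 445 must be reachable from the client.

# Mount an Azure Files share over SMB 3.0.
sudo mkdir -p /mnt/archive
sudo mount -t cifs //<account>.file.core.windows.net/<share> /mnt/archive \
  -o vers=3.0,username=<account>,password=<storage-account-key>,dir_mode=0777,file_mode=0777,serverino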
For more information, see Deciding when to use Azure Blob storage, Azure Files, or Azure Disks.

Hybrid networking
This article focused on hybrid data solutions, but another consideration is how to extend your on-premises
network to Azure. For more information about this aspect of hybrid solutions, see:
Choose a solution for connecting an on-premises network to Azure
Hybrid network reference architectures
Securing data solutions

For many, making data accessible in the cloud, particularly when transitioning from working exclusively in on-
premises data stores, can cause some concern around increased accessibility to that data and new ways in which to
secure it.

Challenges
Centralizing the monitoring and analysis of security events stored in numerous logs.
Implementing encryption and authorization management across your applications and services.
Ensuring that centralized identity management works across all of your solution components, whether on-
premises or in the cloud.

Data Protection
The first step to protecting information is identifying what to protect. Develop clear, simple, and well-communicated
guidelines to identify, protect, and monitor the most important data assets anywhere they reside. Establish the
strongest protection for assets that have a disproportionate impact on the organization's mission or profitability.
These are known as high value assets, or HVAs. Perform stringent analysis of HVA lifecycle and security
dependencies, and establish appropriate security controls and conditions. Similarly, identify and classify sensitive
assets, and define the technologies and processes to automatically apply security controls.
Once the data you need to protect has been identified, consider how you will protect the data at rest and data in
transit.
Data at rest : Data that exists statically on physical media, whether magnetic or optical disk, on premises or in
the cloud.
Data in transit : Data while it is being transferred between components, locations or programs, such as over
the network, across a service bus (from on-premises to cloud and vice-versa), or during an input/output
process.
To learn more about protecting your data at rest or in transit, see Azure Data Security and Encryption Best Practices.

Access Control
Central to protecting your data in the cloud is a combination of identity management and access control. Given the
variety and type of cloud services, as well as the rising popularity of hybrid cloud, there are several key practices
you should follow when it comes to identity and access control:
Centralize your identity management.
Enable Single Sign-On (SSO).
Deploy password management.
Enforce multi-factor authentication for users.
Use role based access control (RBAC).
Configure Conditional Access Policies, which enhance the classic concept of user identity with additional
properties related to user location, device type, patch level, and so on.
Control locations where resources are created by using Azure Resource Manager.
Actively monitor for suspicious activities.
For more information, see Azure Identity Management and access control security best practices.

Auditing
Beyond the identity and access monitoring previously mentioned, the services and applications that you use in the
cloud should be generating security-related events that you can monitor. The primary challenge to monitoring
these events is handling the quantities of logs, in order to avoid potential problems or troubleshoot past ones.
Cloud-based applications tend to contain many moving parts, most of which generate some level of logging and
telemetry. Use centralized monitoring and analysis to help you manage and make sense of the large amount of
information.
For more information, see Azure Logging and Auditing.

Securing data solutions in Azure


Encryption
Virtual machines. Use Azure Disk Encryption to encrypt the attached disks on Windows or Linux VMs. This
solution integrates with Azure Key Vault to control and manage the disk-encryption keys and secrets.
Azure Storage . Use Azure Storage Service Encryption to automatically encrypt data at rest in Azure Storage.
Encryption, decryption, and key management are totally transparent to users. Data can also be secured in transit by
using client-side encryption with Azure Key Vault. For more information, see Client-Side Encryption and Azure Key
Vault for Microsoft Azure Storage.
SQL Database and Azure Synapse Analytics . Use Transparent Data Encryption (TDE) to perform real-time
encryption and decryption of your databases, associated backups, and transaction log files without requiring any
changes to your applications. SQL Database can also use Always Encrypted to help protect sensitive data at rest on
the server, during movement between client and server, and while the data is in use. You can use Azure Key Vault to
store your Always Encrypted encryption keys.
Rights management
Azure Rights Management is a cloud-based service that uses encryption, identity, and authorization policies to
secure files and email. It works across multiple devices — phones, tablets, and PCs. Information can be protected
both within your organization and outside your organization because that protection remains with the data, even
when it leaves your organization's boundaries.
Access control
Use role-based access control (RBAC) to restrict access to Azure resources based on user roles. If you are using
Active Directory on-premises, you can synchronize with Azure AD to provide users with a cloud identity based on
their on-premises identity.
Use Conditional access in Azure Active Directory to enforce controls on the access to applications in your
environment based on specific conditions. For example, your policy statement could take the form of: When
contractors are trying to access our cloud apps from networks that are not trusted, then block access.
Azure AD Privileged Identity Management can help you manage, control, and monitor your users and what sorts of
tasks they are performing with their admin privileges. This is an important step to limiting who in your
organization can carry out privileged operations in Azure AD, Azure, Microsoft 365, or SaaS apps, as well as
monitor their activities.
Network
To protect data in transit, always use SSL/TLS when exchanging data across different locations. Sometimes you
need to isolate your entire communication channel between your on-premises and cloud infrastructure by using
either a virtual private network (VPN) or ExpressRoute. For more information, see Extending on-premises data
solutions to the cloud.
Use network security groups to reduce the number of potential attack vectors. A network security group contains a
list of security rules that allow or deny inbound or outbound network traffic based on source or destination IP
address, port, and protocol.
Use Virtual Network service endpoints to secure Azure SQL or Azure Storage resources, so that only traffic from
your virtual network can access these resources.
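The following sketch shows both controls with the Azure CLI; the resource names, priority, address range, and port
are placeholders and should be adapted to your environment.

# Allow SQL traffic only from the application subnet.
az network nsg rule create \
  --resource-group myResourceGroup --nsg-name data-nsg \
  --name allow-sql-from-app-subnet --priority 100 \
  --direction Inbound --access Allow --protocol Tcp \
  --source-address-prefixes 10.0.1.0/24 --destination-port-ranges 1433

# Enable service endpoints so only traffic from this subnet reaches Storage and SQL.
az network vnet subnet update \
  --resource-group myResourceGroup --vnet-name app-vnet --name app-subnet \
  --service-endpoints Microsoft.Storage Microsoft.Sql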
VMs within an Azure Virtual Network (VNet) can securely communicate with other VNets using virtual network
peering. Network traffic between peered virtual networks is private. Traffic between the virtual networks is kept on
the Microsoft backbone network.
For more information, see Azure network security
Monitoring
Azure Security Center automatically collects, analyzes, and integrates log data from your Azure resources, the
network, and connected partner solutions, such as firewall solutions, to detect real threats and reduce false
positives.
Log Analytics provides centralized access to your logs and helps you analyze that data and create custom alerts.
Azure SQL Database Threat Detection detects anomalous activities indicating unusual and potentially harmful
attempts to access or exploit databases. Security officers or other designated administrators can receive an
immediate notification about suspicious database activities as they occur. Each notification provides details of the
suspicious activity and recommends how to further investigate and mitigate the threat.
Run Apache Cassandra on Azure VMs

This article describes performance considerations for running Apache Cassandra on Azure virtual machines.
These recommendations are based on the results of performance tests, which you can find on GitHub. You should
use these recommendations as a baseline and then test against your own workload.

Azure VM sizes and disk types


Cassandra workloads on Azure commonly use either Standard_DS14_v2 or Standard_DS13_v2 virtual machines.
Cassandra workloads benefit from having more memory in the VM, so consider memory optimized virtual
machine sizes, such as Standard_DS14_v2, or local-storage optimized sizes such as Standard_L16s_v2.
For durability, data and commit logs are commonly stored on a stripe set of two to four 1-TB premium managed
disks (P30).
Cassandra nodes should not be too data-dense. We recommend having at most 1 – 2 TB of data per VM and
enough free space for compaction. To achieve the highest possible combined throughput and IOPS using premium
managed disks, we recommend creating a stripe set from a few 1-TB disks, instead of using a single 2-TB or 4-TB
disk. For example, on a DS14_v2 VM, four 1-TB disks have a maximum IOPS of 4 × 5000 = 20 K, versus 7.5 K for a
single 4-TB disk.
As Azure Ultra Disks become more widely available across regions, evaluate them for Cassandra workloads that
need smaller disk capacity. They can provide higher IOPS/throughput and lower latency on VM sizes like
Standard_D32s_v3 and Standard_D16s_v3.
For Cassandra workloads that don't need durable storage — that is, where data can be easily reconstructed from
another storage medium — consider using Standard_L32s_v2 or Standard_L16s_v2 VMs. These VMs sizes have
large and fast local temporary NVMe disks.
For more information, see Comparing performance of Azure local/ephemeral vs attached/persistent disks (GitHub).

Accelerated Networking
Cassandra nodes make heavy use of the network to send and receive data from the client VM and to communicate
between nodes for replication. For optimal performance, Cassandra VMs benefit from a high-throughput,
low-latency network.
We recommend enabling Accelerated Networking on the NIC of the Cassandra node and on VMs running client
applications that access Cassandra.
Accelerated Networking requires a modern Linux distribution with up-to-date drivers, such as CentOS 7.5+ or
Ubuntu 16.x/18.x. For more information, see Create a Linux virtual machine with Accelerated Networking.
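As a sketch, Accelerated Networking can be enabled on an existing NIC with the Azure CLI; the resource names are
placeholders, and the VM may need to be deallocated first depending on its size and OS image.

# Enable Accelerated Networking on a Cassandra node's NIC.
az network nic update \
  --resource-group myResourceGroup \
  --name cassandra-node0-nic \
  --accelerated-networking true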

Azure VM data disk caching


Cassandra read workloads perform best when random-access disk latency is low. We recommend using Azure
managed disks with ReadOnly caching enabled. ReadOnly caching provides lower average latency, because the
data is read from the cache on the host instead of going to the backend storage.
Read-heavy, random-read workloads like Cassandra benefit from the lower read latency even though cached mode
has lower throughput limits than uncached mode. (For example, DS14_v2 virtual machines have a maximum
cached throughput of 512 MB/s versus uncached of 768 MB/s.)
ReadOnly caching is particularly helpful for Cassandra time-series and other workloads where the working dataset
fits in the host cache and data is not constantly overwritten. For example, DS14_v2 provides a cache size of 512 GB,
which could store up to 50% of the data from a Cassandra node with 1-2 TB data density.
There is no significant performance penalty from cache-misses when ReadOnly caching is enabled, so cached mode
is recommended for all but the most write-heavy workloads.
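As a sketch, the following attaches a new 1-TB premium data disk with ReadOnly host caching to a Cassandra
node; the VM and disk names are placeholders.

# Attach a new P30-class data disk with ReadOnly host caching.
az vm disk attach \
  --resource-group myResourceGroup \
  --vm-name cassandra-node0 \
  --name cassandra-node0-data0 \
  --new --size-gb 1024 --sku Premium_LRS \
  --caching ReadOnly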
For more information, see Comparing Azure VM data disk caching configurations (GitHub).

Linux read-ahead
In most Linux distributions in the Azure Marketplace, the default block device read-ahead setting is 4096 KB.
Cassandra's read IOs are usually random and relatively small. Therefore, having a large read-ahead wastes
throughput by reading parts of files that aren't needed.
To minimize unnecessary lookahead, set the Linux block device read-ahead setting to 8 KB. (See Recommended
production settings in the DataStax documentation.)
Configure 8 KB read-ahead for all block devices in the stripe set and on the array device itself (for example,
/dev/md0 ).
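For example, the following sketch sets an 8 KB read-ahead (16 sectors of 512 bytes) on each data disk and on the
array device; the device names are examples, and the setting does not persist across reboots unless you also add a
udev rule or startup script.

# Set read-ahead to 8 KB on the data disks and the md array, then verify.
for dev in /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/md0; do
  sudo blockdev --setra 16 "$dev"
done
sudo blockdev --report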

For more information, see Comparing impact of disk read-ahead settings (GitHub).

Disk array mdadm chunk size


When running Cassandra on Azure, it's common to create an mdadm stripe set (that is, RAID 0) of multiple data
disks to increase the overall disk throughput and IOPS closer to the VM limits. Optimal disk stripe size is an
application-specific setting. For example, for SQL Server OLTP workloads, the recommendation is 64 KB. For data
warehousing workloads, the recommendation is 256 KB.
Our tests found no significant difference between chunk sizes of 64K, 128K, and 256K for Cassandra read
workloads. There seems to be a small but noticeable advantage to the 128K chunk size. Therefore, we
recommend the following:
If you're already using a chunk size of 64 K or 256 K, it doesn't make sense to rebuild the disk array for a
128 K chunk size.
For a new configuration, it makes sense to use 128 K from the beginning (see the sketch below).
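The following sketch builds such an array with a 128 KiB chunk from four data disks and formats it; the device
names are examples for illustration.

# Create a RAID 0 stripe set with a 128 KiB chunk and format it.
sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=128 \
  /dev/sdc /dev/sdd /dev/sde /dev/sdf
sudo mkfs.ext4 /dev/md0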
For more information, see Measuring impact of mdadm chunk sizes on Cassandra performance (GitHub).

Commit log filesystem


Cassandra writes perform best when commit logs are on disks with high throughput and low latency. In the default
configuration, Cassandra 3.x flushes data from memory to the commit log file every ~10 seconds and doesn't
touch the disk for every write. In this configuration, write performance is almost identical whether the commit log
is on premium attached disks versus local/ephemeral disks.
Commit logs must be durable, so that a restarted node can reconstruct any data not yet in data files from the
flushed commit logs. For better durability, store commit logs on premium managed disks and not on local storage,
which can be lost if the VM is migrated to another host.
Based on our tests, Cassandra on CentOS 7.x may have lower write performance when commit logs are on the xfs
versus ext4 filesystem. Turning on commit log compression brings xfs performance in line with ext4. Compressed
xfs performs as well as compressed and non-compressed ext4 in our tests.
For more information, see Observations on ext4 and xfs file systems and compressed commit logs (GitHub).

Measuring baseline VM performance


After deploying the VMs for the Cassandra ring, run a few synthetic tests to establish baseline network and disk
performance. Use these tests to confirm that performance is in line with expectations, based on the VM size.
Later, when you run the actual workload, knowing the performance baseline makes it easier to investigate potential
bottlenecks. For example, knowing the baseline performance for network egress on the VM can help to rule out
network as a bottleneck.
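For example, the following sketch runs a quick random-read disk test and a node-to-node network test; fio and
iperf3 must be installed first, and the device path and peer address are placeholders.

# Random 4K reads against the striped data array (read-only, non-destructive).
sudo fio --name=randread --filename=/dev/md0 --rw=randread --bs=4k \
  --iodepth=32 --runtime=60 --time_based --direct=1 --readonly

# Network throughput to a peer node (run "iperf3 -s" on the peer first).
iperf3 -c <peer-node-ip> -t 30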
For more information about running performance tests, see Validating baseline Azure VM Performance (GitHub).
Document size
Cassandra read and write performance depends on the document size. You can expect to see higher latency and
lower operations/second when reading or writing with larger documents.
For more information, see Comparing relative performance of various Cassandra document sizes (GitHub).
Replication factor
Most Cassandra workloads use a replication factor (RF) of 3 when using attached premium disks and even 5 when
using temporary/ephemeral local disks. The number of nodes in the Cassandra ring should be a multiple of the
replication factor. For example, RF 3 implies a ring of 3, 6, 9, or 12 nodes, while RF 5 would have 5, 10, 15, or 20
nodes. When using RF greater than 1 and a consistency level of LOCAL_QUORUM, it's normal for read and write
performance to be lower than the same workload running with RF 1.
For more information, see Comparing relative performance of various replication factors (GitHub).
Linux page caching
When reading data files, Cassandra's Java code uses regular file I/O and benefits from Linux page caching. After
parts of the file are read one time, the read content is stored in the OS page cache. Subsequent read access to the
same data is much faster.
For this reason, when executing read performance tests against the same data, the second and subsequent reads
will appear to be much faster than the original read, which needed to access data on the remote data disk or from
the host cache when ReadOnly is enabled. To get similar performance measurements on subsequent runs, clear the
Linux page cache and restart the Cassandra service to clear its internal memory. When ReadOnly caching is
enabled, the data might be in the host cache, and subsequent reads will be faster even after clearing the OS page
cache and restarting the Cassandra service.
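A minimal sketch for resetting state between test runs, assuming a systemd-managed Cassandra service:

# Flush dirty pages, drop the OS page cache, and restart Cassandra.
sudo sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo systemctl restart cassandra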
For more information, see Observations on Cassandra usage of Linux page caching (GitHub).

Multi-datacenter replication
Cassandra natively supports the concept of multiple data centers, making it easy to configure one Cassandra ring
across multiple Azure regions or across availability zones within one region.
For a multiregion deployment, use Azure Global VNet-peering to connect the virtual networks in the different
regions. When VMs are deployed in the same region but in separate availability zones, the VMs can be in the same
virtual network.
It's important to measure the baseline roundtrip latency between regions. Network latency between regions can be
10-100 times higher than latency within a region. Expect a lag before data appears in the second region when
using LOCAL_QUORUM write consistency, or significantly decreased write performance when using
EACH_QUORUM.
When running Apache Cassandra at scale, and specifically in a multi-DC environment, node repair becomes
challenging. Tools such as Reaper can help to coordinate repairs at scale (for example, across all the nodes in a data
center, one data center at a time, to limit the load on the whole cluster). However, node repair for large clusters is
not yet a fully solved problem and applies in all environments, whether on-premises or in the cloud.
When nodes are added to a secondary region, performance will not scale linearly, because some bandwidth and
CPU/disk resources are spent on receiving and sending replication traffic across regions.
For more information, see Measuring impact of multi-dc cross-region replication (GitHub).
Hinted-handoff configuration
In a multiregion Cassandra ring, write workloads with consistency level of LOCAL_QUORUM may lose data in the
secondary region. By default, Cassandra hinted handoff is throttled to a relatively low maximum throughput and
three-hour hint lifetime. For workloads with heavy writes, we recommend increasing the hinted handoff throttle
and hint window time to ensure hints are not dropped before they are replicated.
For more information, see Observations on hinted handoff in cross-region replication (GitHub).

Next steps
For more information about these performance results, see Cassandra on Azure VMs Performance Experiments.
For information on general Cassandra settings, not specific to Azure, see:
DataStax Recommended Production Settings
Apache Cassandra Hardware Choices
Apache Cassandra Configuration File
The following reference architecture deploys Cassandra as part of an n-tier configuration:
Linux N-tier application in Azure with Apache Cassandra
Azure data platform end-to-end

This example scenario demonstrates how to use the extensive family of Azure Data Services to build a modern data
platform capable of handling the most common data challenges in an organization.
The solution described in this article combines a range of Azure services that will ingest, process, store, serve, and
visualize data from different sources, both structured and unstructured.
This solution architecture demonstrates how a single, unified data platform can be used to meet the most common
requirements for:
Traditional relational data pipelines
Big data transformations
Unstructured data ingestion and enrichment with AI-based functions
Stream ingestion and processing following the Lambda architecture
Serving insights for data-driven applications and rich data visualization

Relevant use cases


This approach can also be used to:
Establish an enterprise-wide data hub consisting of a data warehouse for structured data and a data lake for
semi-structured and unstructured data. This data hub becomes the single source of truth for your data.
Integrate relational data sources with other unstructured datasets with the use of big data processing
technologies.
Use semantic modeling and powerful visualization tools for simpler data analysis.

Architecture
NOTE
The services covered by this architecture are only a subset of a much larger family of Azure services. Similar outcomes can
be achieved by using other services or features not covered by this design.
Specific business requirements for your analytics use case may also ask for the use of different services or features not
considered in this design.

The data flows through the solution as follows (from bottom-up):


Relational databases
1. Use Azure Data Factory pipelines to pull data from a wide variety of databases, both on-premises and in the
cloud. Pipelines can be triggered based on a pre-defined schedule, in response to an event or be explicitly
called via REST APIs.
2. Still part of the Azure Data Factory pipeline, use Azure Data Lake Store Gen 2 to stage the data copied from
the relational databases. You can save the data in delimited text format or compressed as Parquet files.
3. Use Azure Synapse PolyBase capabilities for fast ingestion into your data warehouse tables.
4. Load relevant data from the Azure Synapse data warehouse into Power BI datasets for data visualization.
Power BI models implement a semantic model to simplify the analysis of business data and relationships.
5. Business analysts use Power BI reports and dashboards to analyze data and derive business insights.
Semi-structured data sources
1. Use Azure Data Factory pipelines to pull data from a wide variety of semi-structured data sources, both on-
premises and in the cloud. For example, you can ingest data from file-based locations containing CSV or
JSON files. You can connect to NoSQL databases such as Cosmos DB or MongoDB. Or you can call REST APIs
provided by SaaS applications that will function as your data source for the pipeline.
2. Still part of the Azure Data Factory pipeline, use Azure Data Lake Store Gen 2 to save the original data copied
from the semi-structured data source.
3. Azure Data Factory Mapping Data Flows or Azure Databricks notebooks can now be used to process the
semi-structured data and apply the necessary transformations before data can be used for reporting. You
can save the resulting dataset as Parquet files in the data lake.
4. Use Azure Synapse PolyBase capabilities for fast ingestion into your data warehouse tables.
5. Load relevant data from the Azure Synapse data warehouse into Power BI datasets for data visualization.
Power BI models implement a semantic model to simplify the analysis of business data and relationships.
6. Business analysts use Power BI reports and dashboards to analyze data and derive business insights.
Non-structured data sources
1. Use Azure Data Factory pipelines to pull data from a wide variety of unstructured data sources, both on-
premises and in the cloud. For example, you can ingest video, image or free text log data from file-based
locations. You can also call REST APIs provided by SaaS applications that will function as your data source for
the pipeline.
2. Still part of the Azure Data Factory pipeline, use Azure Data Lake Store Gen 2 to save the original data copied
from the unstructured data source.
3. You can invoke Azure Databricks notebooks from your pipeline to process the unstructured data. The
notebook can make use of Cognitive Services APIs or invoke custom Azure Machine Learning Service
models to generate insights from the unstructured data. You can save the resulting dataset as Parquet files in
the data lake.
4. Use Azure Synapse PolyBase capabilities for fast ingestion into your data warehouse tables.
5. Load relevant data from the Azure Synapse data warehouse into Power BI datasets for data visualization.
Power BI models implement a semantic model to simplify the analysis of business data and relationships.
6. Business analysts use Power BI reports and dashboards to analyze data and derive business insights.
Streaming
1. Use Azure Event Hubs to ingest data streams generated by a client application. The Event Hub will then
ingest and store streaming data preserving the sequence of events received. Consumers can then connect to
Event Hub and retrieve the messages for processing.
2. Configure Event Hubs Capture to save a copy of the events in your data lake (a CLI sketch follows this list). This feature implements the
"Cold Path" of the Lambda architecture pattern and allows you to perform historical and trend analysis on
the stream data saved in your data lake using tools such as Azure Databricks notebooks.
3. Use a Stream Analytics job to implement the "Hot Path" of the Lambda architecture pattern and derive
insights from the stream data in transit. Define at least one input for the data stream coming from your
Event Hub, one query to process the input data stream, and one Power BI output to which the query results
will be sent.
4. Business analysts then use Power BI real-time datasets and dashboard capabilities to visualize the
fast-changing insights generated by your Stream Analytics query.
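The following sketch enables Capture on an existing event hub with the Azure CLI; the resource names, capture
interval, and storage destination are placeholders, and the flag names assume the current az eventhubs command
group.

# Enable Event Hubs Capture so a copy of each event lands in the data lake (cold path).
az eventhubs eventhub update \
  --resource-group myResourceGroup \
  --namespace-name my-eventhub-namespace \
  --name telemetry \
  --enable-capture true \
  --capture-interval 300 \
  --destination-name EventHubArchive.AzureBlockBlob \
  --storage-account <storage-account-resource-id> \
  --blob-container capture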

Architecture components
The following Azure services have been used in the architecture:
Azure Data Factory
Azure Data Lake Gen2
Azure Synapse Analytics
Azure Databricks
Azure Cosmos DB
Azure Cognitive Services
Azure Event Hubs
Azure Stream Analytics
Microsoft Power BI
If you need further training resources or access to technical documentation, the table below links to Microsoft
Learn and to each service's Technical Documentation.

Azure service | Microsoft Learn | Technical documentation
Azure Data Factory | Data ingestion with Azure Data Factory | Azure Data Factory technical documentation
Azure Synapse Analytics | Implement a Data Warehouse with Azure Synapse Analytics | Azure Synapse Analytics technical documentation
Azure Data Lake Storage Gen2 | Large Scale Data Processing with Azure Data Lake Storage Gen2 | Azure Data Lake Storage Gen2 technical documentation
Azure Cognitive Services | Cognitive Services Learning Paths and Modules | Azure Cognitive Services technical documentation
Azure Cosmos DB | Work with NoSQL data in Azure Cosmos DB | Azure Cosmos DB technical documentation
Azure Databricks | Perform data engineering with Azure Databricks | Azure Databricks technical documentation
Azure Event Hubs | Enable reliable messaging for Big Data applications using Azure Event Hubs | Azure Event Hubs technical documentation
Azure Stream Analytics | Implement a Data Streaming Solution with Azure Streaming Analytics | Azure Stream Analytics technical documentation
Power BI | Create and use analytics reports with Power BI | Power BI technical documentation

Alternatives
For situations where device management, authentication, and provisioning are required, Azure IoT Hub may
be a preferred solution over Event Hubs. Event Hubs should still be considered for other streaming data
sources.
In the architecture above, Azure Data Factory is the service responsible for data pipeline orchestration. Azure
Databricks can also be used to perform the same role through the execution of nested notebooks.
In the architecture above, Azure Stream Analytics is the service responsible for processing streaming data.
Azure Databricks can also be used to perform the same role through the execution of notebooks.
In the architecture above, Azure Databricks was used to invoke Cognitive Services. You can also make use of
Azure Functions to invoke Azure Cognitive Services from an Azure Data Factory Pipeline.
For comparisons of other alternatives, see:
Choosing a data pipeline orchestration technology in Azure
Choosing a batch processing technology in Azure
Choosing an analytical data store in Azure
Choosing a data analytics technology in Azure

Considerations
The technologies in this architecture were chosen because each of them provides the necessary functionality to handle the vast majority of data challenges in an organization. These services meet the requirements for scalability and availability, while helping to control costs.
The massively parallel processing architecture of Azure Synapse provides scalability and high performance.
Azure Synapse has guaranteed SLAs and recommended practices for achieving high availability.
When analysis activity is low, the company can scale Azure Synapse on demand, reducing or even pausing compute to lower costs.
The Azure Synapse security model provides connection security, authentication and authorization via Azure AD or SQL Server authentication, and encryption.

Pricing
The ideal pricing tier for each service, and the total overall cost of the architecture, depend on the amount of data to be processed and stored and on the performance level you expect. Use the guide below to learn more about how each service is priced:
Azure Synapse allows you to scale your compute and storage levels independently. Compute resources are
charged per hour, and you can scale or pause these resources on demand. Storage resources are billed per
terabyte, so your costs will increase as you ingest more data.
Data Factory costs are based on the number of read/write operations, monitoring operations, and orchestration
activities performed in a workload. Your Data Factory costs will increase with each additional data stream and
the amount of data processed by each one.
Power BI has different product options for different requirements. Power BI Embedded provides an Azure-based
option for embedding Power BI functionality inside your applications. A Power BI Embedded instance is included
in the pricing sample above.

Next steps
Find comprehensive architectural guidance on data pipelines, data warehousing, online analytical processing
(OLAP), and big data in the Azure Data Architecture Guide.
Big data analytics with Azure Data Explorer
11/2/2020 • 2 minutes to read • Edit Online

This architecture pattern demonstrates big data analytics over large volumes of high velocity data from various
sources. This pattern illustrates how Azure Data Explorer and Azure Synapse Analytics complement each other for
near real-time analytics and modern data warehousing use cases.
This pattern is already being used by Microsoft customers. For example, the Singapore-based ride-hailing company,
Grab, implemented real-time analytics over a huge amount of data collected from their taxi and food delivery
services as well as merchant partner apps. The team from Grab presented their solution at MS Ignite in this video
(20:30 onwards). Using this pattern, Grab processed more than a trillion events per day.

Data flow
1. Raw structured, semi-structured, and unstructured (free text) data such as any type of logs, business events, and
user activities can be ingested into Azure Data Explorer from various sources.
2. Ingest data into Azure Data Explorer with low-latency and high throughput using its connectors for Azure Data
Factory, Azure Event Hub, Azure IoT Hub, Kafka, and so on. Alternatively, ingest data through Azure Storage (Blob
or ADLS Gen2), which uses Azure Event Grid and triggers the ingestion pipeline to Azure Data Explorer. You can
also continuously export data to Azure Storage in compressed, partitioned parquet format and seamlessly query
that data as detailed in the Continuous data export overview.
3. Export pre-aggregated data from Azure Data Explorer to Azure Storage, and then ingest the data into Synapse
Analytics to build data models and reports.
4. Use Azure Data Explorer's native capabilities to process, aggregate, and analyze data. To get insights at lightning speed, build near real-time analytics dashboards using Azure Data Explorer dashboards, Power BI, Grafana, or other tools (see the query sketch after this list). Use Azure Synapse Analytics to build a modern data warehouse and combine it with the Azure Data Explorer data to generate BI reports on curated and aggregated data models.
5. Azure Data Explorer provides native advanced analytics capabilities for time series analysis, pattern recognition,
anomaly detection and forecasting, and machine learning. Azure Data Explorer is also well integrated with ML
services such as Databricks and Azure Machine Learning. This integration allows you to build models using
other tools and services and export ML models to Azure Data Explorer for scoring data.
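As a minimal sketch of the analysis described in step 4, the Python snippet below runs a Kusto (KQL) aggregation against Azure Data Explorer using the azure-kusto-data package; the cluster URI, database name, table, and columns are hypothetical.

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Placeholder cluster URI; authentication here reuses an existing Azure CLI login.
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://<cluster>.<region>.kusto.windows.net"
)
client = KustoClient(kcsb)

# Hypothetical table and columns: count events per minute by type over the last 15 minutes.
query = """
Events
| where Timestamp > ago(15m)
| summarize count() by bin(Timestamp, 1m), EventType
"""
response = client.execute("TelemetryDb", query)
for row in response.primary_results[0]:
    print(row)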

Components
Azure Event Hub: Fully managed, real-time data ingestion service that’s simple, trusted, and scalable.
Azure IoT Hub: Managed service to enable bi-directional communication between IoT devices and Azure.
Kafka on HDInsight: Easy, cost-effective, enterprise-grade service for open source analytics with Apache Kafka.
Azure Data Explorer: Fast, fully managed and highly scalable data analytics service for real-time analysis on large
volumes of data streaming from applications, websites, IoT devices, and more.
Azure Data Explorer Dashboards: Natively export Kusto queries that were explored in the Web UI to optimized
dashboards.
Azure Synapse Analytics: Analytics service that brings together enterprise data warehousing and Big Data
analytics.

Next steps
For more information, see Azure Data Explorer documentation.
Campaign Optimization with Azure HDInsight Spark
Clusters
11/2/2020 • 4 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

This solution demonstrates how to build and deploy a machine learning model with Microsoft R Server on Azure
HDInsight Spark clusters to recommend actions to maximize the purchase rate of leads targeted by a campaign.
This solution enables efficient handling of big data on Spark with Microsoft R Server.

Architecture

Download an SVG of this architecture.

Description
This solution will create an HDInsight Spark cluster with Microsoft R Server. This cluster will contain two head
nodes, two worker nodes, and one edge node with a total of 32 cores. The approximate cost for this HDInsight
Spark cluster is $8.29/hour. Billing starts once a cluster is created and stops when the cluster is deleted. Billing is
pro-rated per minute, so you should always delete your cluster when it is no longer in use. Use the Deployments
page to delete the entire solution once you are done.

Overview
When a business launches a marketing campaign to interest customers in new or existing product(s), they often use
a set of business rules to select leads for their campaign to target. Machine learning can be used to help increase
the response rate from these leads. This solution demonstrates how to use a model to predict actions that are
expected to maximize the purchase rate of leads targeted by the campaign. These predictions serve as the basis for
recommendations to be used by a renewed campaign on how to contact (for example, e-mail, SMS, or cold call) and
when to contact (day of week and time of day) the targeted leads. The solution presented here uses simulated data
from the insurance industry to model responses of the leads to the campaign. The model predictors include
demographic details of the leads, historical campaign performance, and product-specific details. The model predicts
the probability that each lead in the database makes a purchase from a channel, on each day of the week, at various times of day. Recommendations on which channel, day of week, and time of day to use when targeting users are then based on the channel and timing combination that the model predicts will have the highest probability of a purchase being made.
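The recommendation step described above reduces to picking, for each lead, the channel/day/time combination with the highest predicted purchase probability. The pandas sketch below illustrates that selection on a toy scored dataset; the column names and values are hypothetical and not the solution's actual schema.

import pandas as pd

# Toy scored data: one row per (lead, channel, day, time) combination.
scored = pd.DataFrame({
    "lead_id":    [1, 1, 1, 2, 2],
    "channel":    ["Email", "SMS", "Cold Call", "Email", "SMS"],
    "day":        ["Mon", "Tue", "Wed", "Mon", "Fri"],
    "time":       ["AM", "PM", "AM", "PM", "AM"],
    "p_purchase": [0.12, 0.31, 0.18, 0.05, 0.22],
})

# Keep, for each lead, the combination with the highest predicted probability.
recommendations = scored.loc[scored.groupby("lead_id")["p_purchase"].idxmax()]
print(recommendations)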

Business Perspective
This solution employs machine learning leveraging historical campaign data to predict customer responses and
recommend when and how to connect with your leads. The recommendations include the best channel to contact a
lead (in our example, email, SMS, or cold call), the best day of the week and the best time of day in which to make
the contact.
Microsoft R Server on HDInsight Spark clusters provides distributed and scalable machine learning capabilities for
big data, leveraging the combined power of R Server and Apache Spark. This solution demonstrates how to
develop machine learning models for marketing campaign optimization (including data processing, feature
engineering, training and evaluating models), deploy the models as a web service (on the edge node) and consume
the web service remotely with Microsoft R Server on Azure HDInsight Spark clusters. The final predictions and
recommendation table are saved to a Hive table containing recommendations for how and when to contact each
lead. This data is then visualized in Power BI.
Power BI also presents visual summaries of the effectiveness of the campaign recommendations (shown here with
simulated data). You can try out this dashboard by clicking the Try it Now button to the right.
The Recommendations tab of this dashboard shows the predicted recommendations. At the top is a table of
individual leads for our new deployment. This includes fields for the lead ID, campaign, and product, populated with
leads on which our business rules are to be applied. This is followed by the model predictions for the leads, giving
the optimal channel and time to contact each one, along with the estimated probabilities that the leads will buy our
product using these recommendations. These probabilities can be used to increase the efficiency of the campaign
by limiting the number of leads contacted to the subset most likely to buy.
Also on the Recommendations tab are various summaries of recommendations and demographic information on
the leads. The Campaign Summary tab of the dashboard shows summaries of the historical data used to create the
predicted recommendations. While this tab also shows values of Day of Week, Time of Day, and Channel, these
values are actual past observations, not to be confused with the recommendations from the model, shown on the
Recommendations tab.

Data Scientist Perspective


This solution demonstrates the end-to-end process of how to develop and deploy machine learning models for
marketing campaign optimization. It contains sample data, R code for each step of building the model (including
data processing, feature engineering, training and evaluating models along with sample data), deploying the model
as a web service (on the edge node) and consuming the web service remotely with Microsoft R Server on Azure
HDInsight Spark clusters.
Data scientists who are testing this solution can work with the provided R code from the browser-based Open
Source Edition of RStudio Server that runs on the Edge Node of the Azure HDInsight Spark cluster. By setting the
compute context the user can decide where the computation will be performed: locally on the edge node, or
distributed across the nodes in the Spark cluster. All the R code can also be found in a public GitHub repository. Have fun!
Campaign Optimization with SQL Server
11/2/2020 • 5 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

This solution demonstrates how to build and deploy a machine learning model with SQL Server 2016 with R
Services to recommend actions to maximize the purchase rate of leads targeted by a campaign.

Architecture

Download an SVG of this architecture.

Overview
When a business launches a marketing campaign to interest customers in new or existing product(s), they often use
a set of business rules to select leads for their campaign to target. Machine learning can be used to help increase
the response rate from these leads. This solution demonstrates how to use a model to predict actions that are
expected to maximize the purchase rate of leads targeted by the campaign. These predictions serve as the basis for
recommendations to be used by a renewed campaign on how to contact (for example, e-mail, SMS, or cold call) and
when to contact (day of week and time of day) the targeted leads. The solution presented here uses simulated data
from the insurance industry to model responses of the leads to the campaign. The model predictors include
demographic details of the leads, historical campaign performance, and product-specific details. The model predicts
the probability that each lead in the database makes a purchase from a channel, on each day of the week, at various times of day. Recommendations on which channel, day of week, and time of day to use when targeting users are then based on the channel and timing combination that the model predicts will have the highest probability of a purchase being made.
The Microsoft Marketing Campaign Optimization solution is a combination of a machine learning prediction model and an interactive visualization tool, Power BI. The solution is used to increase the response rate to a campaign by
recommending the channel to contact (for example, e-mail, SMS, or cold call) as well as when to contact (day of
week and time of day) targeted leads for use in a new campaign. The solution uses simulated data, which can easily
be configured to use your own organization's data, to model the acquisition campaign response. The model uses
predictors such as demographics, historical campaign performance and product details. The solution predicts the
probability of a lead conversion from each channel, at various times of the day and days of the week, for every lead
in the database. The final recommendation for targeting each lead is decided based upon the combination of
channel, day of week and time of day with the highest probability of conversion. The solution has been modeled
after a standardized data science process, where the data preparation, model training and evaluation can be easily
done by a data scientist and the insights visualized and correlated to KPIs by marketing via Power BI visualization.

Business Manager Perspective


This solution template uses (simulated) historical data to predict how and when to contact leads for your campaign.
The recommendations include the best channel to contact a lead (in our example, email, SMS, or cold call), the best
day of the week and the best time of day in which to make the contact.
SQL Server R Services brings the compute to the data by allowing R to run on the same computer as the database.
It includes a database service that runs outside the SQL Server process and communicates securely with the R
runtime.
This solution packet shows how to create and refine data, train R models, and perform predictions on the SQL
Server machine. The final predictions table in SQL Server provides recommendations for how and when to contact
each lead. This data is then visualized in Power BI.
Power BI also presents visual summaries of the effectiveness of the campaign recommendations (shown here with
simulated data). You can try out this dashboard by clicking the Try it Now link.
The Recommendations tab of this dashboard shows the predicted recommendations. At the top is a table of
individual leads for our new deployment. This includes fields for the lead ID, campaign and product, populated with
leads on which our business rules are to be applied. This is followed by the model predictions for the leads, giving
the optimal channel and time to contact each one, and then the estimated probabilities that the leads will buy our
product using these recommendations. These probabilities can be used to increase the efficiency of the campaign
by limiting the number of leads contacted to the subset most likely to buy.
Also on the Recommendations tab are various summaries of recommendations and demographic information on
the leads.
The Campaign Summary tab of the dashboard shows summaries of the historical data used to create the predicted
recommendations. While this tab also shows values of Day of Week, Time of Day, and Channel, these values are
actual past observations, not to be confused with the recommendations shown on the Recommendations tab.

Data Scientist Perspective


SQL Server R Services brings the compute to the data by running R on the computer that hosts the database. It
includes a database service that runs outside the SQL Server process and communicates securely with the R
runtime.
This solution walks through the steps to create and refine data, train R models, and perform scoring on the SQL
Server machine. The final scored database table in SQL Server gives the recommendations for how and when to
contact each lead. This data is then visualized in PowerBI, which also contains a summary of the success of the
recommendations used in your new campaign after it has completed. (Simulated data is shown in this template to
illustrate the feature.)
Data scientists who are testing and developing solutions can work from the convenience of their R IDE on their
client machine, while pushing the compute to the SQL Server machine. The completed solutions are deployed to
SQL Server 2016 by embedding calls to R in stored procedures. These solutions can then be further automated
with SQL Server Integration Services and SQL Server agent.
Click on the Deploy button to test the automation and the entire solution will be made available in your Azure
subscription.
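To make "embedding calls to R in stored procedures" concrete, here is a minimal sketch of the underlying T-SQL pattern, invoked from Python with pyodbc purely for illustration. The server, database, table, and columns are hypothetical, and the embedded R script is only a placeholder for the real scoring logic.

import pyodbc

# Placeholder connection details.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=<server>;DATABASE=<db>;Trusted_Connection=yes"
)

tsql = """
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        # InputDataSet is supplied by SQL Server; a real solution would load the trained
        # model and call rxPredict here. This placeholder simply echoes the leads.
        OutputDataSet <- InputDataSet
    ',
    @input_data_1 = N'SELECT TOP (10) lead_id, campaign_id, product_id FROM dbo.Leads'
WITH RESULT SETS ((lead_id INT, campaign_id INT, product_id INT));
"""
for row in conn.execute(tsql):
    print(row)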

Pricing
Your Azure subscription used for the deployment will incur consumption charges on the services used in this
solution, approximately $1.15/hour for the default VM.
Please ensure that you stop your VM instance when not actively using the solution. Running the VM will incur
higher costs.
Please delete the solution if you are not using it.
DataOps for the modern data warehouse
11/2/2020 • 6 minutes to read • Edit Online

A modern data warehouse (MDW) lets you easily bring all of your data together at any scale. It doesn't matter if it's
structured, unstructured, or semi-structured data. You can gain insights from an MDW through analytical dashboards,
operational reports, or advanced analytics for all your users.
Setting up an MDW environment for both development (dev) and production (prod) environments is complex.
Automating the process is key. It helps increase productivity while minimizing the risk of errors.
This article describes how a fictional city planning office could use this solution. The solution provides an end-to-
end data pipeline that follows the MDW architectural pattern, along with corresponding DevOps and DataOps
processes, to assess parking use and make more informed business decisions.

Solution requirements
Ability to collect data from different sources or systems.
Infrastructure as code: deploy new dev and staging (stg) environments in an automated manner.
Deploy application changes across different environments in an automated manner:
Implement Continuous Integration/Continuous Delivery (CI/CD) pipelines.
Use deployment gates for manual approvals.
Pipeline as Code: ensure the CI/CD pipeline definitions are in source control.
Carry out integration tests on changes using a sample data set.
Run pipelines on a scheduled basis.
Support future agile development, including the addition of data science workloads.
Support for both row-level and object-level security:
The security feature is available in SQL Database.
You can also find it in Azure Synapse Analytics, Azure Analysis Services (AAS) and Power BI.
Support for 10 concurrent dashboard users and 20 concurrent power users.
The data pipeline should carry out data validation and filter out malformed records to a specified store.
Support monitoring.
Centralized configuration in a secure storage like Azure Key Vault.

Potential use cases


This article uses the fictional city of Contoso to describe the use case scenario. In the narrative, Contoso owns and
manages parking sensors for the city. It also owns the APIs that connect to and get data from the sensors. They
need a platform that will collect data from many different sources. The data then must be validated, cleansed, and
transformed to a known schema. Contoso city planners can then explore and assess report data on parking use
with data visualization tools, like Power BI, to determine whether they need more parking or related resources.
Architecture
The following diagram shows the overall architecture of the solution.

Azure Data Factory (ADF) orchestrates and Azure Data Lake Storage (ADLS) Gen2 stores the data:
1. The Contoso city parking web service API is available to transfer data from the parking spots.
2. There's an ADF copy job that transfers the data into the Landing schema.
3. Next, Azure Databricks cleanses and standardizes the data. It takes the raw data and conditions it so data
scientists can use it.
4. If validation reveals any bad data, it gets dumped into the Malformed schema (see the sketch after this list).

IMPORTANT
People have asked why the data isn't validated before it's stored in ADLS. The reason is that the validation might
introduce a bug that could corrupt the dataset. If you introduce a bug at this step, you can fix the bug and replay
your pipeline. If you dumped the bad data before you added it to ADLS, then the corrupted data is useless because
you can't replay your pipeline.

5. There's a second Azure Databricks transform step that converts the data into a format that you can store in
the data warehouse.
6. Finally, the pipeline serves the data in two different ways:
a. Databricks makes the data available to the data scientist so they can train models.
b. Polybase moves the data from the data lake to Azure Synapse Analytics and Power BI accesses the
data and presents it to the business user.
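The validation step (step 4) can be implemented in Azure Databricks roughly as follows. This is a minimal PySpark sketch that assumes a Databricks notebook where spark is preconfigured; the storage paths, container names, and validation rule are hypothetical.

from pyspark.sql import functions as F

# Read raw landed data (hypothetical path).
raw = spark.read.json("abfss://landing@<datalake>.dfs.core.windows.net/parking/")

# Hypothetical validation rule: required identifier and timestamp must be present.
is_valid = F.col("sensor_id").isNotNull() & F.col("timestamp").isNotNull()

# Valid rows continue through the pipeline; malformed rows go to the Malformed store.
raw.filter(is_valid).write.mode("append").format("delta") \
    .save("abfss://standardized@<datalake>.dfs.core.windows.net/parking/")
raw.filter(~is_valid).write.mode("append").format("delta") \
    .save("abfss://malformed@<datalake>.dfs.core.windows.net/parking/")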

Components
The solution uses these components:
Azure Data Factory (ADF)
Azure Databricks
Azure Data Lake Storage (ADLS) Gen2
Azure Synapse Analytics
Azure Key Vault
Azure DevOps
Power BI

Deploy the solution


The following list contains the high-level steps required to set up the Parking Sensors solution with corresponding
Build and Release Pipelines. You can find detailed setup steps and prerequisites in this Azure Samples repository.
Setup and deployment
1. Initial setup: Install any prerequisites, import the Azure Samples GitHub repository into your own
repository, and set required environment variables.
2. Deploy Azure resources: The solution comes with an automated deployment script. It deploys all
necessary Azure resources and AAD service principals per environment. The script also deploys Azure
DevOps pipelines, variable groups, and service connections.
3. Set up git integration in dev Data Factory: Configure git integration to work with the imported GitHub
repository.
4. Carry out an initial build and release: Create a sample change in Data Factory, like enabling a schedule
trigger, then watch the change automatically deploy across environments.
Deployed resources
If deployment is successful, there should be three resource groups in Azure representing three environments: dev,
stg, and prod. There should also be end-to-end build and release pipelines in Azure DevOps that can automatically
deploy changes across these three environments.
For a detailed list of all resources, see the Deployed Resources section of the DataOps - Parking Sensor Demo
README.
Continuous integration and continuous delivery
The diagram below demonstrates the CI/CD process and sequence for the build and release pipelines.
1. Developers develop in their own sandbox environments within the dev resource group and commit changes
into their own short-lived git branches. For example, <developer_name>/<branch_name> .
2. When changes are complete, developers raise a pull request (PR) to the master branch for review. Doing so
automatically kicks off the PR validation pipeline, which runs the unit tests, linting, and data-tier application
package (DACPAC) builds.
3. On completion of the PR validation, the commit to master will trigger a build pipeline that publishes all
necessary build artifacts.
4. The completion of a successful build pipeline will trigger the first stage of the release pipeline. Doing so
deploys the published build artifacts into the dev environment, except for ADF.
Developers manually publish to the dev ADF from the collaboration branch (master). The manual publishing
updates the Azure Resource Manager (ARM) templates in the adf_publish branch.
5. The successful completion of the first stage triggers a manual approval gate.
On Approval, the release pipeline continues with the second stage, deploying changes to the stg
environment.
6. Run integration tests to test changes in the stg environment.
7. Upon successful completion of the second stage, the pipeline triggers a second manual approval gate.
On Approval, the release pipeline continues with the third stage, deploying changes to the prod
environment.
For more information, read the Build and Release Pipeline section of the README.
Testing
The solution includes support for both unit testing and integration testing. It uses pytest-adf and the Nutter Testing
Framework. For more information, see the Testing section of the README.
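As a rough sketch of what such an integration test can look like (this is not the pytest-adf helper itself), the following pytest test triggers a Data Factory pipeline with the azure-mgmt-datafactory SDK and asserts that the run succeeds; all resource names are hypothetical.

import time
import pytest
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

@pytest.mark.integration
def test_parking_pipeline_succeeds():
    # Hypothetical subscription, resource group, factory, and pipeline names.
    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    run_id = adf.pipelines.create_run(
        "rg-dataops-stg", "adf-dataops-stg", "pl_parking_ingest"
    ).run_id

    # Poll until the run reaches a terminal state.
    while True:
        run = adf.pipeline_runs.get("rg-dataops-stg", "adf-dataops-stg", run_id)
        if run.status not in ("Queued", "InProgress"):
            break
        time.sleep(30)

    assert run.status == "Succeeded"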
Observability and monitoring
The solution supports observability and monitoring for Databricks and Data Factory. For more information, see the
Observability/Monitoring section of the README.

Considerations
The following list summarizes key learnings and best practices demonstrated by this sample solution:

NOTE
Each item in the list below links out to the related Key Learnings section in the docs for the parking sensor solution
example on GitHub.

Use Data Tiering in your Data Lake.


Validate data early in your pipeline.
Make your data pipelines replayable and idempotent.
Ensure data transformation code is testable.
Have a CI/CD pipeline.
Secure and centralize configuration.
Monitor infrastructure, pipelines, and data.

Next steps
If you'd like to deploy the solution, follow the steps in the How to use the sample section of the DataOps -
Parking Sensor Demo README.

Related resources
Solution code samples on GitHub
Visit the project page on GitHub
Observability/monitoring
Azure Databricks
Monitoring Azure Databricks with Azure Monitor
Monitoring Azure Databricks Jobs with Application Insights
Data Factory
Monitor Azure Data Factory with Azure Monitor
Create alerts to proactively monitor your data factory pipelines
Synapse Analytics
Monitoring resource utilization and query activity in Azure Synapse Analytics
Monitor your Azure Synapse Analytics SQL pool workload using DMVs
Azure Storage
Monitor Azure Storage
Resiliency and disaster recovery
Azure Databricks
Regional disaster recovery for Azure Databricks clusters
Data Factory
Create and configure a self-hosted integration runtime - High availability and scalability
Synapse Analytics
Geo-backups and Disaster Recovery
Geo-restore for SQL Pool
Azure Storage
Disaster recovery and storage account failover
Best practices for using Azure Data Lake Storage Gen2 – High availability and Disaster Recovery
Azure Storage Redundancy
Videos
For a detailed walk-through of the solution and key concepts, watch the following video recording: DataDevOps for
the Modern Data Warehouse on Microsoft Azure
Data Streaming scenario
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Use AKS to easily ingest & process a real-time data stream with millions of data points collected via sensors.
Perform fast analysis and computations to develop insights into complex scenarios quickly.

Architecture

Download an SVG of this architecture.

Data Flow
1. Sensor data is generated and streamed to Azure API Management.
2. The AKS cluster runs microservices that are deployed as containers behind a service mesh. Containers are built using a DevOps process and stored in Azure Container Registry.
3. The ingest service stores data in Azure Cosmos DB (see the sketch after this list).
4. Asynchronously, the analysis service receives the data and streams it to Apache Kafka and Azure HDInsight.
5. Data scientists can analyze the large volumes of data for use in machine learning models, using Splunk.
6. Data is processed by the processing service, which stores the result in Azure Database for PostgreSQL and caches the data in Azure Cache for Redis.
7. A web app running in Azure App Service is used to visualize the results.
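A minimal sketch of the ingest service's write path (step 3), using the azure-cosmos SDK; the account endpoint, key, database, container, and document shape are all placeholders.

from azure.cosmos import CosmosClient

# Placeholder endpoint and key.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("telemetry").get_container_client("readings")

# Hypothetical reading; the container is assumed to be partitioned on /sensorId.
container.upsert_item({
    "id": "sensor-01|2020-11-02T10:00:00Z",
    "sensorId": "sensor-01",
    "temperature": 21.7,
})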
Data cache
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Azure Cache for Redis perfectly complements Azure database services such as Cosmos DB. It provides a cost-effective solution to scale the read and write throughput of your data tier. Store and share database query results, session state, static content, and more using a common cache-aside pattern.
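A minimal sketch of the cache-aside pattern with Azure Cache for Redis using redis-py; the host name, access key, and the load_product_from_db helper are placeholders for your own data access code.

import json
import redis

# Placeholder host and access key; Azure Cache for Redis requires TLS on port 6380.
cache = redis.StrictRedis(
    host="<name>.redis.cache.windows.net", port=6380, password="<access-key>", ssl=True
)

def get_product(product_id: str) -> dict:
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                # cache hit
    product = load_product_from_db(product_id)   # hypothetical database call (cache miss)
    cache.setex(key, 300, json.dumps(product))   # keep it cached for five minutes
    return product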

Architecture
Download an SVG of this architecture.
Digital Campaign Management
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Engage with customers around the world with rich, personalized digital marketing experiences. Quickly build and
launch digital campaigns that automatically scale based on customer demand.

Architecture

Download an SVG of this architecture.


Digital Marketing using Azure Database for MySQL
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Engage with customers around the world with rich, personalized digital marketing experiences. Quickly build and
launch digital campaigns that automatically scale based on customer demand.

Architecture

Download an SVG of this architecture.


Finance management apps using Azure Database for
MySQL
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Securely store critical data and provide high-value analytics and insights over aggregated data to users, using in-
built security and performance.

Architecture

Download an SVG of this architecture.


Finance management apps using Azure Database for
PostgreSQL
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Securely store critical data and provide high-value analytics and insights over aggregated data to users, using in-
built security and performance.

Architecture
Gaming using Azure Database for MySQL
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Scale your databases elastically to accommodate unpredictable bursts of traffic and deliver low-latency multi-
player experiences.

Architecture

Download an SVG of this architecture.


Gaming using Cosmos DB
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Elastically scale your database to accommodate unpredictable bursts of traffic and deliver low-latency multi-player
experiences on a global scale.

Architecture

Download an SVG of this architecture.


Globally distributed applications using Cosmos DB
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Guarantee access to users around the world with the high-availability and low-latency capabilities built into
Microsoft's global datacenters.

Architecture

Download an SVG of this architecture.


Hybrid ETL with existing on-premises SSIS and Azure
Data Factory
11/2/2020 • 5 minutes to read • Edit Online

Organizations that migrate their SQL Server databases to the cloud can realize tremendous cost savings,
performance gains, added flexibility, and greater scalability. However, reworking existing extract, transform, and
load (ETL) processes built with SQL Server Integration Services (SSIS) can be a migration roadblock. In other cases,
the data load process requires complex logic and/or specific data tool components that are not yet supported by
Azure Data Factory v2. Commonly used SSIS capabilities include Fuzzy Lookup and Fuzzy Grouping
transformations, Change Data Capture (CDC), Slowly Changing Dimensions (SCD), and Data Quality Services
(DQS).
To facilitate a "lift and shift" migration of an existing SQL database, a hybrid ETL approach may be the most suitable
option. A hybrid approach uses Data Factory as the primary orchestration engine, but continues to leverage
existing SSIS packages to clean data and work with on-premises resources. This approach uses the Data Factory
SQL Server Integrated Runtime (IR) to enable a "lift and shift" migration of existing databases into the cloud, while
using existing code and SSIS packages.
This example scenario is relevant to organizations that are moving databases to the cloud and are considering
using Data Factory as their primary cloud-based ETL engine while incorporating existing SSIS packages into their
new cloud data workflow. Many organizations have invested significantly in developing SSIS ETL packages for
specific data tasks. Rewriting these packages can be daunting. Also, many existing code packages have
dependencies on local resources, preventing migration to the cloud.
Data Factory lets customers take advantage of their existing ETL packages while limiting further investment in on-
premises ETL development. This example discusses potential use cases for leveraging existing SSIS packages as
part of a new cloud data workflow using Azure Data Factory v2.

Potential use cases


Traditionally, SSIS has been the ETL tool of choice for many SQL Server data professionals for data transformation
and loading. Sometimes, specific SSIS features or third-party plug-in components have been used to accelerate
the development effort. Replacement or redevelopment of these packages may not be an option, which prevents
customers from migrating their databases to the cloud. Customers are looking for low impact approaches to
migrating their existing databases to the cloud and taking advantage of their existing SSIS packages.
Several potential on-premises use cases are listed below:
Loading network router logs to a database for analysis.
Preparing human resources employment data for analytical reporting.
Loading product and sales data into a data warehouse for sales forecasting.
Automating loading of operational data stores or data warehouses for finance and accounting.

Architecture
1. Data is sourced from Blob storage into Data Factory.
2. The Data Factory pipeline invokes a stored procedure to execute an SSIS job hosted on-premises via the Integrated Runtime (see the sketch after this list).
3. The data cleansing jobs are executed to prepare the data for downstream consumption.
4. Once the data cleansing task completes successfully, a copy task is executed to load the clean data into Azure.
5. The clean data is then loaded into tables in Azure Synapse Analytics.
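As a sketch of what the stored procedure in step 2 might do, the T-SQL below starts an SSIS package through the SSISDB catalog; it is shown invoked from Python with pyodbc purely for illustration, and the folder, project, package, and server names are hypothetical.

import pyodbc

tsql = """
DECLARE @execution_id BIGINT;
EXEC SSISDB.catalog.create_execution
     @folder_name = N'OnPremEtl',
     @project_name = N'DataCleansing',
     @package_name = N'CleanseSales.dtsx',
     @use32bitruntime = 0,
     @execution_id = @execution_id OUTPUT;
EXEC SSISDB.catalog.start_execution @execution_id;
SELECT @execution_id AS execution_id;
"""

# Placeholder on-premises SQL Server reachable by the self-hosted integration runtime.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=<onprem-sql>;DATABASE=SSISDB;Trusted_Connection=yes",
    autocommit=True,
)
execution_id = conn.execute(tsql).fetchone()[0]
print(f"Started SSIS execution {execution_id}")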
Components
Blob storage is used to store files and as a source for Data Factory to retrieve data.
SQL Server Integration Services contains the on-premises ETL packages used to execute task-specific
workloads.
Azure Data Factory is the cloud orchestration engine that takes data from multiple sources and combines,
orchestrates, and loads the data into a data warehouse.
Azure Synapse centralizes data in the cloud for easy access using standard ANSI SQL queries.
Alternatives
Data Factory could invoke data cleansing procedures implemented using other technologies, such as a Databricks
notebook, Python script, or SSIS instance running in a virtual machine. Installing paid or licensed custom
components for the Azure-SSIS integration runtime may be a viable alternative to the hybrid approach.

Considerations
The Integrated Runtime (IR) supports two models: self-hosted IR or Azure-hosted IR. You first must decide between
these two options. Self-hosting is more cost effective but has more overhead for maintenance and management.
For more information, see Self-hosted IR. If you need help determining which IR to use, see Determining which IR
to use.
For the Azure-hosted approach, you should decide how much power is required to process your data. The Azure-
hosted configuration allows you to select the VM size as part of the configuration steps. To learn more about
selecting VM sizes, see VM performance considerations.
The decision is much easier when you already have existing SSIS packages that have on-premises dependencies
such as data sources or files that are not accessible from Azure. In this scenario, your only option is the self-hosted
IR. This approach provides the most flexibility to leverage the cloud as the orchestration engine, without having to
rewrite existing packages.
Ultimately, the intent is to move the processed data into the cloud for further refinement or combining with other
data stored in the cloud. As part of the design process, keep track of the number of activities used in the Data
Factory pipelines. For more information, see Pipelines and activities in Azure Data Factory.
Pricing
Data Factory is a cost-effective way to orchestrate data movement in the cloud. The cost is based on several factors:
Number of pipeline executions
Number of entities/activities used within the pipeline
Number of monitoring operations
Number of Integration Runs (Azure-hosted IR or self-hosted IR)
Data Factory uses consumption-based billing. Therefore, cost is only incurred during pipeline executions and
monitoring. The execution of a basic pipeline would cost as little as 50 cents and the monitoring as little as 25
cents. The Azure cost calculator can be used to create a more accurate estimate based on your specific workload.
When running a hybrid ETL workload, you must factor in the cost of the virtual machine used to host your SSIS
packages. This cost is based on the size of the VM ranging from a D1v2 (1 core, 3.5 GB RAM, 50 GB Disk) to E64V3
(64 cores, 432 GB RAM, 1600 GB disk). If you need further guidance on selecting the appropriate VM size, see VM
performance considerations.

Next steps
Learn more about Azure Data Factory.
Get started with Azure Data Factory by following the Step-by-step tutorial.
Provision the Azure-SSIS Integration Runtime in Azure Data Factory.
Intelligent apps using Azure Database for MySQL
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Develop sophisticated, transformational apps using state of the art machine learning algorithms and integrated
visualization tools to get actionable insights and analytics.

Architecture
Intelligent apps using Azure Database for PostgreSQL
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Develop sophisticated, transformational apps using state of the art machine learning algorithms and integrated
visualization tools to get actionable insights and analytics.

Architecture

Download an SVG of this architecture.


Interactive querying with HDInsight
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Perform fast, interactive SQL queries at scale over structured or unstructured data with Apache Hive LLAP.

Architecture

Download an SVG of this architecture.


Loan ChargeOff Prediction with Azure HDInsight
Spark Clusters
11/2/2020 • 3 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

A charged-off loan is a loan that a creditor (usually a lending institution) has declared unlikely to be collected, typically because the debtor is severely delinquent on repayment. Because a high chargeoff rate has a negative impact on a lending institution's year-end financials, lending institutions often monitor loan chargeoff risk very closely to prevent loans from being charged off. Using Azure HDInsight R Server, a lending institution can apply machine learning predictive analytics to predict the likelihood of loans being charged off and run a report on the analytics results stored in HDFS and Hive tables.

Architecture

Download an SVG of this architecture.

Description
This solution will create an HDInsight Spark cluster with Microsoft R Server. This cluster will contain 2 head nodes, 2
worker nodes, and 1 edge node with a total of 32 cores. The approximate cost for this HDInsight Spark cluster is
$8.29/hour. Billing starts once a cluster is created and stops when the cluster is deleted. Billing is pro-rated per
minute, so you should always delete your cluster when it is no longer in use. Use the Deployments page to delete
the entire solution once you are done.

Overview
Lending institutions gain multiple benefits from loan chargeoff prediction data. Charging off a loan is a bank's last resort on a severely delinquent loan; with the prediction data at hand, a loan officer can offer personalized incentives, such as a lower interest rate or a longer repayment period, to help customers keep making loan payments and thus prevent the loan from being charged off. To obtain this type of prediction data, credit unions and banks often handcraft the data manually based on customers' past payment history and perform simple statistical regression analysis. This method is highly prone to data compilation errors and is not statistically sound.
This solution template demonstrates an end-to-end solution that runs predictive analytics on loan data and produces scoring on chargeoff probability. A Power BI report also walks through the analysis and trends of credit loans and the prediction of chargeoff probability.

Business Perspective
This loan chargeoff prediction uses simulated loan history data to predict the probability of loan chargeoff in the immediate future (the next three months). The higher the score, the higher the probability of the loan being charged off in the future.
With the analytics data, the loan manager is also presented with trends and analytics of the charged-off loans by branch location. Characteristics of the high chargeoff-risk loans help loan managers make business plans for loan offerings in that specific geographical area.
Microsoft R Server on HDInsight Spark clusters provides distributed and scalable machine learning capabilities for
big data, using the combined power of R Server and Apache Spark. This solution demonstrates how to develop
machine learning models for predicting loan chargeoff (including data processing, feature engineering, training and
evaluating models), deploy the models as a web service (on the edge node) and consume the web service remotely
with Microsoft R Server on Azure HDInsight Spark clusters. The final predictions are saved to a Hive table that can be visualized in Power BI.
Power BI also presents visual summaries of the loan payments and chargeoff predictions (shown here with
simulated data). You can try out this dashboard by clicking the "Try it Now" button to the right.

Data Scientist Perspective


This solution template walks through the end-to-end process of developing predictive analytics using a set of simulated loan history data to predict loan chargeoff risk. The data contains information such as loan holder demographics, loan amount, contractual loan duration, and loan payment history. The solution template also includes a set of R scripts that perform data processing and feature engineering, run several different algorithms to train the data, and finally select the best-performing model to score the data and produce a probability score for each loan. The solution also includes scripts to deploy the model as a web service (on the edge node) and consume the web service remotely with Microsoft R Server on Azure HDInsight Spark clusters.
Data scientists who are testing this solution can work with the provided R code from the browser-based Open
Source Edition of RStudio Server that runs on the Edge Node of the Azure HDInsight Spark cluster. By setting the
compute context the user can decide where the computation will be performed: locally on the edge node, or
distributed across the nodes in the Spark cluster. All the R code can also be found in a public GitHub repository. Have fun!
Loan ChargeOff Prediction with SQL Server
11/2/2020 • 3 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

This solution demonstrates how to build and deploy a machine learning model with SQL Server 2016 with R
Services to predict if a Bank loan will need to be charged off within next 3 months.

Architecture

Download an SVG of this architecture.

Overview
Lending institutions gain multiple benefits from loan chargeoff prediction data. Charging off a loan is a bank's last resort on a severely delinquent loan; with the prediction data at hand, a loan officer can offer personalized incentives, such as a lower interest rate or a longer repayment period, to help customers keep making loan payments and thus prevent the loan from being charged off. To obtain this type of prediction data, credit unions and banks often handcraft the data manually based on customers' past payment history and perform simple statistical regression analysis. This method is highly prone to data compilation errors and is not statistically sound.
This solution template demonstrates an end-to-end solution that runs predictive analytics on loan data and produces scoring on chargeoff probability. A Power BI report also walks through the analysis and trends of credit loans and the prediction of chargeoff probability.

Business Manager Perspective


This loan chargeoff prediction uses simulated loan history data to predict the probability of loan chargeoff in the immediate future (the next three months). The higher the score, the higher the probability of the loan being charged off in the future.
With the analytics data, the loan manager is also presented with trends and analytics of the charged-off loans by branch location. Characteristics of the high chargeoff-risk loans help loan managers make business plans for loan offerings in that specific geographical area.
SQL Server R Services brings the compute to the data by allowing R to run on the same computer as the database.
It includes a database service that runs outside the SQL Server process and communicates securely with the R
runtime.
This solution template walks through how to create and clean up a set of simulated data, use various algorithms to train the R models, select the best-performing model, perform chargeoff predictions, and save the prediction results back to SQL Server. A Power BI report connects to the prediction results table and shows interactive reports on the predictive analytics.

Data Scientist Perspective


SQL Server R Services brings the compute to the data by running R on the computer that hosts the database. It
includes a database service that runs outside the SQL Server process and communicates securely with the R
runtime.
This solution walks through the steps to create and refine data, train R models, and perform scoring on the SQL
Server machine. The final prediction results are stored in SQL Server. This data is then visualized in Power BI, which also contains a summary of the loan chargeoff analysis and the chargeoff prediction for the next three months. (Simulated data is shown in this template to illustrate the feature.)
Data scientists who are testing and developing solutions can work from the convenience of their R IDE on their
client machine, while pushing the compute to the SQL Server machine. The completed solutions are deployed to
SQL Server 2016 by embedding calls to R in stored procedures. These solutions can then be further automated
with SQL Server Integration Services and SQL Server agent.
Click on the Deploy button to test the automation and the entire solution will be made available in your Azure
subscription.

Pricing
Your Azure subscription used for the deployment will incur consumption charges on the services used in this
solution, approximately $1.15/hour for the default VM.
Please ensure that you stop your VM instance when not actively using the solution. Running the VM will incur
higher costs.
Please delete the solution if you are not using it.
Loan Credit Risk + Default Modeling
11/13/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Scoring credit risk is a complex process. Lenders carefully weigh a variety of quantitative indicators to determine
the probability of default and approve the best candidates based on the information available to them.
This solution acts as a credit-risk analyzer, helping you score credit risk and manage exposure using advanced
analytics models. SQL Server 2016 with R Services equips you with predictive analytics that help assess credit or
loan applications and accept only those that fall above certain criteria. For example, you might use the predicted
scores to help determine whether to grant a loan, then easily visualize the guidance in a Power BI Dashboard.
Data-driven credit-risk modeling reduces the number of loans offered to borrowers who are likely to default,
increasing the profitability of your loan portfolio.

Architecture

Download an SVG of this architecture.

Components
SQL Server R Services: SQL Server stores the lender and borrower data. R-based analytics provide training and
predicted models, as well as predicted results for consumption.
Azure Machine Learning: Machine Learning helps you design, test, operationalize, and manage predictive
analytics solutions in the cloud.
Power BI provides an interactive dashboard with visualization that uses data stored in SQL Server to drive
decisions on the predictions.

Next steps
Get started with SQL Server R Services
Learn more about Machine Learning
Learn more about Power BI
Loan Credit Risk with SQL Server
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Using SQL Server 2016 with R Services, a lending institution can use predictive analytics to reduce the number of loans offered to borrowers who are most likely to default, increasing the profitability of its loan portfolio.

Architecture

Download an SVG of this architecture.

Overview
If we had a crystal ball, we would only loan money to someone we knew would pay us back. A lending institution
can use predictive analytics to reduce the number of loans it offers to those borrowers most likely to default, increasing the profitability of its loan portfolio. This solution uses simulated data for a small personal loan financial institution, building a model to help detect whether a borrower will default on a loan.

Business Perspective
The business user uses the predicted scores to help determine whether or not to grant a loan. He fine-tunes the prediction by using the Power BI dashboard to see the number of loans and the total dollar amount saved under different scenarios. The dashboard includes a filter based on percentiles of the predicted scores. When all the values
are selected, he views all the loans in the testing sample, and can inspect information about how many of them
defaulted. Then by checking just the top percentile (100), he drills down to information about loans with a predicted
score in the top 1%. Checking multiple continuous boxes allows him to find a cutoff point he is comfortable using as a future loan acceptance criterion.
Use the "Try It Now" button below to view the PowerBI Dashboard.

Data Scientist Perspective


SQL Server R Services brings the compute to the data by running R on the computer that hosts the database. It
includes a database service that runs outside the SQL Server process and communicates securely with the R
runtime.
This solution walks through the steps to create and refine data, train R models, and perform scoring on the SQL
Server machine. The final scored database table in SQL Server gives a predicted score for each potential borrower.
This data is then visualized in PowerBI.
Data scientists who are testing and developing solutions can work from the convenience of their R IDE on their
client machine, while pushing the compute to the SQL Server machine. The completed solutions are deployed to
SQL Server 2016 by embedding calls to R in stored procedures. These solutions can then be further automated
with SQL Server Integration Services and SQL Server agent.
Use the "Deploy" button below to create a Virtual Machine that includes the data, R code, SQL code, and a SQL
Server 2016 database (Loans) containing the full solution.

Pricing
Your Azure subscription used for the deployment will incur consumption charges on the services used in this
solution, approximately $1.15/hour for the default VM.
Please ensure that you stop your VM instance when not actively using the solution. Running the VM will incur
higher costs.
Please delete the solution if you are not using it.
Messaging
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Azure Cache for Redis supports standard publish and subscribe functionalities. It's ideal for routing real-time
messages and scaling up web communication frameworks such as SignalR.
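A minimal publish/subscribe sketch with redis-py against Azure Cache for Redis; the host name, access key, and channel name are placeholders.

import redis

# Placeholder host and access key; Azure Cache for Redis requires TLS on port 6380.
r = redis.StrictRedis(
    host="<name>.redis.cache.windows.net", port=6380, password="<access-key>", ssl=True
)

pubsub = r.pubsub()
pubsub.subscribe("notifications")

r.publish("notifications", "build 42 completed")

# listen() blocks; the first item is the subscribe confirmation, so skip non-message types.
for message in pubsub.listen():
    if message["type"] == "message":
        print(message["data"])
        break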

Architecture

Download an SVG of this architecture.


Modern Data Warehouse Architecture
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

A modern data warehouse lets you bring together all your data at any scale easily, and to get insights through
analytical dashboards, operational reports, or advanced analytics for all your users.

Architecture

Download an SVG of this architecture.

Data Flow
1. Combine all your structured, unstructured, and semi-structured data (logs, files, and media) by using Azure Data Factory to copy it into Azure Blob Storage.
2. Leverage data in Azure Blob Storage to perform scalable analytics with Azure Databricks and achieve cleansed
and transformed data.
3. Cleansed and transformed data can be moved to Azure Synapse Analytics to combine with existing structured data, creating one hub for all your data. Leverage the native connectors between Azure Databricks and Azure Synapse Analytics to access and move data at scale (see the sketch after this list).
4. Build operational reports and analytical dashboards on top of the data warehouse to derive insights from the data, and use Azure Analysis Services to serve thousands of end users.
5. Run ad hoc queries directly on data within Azure Databricks.
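A minimal sketch of step 3, writing a cleansed Spark DataFrame to Azure Synapse Analytics with the Azure Synapse (formerly SQL DW) connector built into Azure Databricks. It assumes a Databricks notebook where spark is preconfigured; the JDBC URL, storage paths, and table name are placeholders.

# Hypothetical curated dataset produced by the cleansing step.
df = spark.read.parquet("wasbs://curated@<storageaccount>.blob.core.windows.net/sales/")

(df.write
   .format("com.databricks.spark.sqldw")
   .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<dw>;user=<user>;password=<password>")
   .option("tempDir", "wasbs://staging@<storageaccount>.blob.core.windows.net/tmp/")
   .option("forwardSparkAzureStorageCredentials", "true")
   .option("dbTable", "dbo.FactSales")
   .mode("append")
   .save())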

Components
Azure Synapse Analytics is the fast, flexible, and trusted cloud data warehouse that lets you scale compute and storage elastically and independently, with a massively parallel processing architecture.
Azure Data Factory is a hybrid data integration service that allows you to create, schedule and orchestrate your
ETL/ELT workflows.
Azure Blob storage is a Massively scalable object storage for any type of unstructured data-images, videos,
audio, documents, and more-easily and cost-effectively.
Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform.
Azure Analysis Services is an enterprise grade analytics as a service that lets you govern, deploy, test, and
deliver your BI solution with confidence.
Power BI is a suite of business analytics tools that deliver insights throughout your organization. Connect to
hundreds of data sources, simplify data prep, and drive ad hoc analysis. Produce beautiful reports, then publish
them for your organization to consume on the web and across mobile devices.

Next steps
Synapse Analytics Documentation
Azure Data Factory V2 Preview Documentation
Introduction to object storage in Azure
Azure Databricks Documentation
Analysis Services Documentation
Power BI Documentation

Pricing Calculator
Customize and get pricing estimates
Linux N-tier application in Azure with Apache
Cassandra
11/13/2020 • 13 minutes to read • Edit Online

This reference architecture shows how to deploy virtual machines (VMs) and a virtual network configured for an
N-tier application, using Apache Cassandra on Linux for the data tier. Deploy this solution.

Download a Visio file of this architecture.

Architecture
The architecture has the following components.
General
Resource group . Resource groups are used to group Azure resources so they can be managed by
lifetime, owner, or other criteria.
Availability zones . Availability zones are physical locations within an Azure region. Each zone consists of
one or more datacenters with independent power, cooling, and networking. By placing VMs across zones,
the application becomes resilient to failures within a zone.
Networking and load balancing
Virtual network and subnets. Every Azure VM is deployed into a virtual network that can be
segmented into subnets. Create a separate subnet for each tier.
Application gateway . Application Gateway is a layer 7 load balancer. In this architecture, it routes HTTP
requests to the web front end. Application Gateway also provides a web application firewall (WAF) that
protects the application from common exploits and vulnerabilities.
Load balancers . Use Azure Standard Load Balancer to distribute network traffic from the web tier to the
business tier.
Network security groups (NSGs). Use NSGs to restrict network traffic within the virtual network. For
example, in the three-tier architecture shown here, the database tier does not accept traffic from the web
front end, only from the business tier and the management subnet.
DDoS Protection . Although the Azure platform provides basic protection against distributed denial of
service (DDoS) attacks, we recommend using DDoS Protection Standard, which has enhanced DDoS
mitigation features. See Security considerations.
Azure DNS . Azure DNS is a hosting service for DNS domains. It provides name resolution using Microsoft
Azure infrastructure. By hosting your domains in Azure, you can manage your DNS records using the same
credentials, APIs, tools, and billing as your other Azure services.
Virtual machines
Apache Cassandra database . Provides high availability at the data tier, by enabling replication and
failover.
OpsCenter . Deploy a monitoring solution such as DataStax OpsCenter to monitor the Cassandra cluster.
Jumpbox . Also called a bastion host. A secure VM on the network that administrators use to connect to
the other VMs. The jumpbox has an NSG that allows remote traffic only from public IP addresses on a safe
list. The NSG should permit remote desktop (RDP) traffic.

Recommendations
Your requirements might differ from the architecture described here. Use these recommendations as a starting
point.
Virtual machines
For recommendations on configuring the VMs, see Run a Linux VM on Azure.
Virtual network
When you create the virtual network, determine how many IP addresses your resources in each subnet require.
Specify a subnet mask and a network address range large enough for the required IP addresses, using CIDR
notation. Use an address space that falls within the standard private IP address blocks, which are 10.0.0.0/8,
172.16.0.0/12, and 192.168.0.0/16.
Choose an address range that doesn't overlap with your on-premises network, in case you need to set up a
gateway between the VNet and your on-premises network later. Once you create the VNet, you can't change the
address range.
Design subnets with functionality and security requirements in mind. All VMs within the same tier or role should
go into the same subnet, which can be a security boundary. For more information about designing VNets and
subnets, see Plan and design Azure Virtual Networks.
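A minimal Azure CLI sketch of this layout, assuming a hypothetical 10.0.0.0/16 address space and one subnet per tier; the names and ranges below are illustrative and are not the values used by the deployment scripts:

# Virtual network with one subnet per tier plus a management subnet for the jumpbox.
az network vnet create --resource-group my-rg --name ntier-vnet --address-prefixes 10.0.0.0/16
az network vnet subnet create --resource-group my-rg --vnet-name ntier-vnet --name web --address-prefixes 10.0.1.0/24
az network vnet subnet create --resource-group my-rg --vnet-name ntier-vnet --name business --address-prefixes 10.0.2.0/24
az network vnet subnet create --resource-group my-rg --vnet-name ntier-vnet --name database --address-prefixes 10.0.3.0/24
az network vnet subnet create --resource-group my-rg --vnet-name ntier-vnet --name management --address-prefixes 10.0.4.0/24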
Application Gateway
For information about configuring Application Gateway, see Application Gateway configuration overview.
Load balancers
Do not expose the VMs directly to the Internet. Instead, give each VM a private IP address. Clients connect using
the IP address associated with the Application Gateway.
Define load balancer rules to direct network traffic to the VMs. For example, to enable HTTP traffic, create a rule
that maps port 80 from the front-end configuration to port 80 on the back-end address pool. When a client sends
an HTTP request to port 80, the load balancer selects a back-end IP address by using a hashing algorithm that
includes the source IP address. Client requests are distributed across all the VMs.
Network security groups
Use NSG rules to restrict traffic between tiers. For example, in the three-tier architecture shown above, the web
tier does not communicate directly with the database tier. To enforce this, the database tier should block incoming
traffic from the web tier subnet.
1. Deny all inbound traffic from the VNet. (Use the VIRTUAL_NETWORK tag in the rule.)
2. Allow inbound traffic from the business tier subnet.
3. Allow inbound traffic from the database tier subnet itself. This rule allows communication between the
database VMs, which is needed for database replication and failover.
4. Allow ssh traffic (port 22) from the jumpbox subnet. This rule lets administrators connect to the database tier
from the jumpbox.
Create rules 2 – 4 with higher priority than the first rule, so they override it.
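The rules above could be expressed with the Azure CLI roughly as follows. This is a sketch with hypothetical NSG and subnet names, address ranges, and priorities; adjust them to match your own network layout.

# Rule 1: deny all inbound traffic from the virtual network.
az network nsg rule create --resource-group my-rg --nsg-name database-nsg --name deny-vnet-inbound \
    --priority 400 --direction Inbound --access Deny --protocol '*' \
    --source-address-prefixes VirtualNetwork --destination-port-ranges '*'

# Rule 2: allow inbound traffic from the business tier subnet.
az network nsg rule create --resource-group my-rg --nsg-name database-nsg --name allow-business-tier \
    --priority 100 --direction Inbound --access Allow --protocol '*' \
    --source-address-prefixes 10.0.2.0/24 --destination-port-ranges '*'

# Rule 3: allow inbound traffic from the database tier subnet itself (replication and failover).
az network nsg rule create --resource-group my-rg --nsg-name database-nsg --name allow-database-subnet \
    --priority 200 --direction Inbound --access Allow --protocol '*' \
    --source-address-prefixes 10.0.3.0/24 --destination-port-ranges '*'

# Rule 4: allow ssh from the management (jumpbox) subnet.
az network nsg rule create --resource-group my-rg --nsg-name database-nsg --name allow-ssh-from-jumpbox \
    --priority 300 --direction Inbound --access Allow --protocol Tcp \
    --source-address-prefixes 10.0.4.0/24 --destination-port-ranges 22

Because lower priority numbers take precedence, the three Allow rules (priorities 100-300) override the Deny rule at priority 400, which in turn overrides the default AllowVnetInBound rule at priority 65000.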
Cassandra
We recommend DataStax Enterprise for production use, but these recommendations apply to any Cassandra
edition. For more information on running DataStax in Azure, see DataStax Enterprise Deployment Guide for Azure.
Configure nodes in rack-aware mode. Map fault domains to racks in the cassandra-rackdc.properties file.
You don't need a load balancer in front of the cluster. The client connects directly to a node in the cluster.
The deployment scripts for this architecture use name resolution to initialize the seed node for intra-cluster
communication (gossip). To enable name resolution, the deployment creates an Azure Private DNS zone with A
records for the Cassandra nodes. Depending on your initialization scripts, you might be able to use the static IP
address instead.

NOTE
Azure Private DNS is currently in public preview.
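A hedged sketch of that name-resolution setup with the Azure CLI; the zone name, virtual network, record name, and IP address below are placeholders rather than values from the deployment scripts:

# Create a private DNS zone, link it to the virtual network, and add an A record for a seed node.
az network private-dns zone create --resource-group my-rg --name cassandra.internal
az network private-dns link vnet create --resource-group my-rg --zone-name cassandra.internal \
    --name cassandra-dns-link --virtual-network ntier-vnet --registration-enabled false
az network private-dns record-set a add-record --resource-group my-rg --zone-name cassandra.internal \
    --record-set-name seed1 --ipv4-address 10.0.3.10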

Jumpbox
Don't allow ssh access from the public Internet to the VMs that run the application workload. Instead, all ssh
access to these VMs must come through the jumpbox. An administrator logs into the jumpbox, and then logs into
the other VM from the jumpbox. The jumpbox allows ssh traffic from the Internet, but only from known, safe IP
addresses.
The jumpbox has minimal performance requirements, so select a small VM size. Create a public IP address for the
jumpbox. Place the jumpbox in the same VNet as the other VMs, but in a separate management subnet.
To secure the jumpbox, add an NSG rule that allows ssh connections only from a safe set of public IP addresses.
Configure the NSGs for the other subnets to allow ssh traffic from the management subnet.

Scalability considerations
Scale sets
For the web and business tiers, consider using virtual machine scale sets, instead of deploying separate VMs into
an availability set. A scale set makes it easy to deploy and manage a set of identical VMs, and autoscale the VMs
based on performance metrics. As the load on the VMs increases, additional VMs are automatically added to the
load balancer.
There are two basic ways to configure VMs deployed in a scale set:
Use extensions to configure the VM after it's deployed. With this approach, new VM instances may take
longer to start up than a VM with no extensions.
Deploy a managed disk with a custom disk image. This option may be quicker to deploy. However, it
requires you to keep the image up-to-date.
For more information, see Design considerations for scale sets.
TIP
When using any autoscale solution, test it with production-level workloads well in advance.
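As a hedged illustration of this approach, the following Azure CLI sketch creates a small Linux scale set and a simple CPU-based autoscale rule. The resource names, image, instance counts, and CPU threshold are placeholder choices and are not part of the reference deployment.

# Create a small Linux scale set (the command also provisions a default load balancer).
az vmss create --resource-group my-rg --name web-vmss --image UbuntuLTS \
    --instance-count 2 --vm-sku Standard_DS2_v2 --upgrade-policy-mode automatic \
    --admin-username azureuser --generate-ssh-keys

# Define an autoscale profile and a rule that adds two instances when average CPU exceeds 70%.
az monitor autoscale create --resource-group my-rg --resource web-vmss \
    --resource-type Microsoft.Compute/virtualMachineScaleSets --name web-autoscale \
    --min-count 2 --max-count 10 --count 2
az monitor autoscale rule create --resource-group my-rg --autoscale-name web-autoscale \
    --condition "Percentage CPU > 70 avg 5m" --scale out 2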

Subscription limits
Each Azure subscription has default limits in place, including a maximum number of VMs per region. You can
increase the limit by filing a support request. For more information, see Azure subscription and service limits,
quotas, and constraints.
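To check current usage against those limits before deploying, you can list the vCPU and VM quota for a region with the Azure CLI (the location is a placeholder):

az vm list-usage --location <location> --output table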
Application Gateway
Application Gateway supports fixed capacity mode or autoscaling mode. Fixed capacity mode is useful for
scenarios with consistent and predictable workloads. Consider using autoscaling mode for workloads with
variable traffic. For more information, see Autoscaling and Zone-redundant Application Gateway v2.

Performance considerations
To get the best performance from Cassandra on Azure VMs, see the recommendations in Run Apache Cassandra
on Azure VMs.

Availability considerations
Availability zones provide the best resiliency within a single region. If you need even higher availability, consider
replicating the application across two regions.
Not all regions support availability zones, and not all VM sizes are supported in all zones. Run the following
Azure CLI command to find the supported zones for each VM size within a region:

az vm list-skus --resource-type virtualMachines --zone false --location <location> \
    --query "[].{Name:name, Zones:locationInfo[].zones[] | join(',', @)}" -o table

If you deploy this architecture to a region that does not support availability zones, put the VMs for each tier inside
an availability set. VMs within the same availability set are deployed across multiple physical servers, compute racks,
storage units, and network switches for redundancy. Scale sets automatically use placement groups, which act as
an implicit availability set.
When deploying to availability zones, use the Standard SKU of Azure Load Balancer and the v2 SKU of Application
Gateway. These SKUs support cross-zone redundancy. For more information, see:
Standard Load Balancer and Availability Zones
Autoscaling and Zone-redundant Application Gateway v2
How does Application Gateway support high availability and scalability?
A single Application Gateway deployment can run multiple instances of the gateway. For production workloads,
run at least two instances.
Cassandra cluster
For the Cassandra cluster, the failover scenarios depend on the consistency levels used by the application and the
number of replicas. For consistency levels and usage in Cassandra, see Configuring data consistency and
Cassandra: How many nodes are talked to with Quorum? Data availability in Cassandra is determined by the
consistency level used by the application and the replication mechanism. For replication in Cassandra, see Data
Replication in NoSQL Databases Explained.
Health probes
Application Gateway and Load Balancer both use health probes to monitor the availability of VM instances.
Application Gateway always uses an HTTP probe.
Load Balancer can test either HTTP or TCP. Generally, if a VM runs an HTTP server, use an HTTP probe.
Otherwise, use TCP.
If a probe can't reach an instance within a timeout period, the gateway or load balancer stops sending traffic to
that VM. The probe continues to check and will return the VM to the back-end pool if the VM becomes available
again.
HTTP probes send an HTTP GET request to a specified path and listen for an HTTP 200 response. This path can be
the root path ("/"), or a health-monitoring endpoint that implements some custom logic to check the health of the
application. The endpoint must allow anonymous HTTP requests.
For more information about health probes, see:
Load Balancer health probes
Application Gateway health monitoring overview
For considerations about designing a health probe endpoint, see Health Endpoint Monitoring pattern.
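As a minimal sketch, a custom HTTP health probe for the business-tier load balancer could be defined with the Azure CLI as follows. The load balancer name and the /healthz path are hypothetical, and the application itself must implement the endpoint:

az network lb probe create --resource-group my-rg --lb-name business-lb --name http-health \
    --protocol Http --port 80 --path /healthz --interval 15 --threshold 2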

Cost considerations
Use the Azure Pricing Calculator to estimate costs. Here are some other considerations.
Virtual machine scale sets
Virtual machine scale sets are available on all Linux VM sizes. You are only charged for the Azure VMs you deploy,
as well as any additional underlying infrastructure resources consumed such as storage and networking. There
are no incremental charges for the virtual machine scale sets service itself.
For pricing options for single VMs, see Linux VM pricing.
Load balancers
You are charged only for the number of configured load-balancing and outbound rules. Inbound NAT rules are
free. There is no hourly charge for the Standard Load Balancer when no rules are configured.
For more information, see the cost section in Microsoft Azure Well-Architected Framework.

Security considerations
Virtual networks are a traffic isolation boundary in Azure. VMs in one VNet can't communicate directly with VMs
in a different VNet. VMs within the same VNet can communicate, unless you create network security groups
(NSGs) to restrict traffic. For more information, see Microsoft cloud services and network security.
For incoming Internet traffic, the load balancer rules define which traffic can reach the back end. However, load
balancer rules don't support IP safe lists, so if you want to add certain public IP addresses to a safe list, add an
NSG to the subnet.
DMZ . Consider adding a network virtual appliance (NVA) to create a DMZ between the Internet and the Azure
virtual network. NVA is a generic term for a virtual appliance that can perform network-related tasks, such as
firewall, packet inspection, auditing, and custom routing. For more information, see Implementing a DMZ
between Azure and the Internet.
Encryption. Encrypt sensitive data at rest and use Azure Key Vault to manage the database encryption keys. Key
Vault can store encryption keys in hardware security modules (HSMs). It's also recommended to store application
secrets, such as database connection strings, in Key Vault.
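A brief sketch of that approach with placeholder names; granting the application access to the vault (for example, through a managed identity and an access policy) is a separate step not shown here:

# Create a vault and store an application secret such as a database connection string.
az keyvault create --resource-group my-rg --name my-app-kv --location <location>
az keyvault secret set --vault-name my-app-kv --name CassandraConnectionString --value "<connection-string>"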
DDoS protection . The Azure platform provides basic DDoS protection by default. This basic protection is
targeted at protecting the Azure infrastructure as a whole. Although basic DDoS protection is automatically
enabled, we recommend using DDoS Protection Standard. Standard protection uses adaptive tuning, based on
your application's network traffic patterns, to detect threats. This allows it to apply mitigations against DDoS
attacks that might go unnoticed by the infrastructure-wide DDoS policies. Standard protection also provides
alerting, telemetry, and analytics through Azure Monitor. For more information, see Azure DDoS Protection: Best
practices and reference architectures.

Deploy the solution


A deployment for this reference architecture is available on GitHub.
If you specify a region that supports availability zones, the VMs are deployed into availability zones. Otherwise,
the VMs are deployed into availability sets. For a list of regions that support availability zones, see Services
support by region.
Prerequisites
1. Clone, fork, or download the zip file for the reference architectures GitHub repository.
2. Install Azure CLI 2.0.
3. Install Node and NPM
4. Install the Azure building blocks npm package.

npm install -g @mspnp/azure-building-blocks

5. From a command prompt, bash prompt, or PowerShell prompt, sign into your Azure account as follows:

az login

Deploy the solution using azbb


To deploy the Linux VMs for an N-tier application reference architecture, follow these steps:
1. Navigate to the virtual-machines\n-tier-linux folder for the repository you cloned in step 1 of the
prerequisites above.
2. The parameter file specifies a default administrator user name and password for each VM in the
deployment. Change these before you deploy the reference architecture. Open the n-tier-linux.json file
and replace each adminUsername and adminPassword field with your new settings. Save the file.
3. Deploy the reference architecture using the azbb tool as shown below.

azbb -s <your subscription_id> -g <your resource_group_name> -l <azure region> -p n-tier-linux.json --deploy

DevOps considerations
In this architecture, you use an Azure Building Blocks template to provision the Azure resources and their
dependencies. Because all the main resources and their dependencies are in the same virtual network, they are
isolated in the same basic workload. That isolation makes it easier to associate the workload's specific resources
with a DevOps team, so that the team can independently manage all aspects of those resources, and it enables
DevOps teams and services to perform continuous integration and continuous delivery (CI/CD).
You can also use different deployment templates and integrate them with Azure DevOps Services to provision
different environments in minutes, for example to replicate production-like scenarios or load-testing
environments only when needed, saving cost.
In this scenario, your virtual machines are configured by using Virtual Machine Extensions, because they offer the
possibility of installing certain additional software, such as Apache Cassandra. In particular, the Custom Script
Extension allows the download and execution of arbitrary code on a virtual machine, allowing unlimited
customization of the operating system of an Azure VM. VM extensions are installed and executed only at VM
creation time. If the operating system gets configured incorrectly at a later stage, it will require manual
intervention to move it back to its correct state. Configuration management tools can be used to address this
issue.
Consider using Azure Monitor to analyze and optimize the performance of your infrastructure, and to monitor and
diagnose networking issues without logging into your virtual machines. Application Insights is one of the
components of Azure Monitor, and it gives you rich metrics and logs to verify the state of your complete Azure
landscape. Azure Monitor helps you follow the state of your infrastructure.
Make sure not only to monitor your compute elements supporting your application code, but your data platform
as well, in particular your databases, since a low performance of the data tier of an application could have serious
consequences.
The Azure environment where the applications run should be version-controlled and deployed through the same
mechanisms as application code, so that it can be tested and validated by using DevOps testing practices as well.
For more information, see the Operational Excellence section in the Microsoft Azure Well-Architected Framework.

Next steps
Microsoft Learn module: Tour the N-tier architecture style
Ops automation using Event Grid
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Event Grid allows you to speed automation and simplify policy enforcement. For example, Event Grid can notify
Azure Automation when a virtual machine is created, or a SQL Database is spun up. These events can be used to
automatically check that service configurations are compliant, put metadata into operations tools, tag virtual
machines, or file work items.
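As a hedged example, the following Azure CLI command subscribes a webhook (for instance, an Azure Automation webhook) to successful resource-write events in a resource group. The subscription ID, resource group, and endpoint URL are placeholders:

az eventgrid event-subscription create --name ops-automation \
    --source-resource-id /subscriptions/<subscription-id>/resourceGroups/<resource-group> \
    --endpoint <automation-webhook-url> \
    --included-event-types Microsoft.Resources.ResourceWriteSuccess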

Architecture

Download an SVG of this architecture.


Oracle Database Migration to Azure
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Oracle DB migrations can be accomplished in multiple ways. This architecture covers one of these options wherein
Oracle Active Data Guard is used to migrate the Database. It is assumed that Oracle Data Guard (or Active Data
Guard) is used for HA/DR purposes. Depending on the application, either the application can be migrated first or
the database. In this case, the application is migrated to Azure using Azure Load Balancer. This enables you to split
your traffic between on-premises and Azure, allowing you to gradually migrate your application tier. The database
migration is performed in multiple steps. As a first step, Oracle Data Guard is used to set up a Secondary/Standby
Database in Azure. This allows you to migrate your data to Azure. Once the secondary in Azure is in-sync with the
primary, you can flip the database in Azure to be your primary database while maintaining your secondary on-
premises. As a next step, you may set up a secondary database in a different Availability Zone (or region) for HA/DR
purposes. At this point, you can decommission your on-premises environment. All data traffic between on-
premises and Azure flows over Azure ExpressRoute or Site-to-Site VPN connectivity.

Architecture

Download an SVG of this architecture.

Data Flow
1. Connect your Azure environment with your on-premises network via site-to-site VPN or ExpressRoute.
2. Use Azure Load Balancer to migrate and balance traffic between the on-prem AppServer and your Azure
AppServer.
3. Use DataGuard to mark your OracleDB1 in Azure as your active stand-by.
4. Switch your OracleDB1 in Azure as primary and set up your OracleDB2 in Azure as your standby to finish your
migration. NOTE: This method only works when migrating to and from the same OS version and DB version.
Assumption: customer is using DataGuard on-premises.
Personalization using Cosmos DB
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Generate personalized recommendations for customers in real time, using low-latency and tunable consistency
settings for immediate insights

Architecture

Download an SVG of this architecture.


Retail and e-commerce using Azure Database for
MySQL
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Build secure and scalable e-commerce solutions that meet the demands of both customers and business. Engage
customers through customized products and offers, process transactions quickly and securely, and focus on
fulfillment and customer service.

Architecture

Download an SVG of this architecture.


Retail and e-commerce using Azure Database for
PostgreSQL
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Build secure and scalable e-commerce solutions that meet the demands of both customers and business. Engage
customers through customized products and offers, process transactions quickly and securely, and focus on
fulfillment and customer service.

Architecture

Architecture diagram components: Browser; Azure App Services; Azure Database for PostgreSQL (product catalog and
session state); Azure Search (full-text index); Azure Storage (logs and static catalog content); Azure HDInsight
(analytics).
Retail and e-commerce using Cosmos DB
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Support in-depth queries over diverse product catalogs, traffic spikes, and rapidly changing inventory.

Architecture

Download an SVG of this architecture.


Reference architecture for Oracle Database on Azure
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

This solution architecture illustrates a canonical architecture to achieve high availability for your Oracle Database
Enterprise Edition in Azure. High availability for your front-end as well as the middle tier can be obtained by using
Azure Load Balancers or Application Gateways. An uptime availability of 99.99% for your database tier can be
achieved using a combination of Azure Availability Zones and Oracle Active DataGuard with FSFO. For additional
availability and/or Disaster Recovery, consider deploying another Database VM in a different Azure region and
schedule frequent RMAN backups.

Architecture

Download an SVG of this architecture.

Data Flow
1. The client system accesses a custom application with Oracle DB backend via the web.
2. Web front end is configured in a load balancer.
3. Web front end makes a call to the appropriate Application Server to handle the work.
4. Application server queries primary Oracle Database.
5. Oracle Database has been configured using a HyperThreaded Virtual Machine with multiple Premium storage-
based Managed Disks for performance and availability.
6. Oracle databases are replicated with Oracle DataGuard (or Active DataGuard) or Oracle GoldenGate for HA and
DR purposes.
7. Oracle databases are monitored for uptime and performance by Oracle Enterprise Manager. OEM also allows
you to generate various performance and usage reports.
Serverless apps using Cosmos DB
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Use Azure Functions and Azure Cosmos DB to build globally distributed, scalable serverless applications.

Architecture

Download an SVG of this architecture.


Streaming using HDInsight
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Ingest and process millions of streaming events per second with Apache Kafka, Apache Storm, and Apache Spark
Streaming.

Architecture

Download an SVG of this architecture.


Windows N-tier application on Azure with SQL
Server
11/13/2020 • 16 minutes to read • Edit Online

This reference architecture shows how to deploy virtual machines (VMs) and a virtual network configured for an
N-tier application, using SQL Server on Windows for the data tier. Deploy this solution.

Download a Visio file of this architecture.

Architecture
The architecture has the following components.
General
Resource group . Resource groups are used to group Azure resources so they can be managed by
lifetime, owner, or other criteria.
Availability zones . Availability zones are physical locations within an Azure region. Each zone consists of
one or more datacenters with independent power, cooling, and networking. By placing VMs across zones,
the application becomes resilient to failures within a zone.
Networking and load balancing
Virtual network and subnets. Every Azure VM is deployed into a virtual network that can be
segmented into subnets. Create a separate subnet for each tier.
Application gateway . Application Gateway is a layer 7 load balancer. In this architecture, it routes HTTP
requests to the web front end. Application Gateway also provides a web application firewall (WAF) that
protects the application from common exploits and vulnerabilities.
Load balancers . Use Azure Standard Load Balancer to distribute network traffic from the web tier to the
business tier, and from the business tier to SQL Server.
Network security groups (NSGs). Use NSGs to restrict network traffic within the virtual network. For
example, in the three-tier architecture shown here, the database tier does not accept traffic from the web
front end, only from the business tier and the management subnet.
DDoS Protection . Although the Azure platform provides basic protection against distributed denial of
service (DDoS) attacks, we recommend using DDoS Protection Standard, which has enhanced DDoS
mitigation features. See Security considerations.
Azure DNS . Azure DNS is a hosting service for DNS domains. It provides name resolution using
Microsoft Azure infrastructure. By hosting your domains in Azure, you can manage your DNS records
using the same credentials, APIs, tools, and billing as your other Azure services.
Virtual machines
SQL Server Always On Availability Group. Provides high availability at the data tier, by enabling
replication and failover. It uses Windows Server Failover Cluster (WSFC) technology for failover.
Active Directory Domain Services (AD DS) Servers. The computer objects for the failover cluster
and its associated clustered roles are created in Active Directory Domain Services (AD DS).
Cloud Witness. A failover cluster requires more than half of its nodes to be running, which is known as
having quorum. If the cluster has just two nodes, a network partition could cause each node to think it's
the primary node. In that case, you need a witness to break ties and establish quorum. A witness is a
resource such as a shared disk that can act as a tie breaker to establish quorum. Cloud Witness is a type of
witness that uses Azure Blob Storage. To learn more about the concept of quorum, see Understanding
cluster and pool quorum. For more information about Cloud Witness, see Deploy a Cloud Witness for a
Failover Cluster.
Jumpbox . Also called a bastion host. A secure VM on the network that administrators use to connect to
the other VMs. The jumpbox has an NSG that allows remote traffic only from public IP addresses on a safe
list. The NSG should permit remote desktop (RDP) traffic.

Recommendations
Your requirements might differ from the architecture described here. Use these recommendations as a starting
point.
Virtual machines
For recommendations on configuring the VMs, see Run a Windows VM on Azure.
Virtual network
When you create the virtual network, determine how many IP addresses your resources in each subnet require.
Specify a subnet mask and a network address range large enough for the required IP addresses, using CIDR
notation. Use an address space that falls within the standard private IP address blocks, which are 10.0.0.0/8,
172.16.0.0/12, and 192.168.0.0/16.
Choose an address range that does not overlap with your on-premises network, in case you need to set up a
gateway between the virtual network and your on-premises network later. Once you create the virtual network,
you can't change the address range.
Design subnets with functionality and security requirements in mind. All VMs within the same tier or role should
go into the same subnet, which can be a security boundary. For more information about designing virtual
networks and subnets, see Plan and design Azure Virtual Networks.
Application Gateway
For information about configuring Application Gateway, see Application Gateway configuration overview.
Load balancers
Don't expose the VMs directly to the Internet, but instead give each VM a private IP address. Clients connect
using the public IP address associated with the Application Gateway.
Define load balancer rules to direct network traffic to the VMs. For example, to enable HTTP traffic, map port 80
from the front-end configuration to port 80 on the back-end address pool. When a client sends an HTTP request
to port 80, the load balancer selects a back-end IP address by using a hashing algorithm that includes the source
IP address. Client requests are distributed across all the VMs in the back-end address pool.
Network security groups
Use NSG rules to restrict traffic between tiers. In the three-tier architecture shown above, the web tier does not
communicate directly with the database tier. To enforce this rule, the database tier should block incoming traffic
from the web tier subnet.
1. Deny all inbound traffic from the virtual network. (Use the VIRTUAL_NETWORK tag in the rule.)
2. Allow inbound traffic from the business tier subnet.
3. Allow inbound traffic from the database tier subnet itself. This rule allows communication between the
database VMs, which is needed for database replication and failover.
4. Allow RDP traffic (port 3389) from the jumpbox subnet. This rule lets administrators connect to the database
tier from the jumpbox.
Create rules 2 – 4 with higher priority than the first rule, so they override it.
SQL Server Always On Availability Groups
We recommend Always On Availability Groups for SQL Server high availability. Prior to Windows Server 2016,
Always On Availability Groups require a domain controller, and all nodes in the availability group must be in the
same AD domain.
Other tiers connect to the database through an availability group listener. The listener enables a SQL client to
connect without knowing the name of the physical instance of SQL Server. VMs that access the database must be
joined to the domain. The client (in this case, another tier) uses DNS to resolve the listener's virtual network
name into IP addresses.
Configure the SQL Server Always On Availability Group as follows:
1. Create a Windows Server Failover Clustering (WSFC) cluster, a SQL Server Always On Availability Group,
and a primary replica. For more information, see Getting Started with Always On Availability Groups.
2. Create an internal load balancer with a static private IP address.
3. Create an availability group listener, and map the listener's DNS name to the IP address of an internal load
balancer.
4. Create a load balancer rule for the SQL Server listening port (TCP port 1433 by default). The load balancer
rule must enable floating IP, also called Direct Server Return. This causes the VM to reply directly to the
client, which enables a direct connection to the primary replica.

NOTE
When floating IP is enabled, the front-end port number must be the same as the back-end port number in the
load balancer rule.
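Steps 2 and 4 could look roughly like the following Azure CLI sketch. The resource names, subnet, and static IP address are hypothetical, and creating the availability group listener itself (step 3) is done inside SQL Server rather than through the CLI:

# Internal Standard load balancer with a static private IP address in the database subnet.
az network lb create --resource-group my-rg --name sql-ilb --sku Standard \
    --vnet-name ntier-vnet --subnet database --private-ip-address 10.0.3.100 \
    --frontend-ip-name sql-listener-fe --backend-pool-name sql-nodes

# Rule for the default SQL Server port with floating IP (Direct Server Return) enabled.
az network lb rule create --resource-group my-rg --lb-name sql-ilb --name sql-listener-rule \
    --protocol Tcp --frontend-port 1433 --backend-port 1433 --floating-ip true \
    --frontend-ip-name sql-listener-fe --backend-pool-name sql-nodes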

When a SQL client tries to connect, the load balancer routes the connection request to the primary replica. If
there is a failover to another replica, the load balancer automatically routes new requests to a new primary
replica. For more information, see Configure an ILB listener for SQL Server Always On Availability Groups.
During a failover, existing client connections are closed. After the failover completes, new connections will be
routed to the new primary replica.
If your application makes significantly more reads than writes, you can offload some of the read-only queries to
a secondary replica. See Using a Listener to Connect to a Read-Only Secondary Replica (Read-Only Routing).
Test your deployment by forcing a manual failover of the availability group.
Jumpbox
Don't allow RDP access from the public Internet to the VMs that run the application workload. Instead, all RDP
access to these VMs should go through the jumpbox. An administrator logs into the jumpbox, and then logs into
the other VM from the jumpbox. The jumpbox allows RDP traffic from the Internet, but only from known, safe IP
addresses.
The jumpbox has minimal performance requirements, so select a small VM size. Create a public IP address for
the jumpbox. Place the jumpbox in the same virtual network as the other VMs, but in a separate management
subnet.
To secure the jumpbox, add an NSG rule that allows RDP connections only from a safe set of public IP addresses.
Configure the NSGs for the other subnets to allow RDP traffic from the management subnet.

Scalability considerations
Scale sets
For the web and business tiers, consider using virtual machine scale sets instead of deploying separate VMs. A
scale set makes it easy to deploy and manage a set of identical VMs, and autoscale the VMs based on
performance metrics. As the load on the VMs increases, additional VMs are automatically added to the load
balancer. Consider scale sets if you need to quickly scale out VMs, or need to autoscale.
There are two basic ways to configure VMs deployed in a scale set:
Use extensions to configure the VM after it's deployed. With this approach, new VM instances may take
longer to start up than a VM with no extensions.
Deploy a managed disk with a custom disk image. This option may be quicker to deploy. However, it
requires you to keep the image up-to-date.
For more information, see Design considerations for scale sets.

TIP
When using any autoscale solution, test it with production-level workloads well in advance.

Subscription limits
Each Azure subscription has default limits in place, including a maximum number of VMs per region. You can
increase the limit by filing a support request. For more information, see Azure subscription and service limits,
quotas, and constraints.
Application Gateway
Application Gateway supports fixed capacity mode or autoscaling mode. Fixed capacity mode is useful for
scenarios with consistent and predictable workloads. Consider using autoscaling mode for workloads with
variable traffic. For more information, see Autoscaling and Zone-redundant Application Gateway v2

Availability considerations
Availability zones provide the best resiliency within a single region. If you need even higher availability, consider
replicating the application across two regions, using Azure Traffic Manager for failover. For more information, see
Multi-region N-tier application for high availability.
Not all regions support availability zones, and not all VM sizes are supported in all zones. Run the following
Azure CLI command to find the supported zones for each VM size within a region:

az vm list-skus --resource-type virtualMachines --zone false --location <location> \
    --query "[].{Name:name, Zones:locationInfo[].zones[] | join(',', @)}" -o table

If you deploy this architecture to a region that does not support availability zones, put the VMs for each tier
inside an availability set. VMs within the same availability set are deployed across multiple physical servers,
compute racks, storage units, and network switches for redundancy. Scale sets automatically use placement
groups, which act as an implicit availability set.
When deploying to availability zones, use the Standard SKU of Azure Load Balancer and the v2 SKU of
Application Gateway. These SKUs support cross-zone redundancy. For more information, see:
Standard Load Balancer and Availability Zones
Autoscaling and Zone-redundant Application Gateway v2
How does Application Gateway support high availability and scalability?
A single Application Gateway deployment can run multiple instances of the gateway. For production workloads,
run at least two instances.
Health probes
Application Gateway and Load Balancer both use health probes to monitor the availability of VM instances.
Application Gateway always uses an HTTP probe.
Load Balancer can test either HTTP or TCP. Generally, if a VM runs an HTTP server, use an HTTP probe.
Otherwise, use TCP.
If a probe can't reach an instance within a timeout period, the gateway or load balancer stops sending traffic to
that VM. The probe continues to check and will return the VM to the back-end pool if the VM becomes available
again.
HTTP probes send an HTTP GET request to a specified path and listen for an HTTP 200 response. This path can be
the root path ("/"), or a health-monitoring endpoint that implements some custom logic to check the health of
the application. The endpoint must allow anonymous HTTP requests.
For more information about health probes, see:
Load Balancer health probes
Application Gateway health monitoring overview
For considerations about designing a health probe endpoint, see Health Endpoint Monitoring pattern.

Cost considerations
Use the Azure Pricing Calculator to estimate costs. Here are some other considerations.
Virtual machine scale sets
Virtual machine scale sets are available on all Windows VM sizes. You are only charged for the Azure VMs you
deploy and any additional underlying infrastructure resources consumed such as storage and networking. There
are no incremental charges for the virtual machine scale sets service.
For pricing options for single VMs, see Windows VM pricing.
SQL Server
If you choose Azure SQL Database (DBaaS), you can save on cost because you don't need to configure an Always On
availability group and domain controller machines. There are several deployment options, from a single database up
to a managed instance, or elastic pools. For more information, see Azure SQL pricing.
For pricing options for SQL Server VMs, see SQL Server VM pricing.
Load balancers
You are charged only for the number of configured load-balancing and outbound rules. Inbound NAT rules are
free. There is no hourly charge for the Standard Load Balancer when no rules are configured.
For more information, see the cost section in Microsoft Azure Well-Architected Framework.

Security considerations
Virtual networks are a traffic isolation boundary in Azure. By default, VMs in one virtual network can't
communicate directly with VMs in a different virtual network. However, you can explicitly connect virtual
networks by using virtual network peering.
NSGs . Use network security groups (NSGs) to restrict traffic to and from the internet. For more information, see
Microsoft cloud services and network security.
DMZ . Consider adding a network virtual appliance (NVA) to create a DMZ between the Internet and the Azure
virtual network. NVA is a generic term for a virtual appliance that can perform network-related tasks, such as
firewall, packet inspection, auditing, and custom routing. For more information, see Implementing a DMZ
between Azure and the Internet.
Encryption. Encrypt sensitive data at rest and use Azure Key Vault to manage the database encryption keys. Key
Vault can store encryption keys in hardware security modules (HSMs). For more information, see Configure
Azure Key Vault Integration for SQL Server on Azure VMs. It's also recommended to store application secrets,
such as database connection strings, in Key Vault.
DDoS protection . The Azure platform provides basic DDoS protection by default. This basic protection is
targeted at protecting the Azure infrastructure as a whole. Although basic DDoS protection is automatically
enabled, we recommend using DDoS Protection Standard. Standard protection uses adaptive tuning, based on
your application's network traffic patterns, to detect threats. This allows it to apply mitigations against DDoS
attacks that might go unnoticed by the infrastructure-wide DDoS policies. Standard protection also provides
alerting, telemetry, and analytics through Azure Monitor. For more information, see Azure DDoS Protection: Best
practices and reference architectures.

DevOps considerations
In this architecture, you use Azure Building Blocks templates to provision the Azure resources and their
dependencies. Because all the main resources and their dependencies are in the same virtual network, they are
isolated in the same basic workload. That isolation makes it easier to associate the workload's specific resources
with a team, so that the team can independently manage all aspects of those resources. This isolation enables
DevOps to perform continuous integration and continuous delivery (CI/CD).
You can also use different deployment templates and integrate them with Azure DevOps Services to provision
different environments in minutes, for example to replicate production-like scenarios or load-testing
environments only when needed, saving cost.
In this scenario, your virtual machines are configured by using Virtual Machine Extensions, because they offer the
possibility of installing certain additional software, such as anti-malware and security agents. VM extensions are
installed and executed only at VM creation time. If the operating system gets configured incorrectly at a later
stage, it will require manual intervention to move it back to its correct state.
Configuration Management Tools, in particular Desired State Configuration (DSC), are used in this architecture to
configure Active Directory and a SQL Server Always On Availability Group.
Consider using Azure Monitor to analyze and optimize the performance of your infrastructure, and to monitor and
diagnose networking issues without logging into your virtual machines. Application Insights is one of the
components of Azure Monitor, and it gives you rich metrics and logs to verify the state of your complete Azure
landscape. Azure Monitor helps you follow the state of your infrastructure.
Make sure not only to monitor your compute elements supporting your application code, but your data platform
as well, in particular your databases, since a low performance of the data tier of an application could have serious
consequences.
The Azure environment where the applications run should be version-controlled and deployed through the same
mechanisms as application code, so that it can be tested and validated by using DevOps testing practices as well.
For more information, see the Operational Excellence section in the Microsoft Azure Well-Architected Framework.

Deploy the solution


A deployment for this reference architecture is available on GitHub. The entire deployment can take up to an
hour, which includes running the scripts to configure AD DS, the Windows Server failover cluster, and the SQL
Server availability group.
If you specify a region that supports availability zones, the VMs are deployed into availability zones. Otherwise,
the VMs are deployed into availability sets. For a list of regions that support availability zones, see Services
support by region.
Prerequisites
1. Clone, fork, or download the zip file for the reference architectures GitHub repository.
2. Install Azure CLI 2.0.
3. Install Node and NPM
4. Install the Azure building blocks npm package.

npm install -g @mspnp/azure-building-blocks

5. From a command prompt, bash prompt, or PowerShell prompt, sign into your Azure account as follows:

az login

Deployment steps
1. Navigate to the virtual-machines\n-tier-windows folder of the reference architectures GitHub repository.
2. Open the n-tier-windows.json file.
3. In the n-tier-windows.jsonfile, search for all instances of [replace-with-password] and
[replace-with-safe-mode-password] and replace them with a strong password. Save the file.

NOTE
If you change the administrator user name, you must also update the extensions blocks in the JSON file.

4. Run the following command to deploy the architecture.


azbb -s <your subscription_id> -g <resource_group_name> -l <location> -p n-tier-windows.json --deploy

5. When the deployment is complete, open the Azure portal and navigate to the resource group. Find the
storage account that begins with 'sqlcw'. This is the storage account that will be used for the cluster's
cloud witness. Navigate into the storage account, select Access Keys , and copy the value of key1 . Also
copy the name of the storage account.
6. Open the n-tier-windows-sqlao.json file.
7. In the n-tier-windows-sqlao.json file, search for all instances of [replace-with-password] and
[replace-with-sql-password] and replace them with a strong password.

NOTE
If you change the administrator user name, you must also update the extensions blocks in the JSON file.

8. In the n-tier-windows-sqlao.json file, search for all instances of [replace-with-storageaccountname] and
[replace-with-storagekey] and replace them with the values from step 5. Save the file.
9. Run the following command to configure SQL Server Always On.

azbb -s <your subscription_id> -g <resource_group_name> -l <location> -p n-tier-windows-sqlao.json --deploy

Next steps
Microsoft Learn module: Tour the N-tier architecture style
Building microservices on Azure
11/2/2020 • 5 minutes to read • Edit Online

Microservices are a popular architectural style for building applications that are resilient, highly scalable,
independently deployable, and able to evolve quickly. But a successful microservices architecture requires a
different approach to designing and building applications.
A microservices architecture consists of a collection of small, autonomous services. Each service is self-contained
and should implement a single business capability.

What are microservices?


Microservices are small, independent, and loosely coupled. A single small team of developers can write and
maintain a service.
Each service is a separate codebase, which can be managed by a small development team.
Services can be deployed independently. A team can update an existing service without rebuilding and
redeploying the entire application.
Services are responsible for persisting their own data or external state. This differs from the traditional
model, where a separate data layer handles data persistence.
Services communicate with each other by using well-defined APIs. Internal implementation details of each
service are hidden from other services.
Services don't need to share the same technology stack, libraries, or frameworks.
Besides the services themselves, some other components appear in a typical microservices architecture:
Management/orchestration . This component is responsible for placing services on nodes, identifying failures,
rebalancing services across nodes, and so forth. Typically this component is an off-the-shelf technology such as
Kubernetes, rather than something custom built.
API Gateway . The API gateway is the entry point for clients. Instead of calling services directly, clients call the API
gateway, which forwards the call to the appropriate services on the back end.
Advantages of using an API gateway include:
It decouples clients from services. Services can be versioned or refactored without needing to update all of
the clients.
Services can use messaging protocols that are not web friendly, such as AMQP.
The API Gateway can perform other cross-cutting functions such as authentication, logging, SSL
termination, and load balancing.

Benefits
Agility. Because microservices are deployed independently, it's easier to manage bug fixes and feature
releases. You can update a service without redeploying the entire application, and roll back an update if
something goes wrong. In many traditional applications, if a bug is found in one part of the application, it
can block the entire release process. New features may be held up waiting for a bug fix to be integrated,
tested, and published.
Small, focused teams . A microservice should be small enough that a single feature team can build, test,
and deploy it. Small team sizes promote greater agility. Large teams tend to be less productive, because
communication is slower, management overhead goes up, and agility diminishes.
Small code base. In a monolithic application, there is a tendency over time for code dependencies to
become tangled. Adding a new feature requires touching code in a lot of places. By not sharing code or data
stores, a microservices architecture minimizes dependencies, and that makes it easier to add new features.
Mix of technologies . Teams can pick the technology that best fits their service, using a mix of technology
stacks as appropriate.
Fault isolation . If an individual microservice becomes unavailable, it won't disrupt the entire application,
as long as any upstream microservices are designed to handle faults correctly (for example, by
implementing circuit breaking).
Scalability . Services can be scaled independently, letting you scale out subsystems that require more
resources, without scaling out the entire application. Using an orchestrator such as Kubernetes or Service
Fabric, you can pack a higher density of services onto a single host, which allows for more efficient
utilization of resources.
Data isolation . It is much easier to perform schema updates, because only a single microservice is
affected. In a monolithic application, schema updates can become very challenging, because different parts
of the application may all touch the same data, making any alterations to the schema risky.

Challenges
The benefits of microservices don't come for free. Here are some of the challenges to consider before embarking
on a microservices architecture.
Complexity . A microservices application has more moving parts than the equivalent monolithic
application. Each service is simpler, but the entire system as a whole is more complex.
Development and testing . Writing a small service that relies on other dependent services requires a
different approach than writing a traditional monolithic or layered application. Existing tools are not
always designed to work with service dependencies. Refactoring across service boundaries can be difficult.
It is also challenging to test service dependencies, especially when the application is evolving quickly.
Lack of governance . The decentralized approach to building microservices has advantages, but it can
also lead to problems. You may end up with so many different languages and frameworks that the
application becomes hard to maintain. It may be useful to put some project-wide standards in place,
without overly restricting teams' flexibility. This especially applies to cross-cutting functionality such as
logging.
Network congestion and latency . The use of many small, granular services can result in more
interservice communication. Also, if the chain of service dependencies gets too long (service A calls B,
which calls C...), the additional latency can become a problem. You will need to design APIs carefully. Avoid
overly chatty APIs, think about serialization formats, and look for places to use asynchronous
communication patterns.
Data integrity. Each microservice is responsible for its own data persistence. As a result, data
consistency can be a challenge. Embrace eventual consistency where possible.
Management. Being successful with microservices requires a mature DevOps culture. Correlated logging
across services can be challenging. Typically, logging must correlate multiple service calls for a single user
operation.
Versioning . Updates to a service must not break services that depend on it. Multiple services could be
updated at any given time, so without careful design, you might have problems with backward or forward
compatibility.
Skillset . Microservices are highly distributed systems. Carefully evaluate whether the team has the skills
and experience to be successful.

Process for building a microservices architecture


The articles listed here present a structured approach for designing, building, and operating a microservices
architecture.
Domain analysis. To avoid some common pitfalls when designing microservices, use domain analysis to define
your microservice boundaries. Follow these steps:
1. Use domain analysis to model microservices.
2. Use tactical DDD to design microservices.
3. Identify microservice boundaries.
Design the services. Microservices require a different approach to designing and building applications. For
more information, see Designing a microservices architecture.
Operate in production . Because microservices architectures are distributed, you must have robust operations
for deployment and monitoring.
CI/CD for microservices architectures
Build a CI/CD pipeline for microservices on Kubernetes
Monitor microservices running on Azure Kubernetes Service (AKS)

Microservices reference architectures for Azure


Microservices architecture on Azure Kubernetes Service (AKS)
Microservices architecture on Azure Service Fabric
Using domain analysis to model microservices
11/2/2020 • 8 minutes to read • Edit Online

One of the biggest challenges of microservices is to define the boundaries of individual services. The general rule
is that a service should do "one thing" — but putting that rule into practice requires careful thought. There is no
mechanical process that will produce the "right" design. You have to think deeply about your business domain,
requirements, and goals. Otherwise, you can end up with a haphazard design that exhibits some undesirable
characteristics, such as hidden dependencies between services, tight coupling, or poorly designed interfaces. This
article shows a domain-driven approach to designing microservices.
This article uses a drone delivery service as a running example. You can read more about the scenario and the
corresponding reference implementation here.

Introduction
Microservices should be designed around business capabilities, not horizontal layers such as data access or
messaging. In addition, they should have loose coupling and high functional cohesion. Microservices are loosely
coupled if you can change one service without requiring other services to be updated at the same time. A
microservice is cohesive if it has a single, well-defined purpose, such as managing user accounts or tracking
delivery history. A service should encapsulate domain knowledge and abstract that knowledge from clients. For
example, a client should be able to schedule a drone without knowing the details of the scheduling algorithm or
how the drone fleet is managed.
Domain-driven design (DDD) provides a framework that can get you most of the way to a set of well-designed
microservices. DDD has two distinct phases, strategic and tactical. In strategic DDD, you are defining the large-
scale structure of the system. Strategic DDD helps to ensure that your architecture remains focused on business
capabilities. Tactical DDD provides a set of design patterns that you can use to create the domain model. These
patterns include entities, aggregates, and domain services. These tactical patterns will help you to design
microservices that are both loosely coupled and cohesive.

In this article and the next, we'll walk through the following steps, applying them to the Drone Delivery application:
1. Start by analyzing the business domain to understand the application's functional requirements. The output
of this step is an informal description of the domain, which can be refined into a more formal set of domain
models.
2. Next, define the bounded contexts of the domain. Each bounded context contains a domain model that
represents a particular subdomain of the larger application.
3. Within a bounded context, apply tactical DDD patterns to define entities, aggregates, and domain services.
4. Use the results from the previous step to identify the microservices in your application.
In this article, we cover the first three steps, which are primarily concerned with DDD. In the next article, we'll
identify the microservices. However, it's important to remember that DDD is an iterative, ongoing process. Service
boundaries aren't fixed in stone. As an application evolves, you may decide to break apart a service into several
smaller services.
NOTE
This article doesn't show a complete and comprehensive domain analysis. We deliberately kept the example brief, to illustrate
the main points. For more background on DDD, we recommend Eric Evans' Domain-Driven Design, the book that first
introduced the term. Another good reference is Implementing Domain-Driven Design by Vaughn Vernon.

Scenario: Drone delivery


Fabrikam, Inc. is starting a drone delivery service. The company manages a fleet of drone aircraft. Businesses
register with the service, and users can request a drone to pick up goods for delivery. When a customer schedules
a pickup, a backend system assigns a drone and notifies the user with an estimated delivery time. While the
delivery is in progress, the customer can track the location of the drone, with a continuously updated ETA.
This scenario involves a fairly complicated domain. Some of the business concerns include scheduling drones,
tracking packages, managing user accounts, and storing and analyzing historical data. Moreover, Fabrikam wants
to get to market quickly and then iterate quickly, adding new functionality and capabilities. The application needs
to operate at cloud scale, with a high service level objective (SLO). Fabrikam also expects that different parts of the
system will have very different requirements for data storage and querying. All of these considerations lead
Fabrikam to choose a microservices architecture for the Drone Delivery application.

Analyze the domain


Using a DDD approach will help you to design microservices so that every service forms a natural fit to a
functional business requirement. It can help you to avoid the trap of letting organizational boundaries or
technology choices dictate your design.
Before writing any code, you need a bird's eye view of the system that you are creating. DDD starts by modeling
the business domain and creating a domain model. The domain model is an abstract model of the business
domain. It distills and organizes domain knowledge, and provides a common language for developers and domain
experts.
Start by mapping all of the business functions and their connections. This will likely be a collaborative effort that
involves domain experts, software architects, and other stakeholders. You don't need to use any particular
formalism. Sketch a diagram or draw on a whiteboard.
As you fill in the diagram, you may start to identify discrete subdomains. Which functions are closely related?
Which functions are core to the business, and which provide ancillary services? What is the dependency graph?
During this initial phase, you aren't concerned with technologies or implementation details. That said, you should
note the place where the application will need to integrate with external systems, such as CRM, payment
processing, or billing systems.

Example: Drone delivery application


After some initial domain analysis, the Fabrikam team came up with a rough sketch that depicts the Drone
Delivery domain.

Shipping is placed in the center of the diagram, because it's core to the business. Everything else in the
diagram exists to enable this functionality.
Drone management is also core to the business. Functionality that is closely related to drone management
includes drone repair and using predictive analysis to predict when drones need servicing and
maintenance.
ETA analysis provides time estimates for pickup and delivery.
Third-party transportation will enable the application to schedule alternative transportation methods if a
package cannot be shipped entirely by drone.
Drone sharing is a possible extension of the core business. The company may have excess drone capacity
during certain hours, and could rent out drones that would otherwise be idle. This feature will not be in the
initial release.
Video surveillance is another area that the company might expand into later.
User accounts , Invoicing , and Call center are subdomains that support the core business.
Notice that at this point in the process, we haven't made any decisions about implementation or technologies.
Some of the subsystems may involve external software systems or third-party services. Even so, the application
needs to interact with these systems and services, so it's important to include them in the domain model.

NOTE
When an application depends on an external system, there is a risk that the external system's data schema or API will leak
into your application, ultimately compromising the architectural design. This is particularly true with legacy systems that may
not follow modern best practices, and may use convoluted data schemas or obsolete APIs. In that case, it's important to
have a well-defined boundary between these external systems and the application. Consider using the Strangler pattern or
the Anti-Corruption Layer pattern for this purpose.

Define bounded contexts


The domain model will include representations of real things in the world — users, drones, packages, and so forth.
But that doesn't mean that every part of the system needs to use the same representations for the same things.
For example, subsystems that handle drone repair and predictive analysis will need to represent many physical
characteristics of drones, such as their maintenance history, mileage, age, model number, performance
characteristics, and so on. But when it's time to schedule a delivery, we don't care about those things. The
scheduling subsystem only needs to know whether a drone is available, and the ETA for pickup and delivery.
If we tried to create a single model for both of these subsystems, it would be unnecessarily complex. It would also
become harder for the model to evolve over time, because any changes will need to satisfy multiple teams
working on separate subsystems. Therefore, it's often better to design separate models that represent the same
real-world entity (in this case, a drone) in two different contexts. Each model contains only the features and
attributes that are relevant within its particular context.
This is where the DDD concept of bounded contexts comes into play. A bounded context is simply the boundary
within a domain where a particular domain model applies. Looking at the previous diagram, we can group
functionality according to whether various functions will share a single domain model.

Bounded contexts are not necessarily isolated from one another. In this diagram, the solid lines connecting the
bounded contexts represent places where two bounded contexts interact. For example, Shipping depends on User
Accounts to get information about customers, and on Drone Management to schedule drones from the fleet.
In the book Domain Driven Design, Eric Evans describes several patterns for maintaining the integrity of a domain
model when it interacts with another bounded context. One of the main principles of microservices is that services
communicate through well-defined APIs. This approach corresponds to two patterns that Evans calls Open Host
Service and Published Language. The idea of Open Host Service is that a subsystem defines a formal protocol
(API) for other subsystems to communicate with it. Published Language extends this idea by publishing the API in
a form that other teams can use to write clients. In the article Designing APIs for microservices, we discuss using
OpenAPI Specification (formerly known as Swagger) to define language-agnostic interface descriptions for REST
APIs, expressed in JSON or YAML format.
For the rest of this journey, we will focus on the Shipping bounded context.

Next steps
After completing a domain analysis, the next step is to apply tactical DDD, to define your domain models with
more precision.
Tactical DDD
Using tactical DDD to design microservices

During the strategic phase of DDD, you are mapping out the business domain and defining bounded contexts for
your domain models. Tactical DDD is when you define your domain models with more precision. The tactical
patterns are applied within a single bounded context. In a microservices architecture, we are particularly interested
in the entity and aggregate patterns. Applying these patterns will help us to identify natural boundaries for the
services in our application (see the next article in this series). As a general principle, a microservice should be no
smaller than an aggregate, and no larger than a bounded context. First, we'll review the tactical patterns. Then we'll
apply them to the Shipping bounded context in the Drone Delivery application.

Overview of the tactical patterns


This section provides a brief summary of the tactical DDD patterns, so if you are already familiar with DDD, you
can probably skip this section. The patterns are described in more detail in chapters 5 – 6 of Eric Evans' book, and
in Implementing Domain-Driven Design by Vaughn Vernon.

Entities . An entity is an object with a unique identity that persists over time. For example, in a banking application,
customers and accounts would be entities.
An entity has a unique identifier in the system, which can be used to look up or retrieve the entity. That doesn't
mean the identifier is always exposed directly to users. It could be a GUID or a primary key in a database.
An identity may span multiple bounded contexts, and may endure beyond the lifetime of the application. For
example, bank account numbers or government-issued IDs are not tied to the lifetime of a particular
application.
The attributes of an entity may change over time. For example, a person's name or address might change, but
they are still the same person.
An entity can hold references to other entities.
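To make the idea concrete, here is a minimal entity sketch in TypeScript. The Drone class and its fields are illustrative assumptions, not code from the reference implementation; the key point is that identity, rather than attribute values, determines whether two references are "the same" entity.

// Illustrative entity sketch: the identifier is stable even as attributes change.
export class Drone {
    constructor(
        public readonly id: string,    // unique identity, e.g. a GUID or database key
        public model: string,          // attributes may change over time
        public status: string
    ) {}

    // Two references are "the same drone" if they share the same identity.
    sameIdentityAs(other: Drone): boolean {
        return this.id === other.id;
    }
}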
Value objects . A value object has no identity. It is defined only by the values of its attributes. Value objects are
also immutable. To update a value object, you always create a new instance to replace the old one. Value objects
can have methods that encapsulate domain logic, but those methods should have no side-effects on the object's
state. Typical examples of value objects include colors, dates and times, and currency values.
Aggregates . An aggregate defines a consistency boundary around one or more entities. Exactly one entity in an
aggregate is the root. Lookup is done using the root entity's identifier. Any other entities in the aggregate are
children of the root, and are referenced by following pointers from the root.
The purpose of an aggregate is to model transactional invariants. Things in the real world have complex webs of
relationships. Customers create orders, orders contain products, products have suppliers, and so on. If the
application modifies several related objects, how does it guarantee consistency? How do we keep track of
invariants and enforce them?
Traditional applications have often used database transactions to enforce consistency. In a distributed application,
however, that's often not feasible. A single business transaction may span multiple data stores, or may be long
running, or may involve third-party services. Ultimately it's up to the application, not the data layer, to enforce the
invariants required for the domain. That's what aggregates are meant to model.

NOTE
An aggregate might consist of a single entity, without child entities. What makes it an aggregate is the transactional
boundary.

Domain and application services . In DDD terminology, a service is an object that implements some logic
without holding any state. Evans distinguishes between domain services, which encapsulate domain logic, and
application services, which provide technical functionality, such as user authentication or sending an SMS
message. Domain services are often used to model behavior that spans multiple entities.

NOTE
The term service is overloaded in software development. The definition here is not directly related to microservices.

Domain events . Domain events can be used to notify other parts of the system when something happens. As the
name suggests, domain events should mean something within the domain. For example, "a record was inserted
into a table" is not a domain event. "A delivery was cancelled" is a domain event. Domain events are especially
relevant in a microservices architecture. Because microservices are distributed and don't share data stores, domain
events provide a way for microservices to coordinate with each other. The article Interservice communication
discusses asynchronous messaging in more detail.
There are a few other DDD patterns not listed here, including factories, repositories, and modules. These patterns
can be useful when you are implementing a microservice, but they are less relevant when designing the
boundaries between microservices.

Drone delivery: Applying the patterns


We start with the scenarios that the Shipping bounded context must handle.
A customer can request a drone to pick up goods from a business that is registered with the drone delivery
service.
The sender generates a tag (barcode or RFID) to put on the package.
A drone will pick up and deliver a package from the source location to the destination location.
When a customer schedules a delivery, the system provides an ETA based on route information, weather
conditions, and historical data.
When the drone is in flight, a user can track the current location and the latest ETA.
Until a drone has picked up the package, the customer can cancel a delivery.
The customer is notified when the delivery is completed.
The sender can request delivery confirmation from the customer, in the form of a signature or fingerprint.
Users can look up the history of a completed delivery.
From these scenarios, the development team identified the following entities .
Delivery
Package
Drone
Account
Confirmation
Notification
Tag
The first four, Delivery, Package, Drone, and Account, are all aggregates that represent transactional consistency
boundaries. Confirmations and Notifications are child entities of Deliveries, and Tags are child entities of Packages.
The value objects in this design include Location, ETA, PackageWeight, and PackageSize.
To illustrate, here is a UML diagram of the Delivery aggregate. Notice that it holds references to other aggregates,
including Account, Package, and Drone.
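The UML diagram is not reproduced here, but a rough code sketch conveys the same idea. The field names and status values below are assumptions for illustration, not the reference implementation; note that the other aggregates are referenced only by their identities.

// Illustrative sketch of the Delivery aggregate root (field names are assumed).
type GeoLocation = { latitude: number; longitude: number };

export class Delivery {
    constructor(
        public readonly id: string,
        public readonly accountId: string,   // reference to the Account aggregate by identity
        public readonly packageId: string,   // reference to the Package aggregate by identity
        public droneId: string | null,       // assigned Drone aggregate, if any
        public pickup: GeoLocation,
        public dropoff: GeoLocation,
        public status: 'pending' | 'inFlight' | 'completed' | 'cancelled' = 'pending'
    ) {}

    // Invariants that involve only this aggregate are enforced inside it.
    complete(): void {
        if (this.status !== 'inFlight') {
            throw new Error('Only an in-flight delivery can be completed');
        }
        this.status = 'completed';
    }
}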

There are two domain events:


While a drone is in flight, the Drone entity sends DroneStatus events that describe the drone's location and
status (in-flight, landed).
The Delivery entity sends DeliveryTracking events whenever the stage of a delivery changes. These include
DeliveryCreated, DeliveryRescheduled, DeliveryHeadedToDropoff, and DeliveryCompleted.
Notice that these events describe things that are meaningful within the domain model. They describe something
about the domain, and aren't tied to a particular programming language construct.
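One lightweight way to express these events in code is as plain data types. The shapes below are assumptions for illustration; the actual event contracts in the reference implementation may differ.

// Illustrative event shapes; field names are assumed for this sketch.
interface DroneStatusEvent {
    kind: 'DroneStatus';
    droneId: string;
    location: { latitude: number; longitude: number };
    status: 'in-flight' | 'landed';
    occurredAt: string;   // ISO 8601 timestamp
}

type DeliveryStage =
    | 'DeliveryCreated'
    | 'DeliveryRescheduled'
    | 'DeliveryHeadedToDropoff'
    | 'DeliveryCompleted';

interface DeliveryTrackingEvent {
    kind: DeliveryStage;
    deliveryId: string;
    occurredAt: string;
}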
The development team identified one more area of functionality, which doesn't fit neatly into any of the entities
described so far. Some part of the system must coordinate all of the steps involved in scheduling or updating a
delivery. Therefore, the development team added two domain services to the design: a Scheduler that
coordinates the steps, and a Supervisor that monitors the status of each step, in order to detect whether any steps
have failed or timed out. This is a variation of the Scheduler Agent Supervisor pattern.

Next steps
The next step is to define the boundaries for each microservice.
Identify microservice boundaries
Identifying microservice boundaries

What is the right size for a microservice? You often hear something to the effect of, "not too big and not too small"
— and while that's certainly correct, it's not very helpful in practice. But if you start from a carefully designed
domain model, it's much easier to reason about microservices.

This article uses a drone delivery service as a running example. You can read more about the scenario and the
corresponding reference implementation here.

From domain model to microservices


In the previous article, we defined a set of bounded contexts for a Drone Delivery application. Then we looked
more closely at one of these bounded contexts, the Shipping bounded context, and identified a set of entities,
aggregates, and domain services for that bounded context.
Now we're ready to go from domain model to application design. Here's an approach that you can use to derive
microservices from the domain model.
1. Start with a bounded context. In general, the functionality in a microservice should not span more than one
bounded context. By definition, a bounded context marks the boundary of a particular domain model. If you
find that a microservice mixes different domain models together, that's a sign that you may need to go back
and refine your domain analysis.
2. Next, look at the aggregates in your domain model. Aggregates are often good candidates for
microservices. A well-designed aggregate exhibits many of the characteristics of a well-designed
microservice, such as:
An aggregate is derived from business requirements, rather than technical concerns such as data access
or messaging.
An aggregate should have high functional cohesion.
An aggregate is a boundary of persistence.
Aggregates should be loosely coupled.
3. Domain services are also good candidates for microservices. Domain services are stateless operations
across multiple aggregates. A typical example is a workflow that involves several microservices. We'll see an
example of this in the Drone Delivery application.
4. Finally, consider non-functional requirements. Look at factors such as team size, data types, technologies,
scalability requirements, availability requirements, and security requirements. These factors may lead you to
further decompose a microservice into two or more smaller services, or do the opposite and combine
several microservices into one.
After you identify the microservices in your application, validate your design against the following criteria:
Each service has a single responsibility.
There are no chatty calls between services. If splitting functionality into two services causes them to be overly
chatty, it may be a symptom that these functions belong in the same service.
Each service is small enough that it can be built by a small team working independently.
There are no inter-dependencies that will require two or more services to be deployed in lock-step. It should
always be possible to deploy a service without redeploying any other services.
Services are not tightly coupled, and can evolve independently.
Your service boundaries will not create problems with data consistency or integrity. Sometimes it's important to
maintain data consistency by putting functionality into a single microservice. That said, consider whether you
really need strong consistency. There are strategies for addressing eventual consistency in a distributed system,
and the benefits of decomposing services often outweigh the challenges of managing eventual consistency.
Above all, it's important to be pragmatic, and remember that domain-driven design is an iterative process. When
in doubt, start with more coarse-grained microservices. Splitting a microservice into two smaller services is easier
than refactoring functionality across several existing microservices.

Example: Defining microservices for the Drone Delivery application


Recall that the development team had identified the four aggregates — Delivery, Package, Drone, and Account —
and two domain services, Scheduler and Supervisor.
Delivery and Package are obvious candidates for microservices. The Scheduler and Supervisor coordinate the
activities performed by other microservices, so it makes sense to implement these domain services as
microservices.
Drone and Account are interesting because they belong to other bounded contexts. One option is for the
Scheduler to call the Drone and Account bounded contexts directly. Another option is to create Drone and Account
microservices inside the Shipping bounded context. These microservices would mediate between the bounded
contexts, by exposing APIs or data schemas that are more suited to the Shipping context.
The details of the Drone and Account bounded contexts are beyond the scope of this guidance, so we created
mock services for them in our reference implementation. But here are some factors to consider in this situation:
What is the network overhead of calling directly into the other bounded context?
Is the data schema for the other bounded context suitable for this context, or is it better to have a schema
that's tailored to this bounded context?
Is the other bounded context a legacy system? If so, you might create a service that acts as an anti-
corruption layer to translate between the legacy system and the modern application.
What is the team structure? Is it easy to communicate with the team that's responsible for the other
bounded context? If not, creating a service that mediates between the two contexts can help to mitigate the
cost of cross-team communication.
So far, we haven't considered any non-functional requirements. Thinking about the application's throughput
requirements, the development team decided to create a separate Ingestion microservice that is responsible for
ingesting client requests. This microservice will implement load leveling by putting incoming requests into a buffer
for processing. The Scheduler will read the requests from the buffer and execute the workflow.
Non-functional requirements led the team to create one additional service. All of the services so far have been
about the process of scheduling and delivering packages in real time. But the system also needs to store the
history of every delivery in long-term storage for data analysis. The team considered making this the
responsibility of the Delivery service. However, the data storage requirements are quite different for historical
analysis versus in-flight operations (see Data considerations). Therefore, the team decided to create a separate
Delivery History service, which will listen for DeliveryTracking events from the Delivery service and write the
events into long-term storage.
The following diagram shows the design at this point:

Next steps
At this point, you should have a clear understanding of the purpose and functionality of each microservice in your
design. Now you can architect the system.
Design a microservices architecture
Designing a microservices architecture

Microservices have become a popular architectural style for building cloud applications that are resilient, highly
scalable, independently deployable, and able to evolve quickly. To be more than just a buzzword, however,
microservices require a different approach to designing and building applications.
In this set of articles, we explore how to build a microservices architecture on Azure. Topics include:
Compute options for microservices
Interservice communication
API design
API gateways
Data considerations
Design patterns

Prerequisites
Before reading these articles, you might start with the following:
Introduction to microservices architectures. Understand the benefits and challenges of microservices, and
when to use this style of architecture.
Using domain analysis to model microservices. Learn a domain-driven approach to modeling microservices.

Reference implementation
To illustrate best practices for a microservices architecture, we created a reference implementation that we call the
Drone Delivery application. This implementation runs on Kubernetes using Azure Kubernetes Service (AKS). You
can find the reference implementation on GitHub.

Scenario
Fabrikam, Inc. is starting a drone delivery service. The company manages a fleet of drone aircraft. Businesses
register with the service, and users can request a drone to pick up goods for delivery. When a customer schedules
a pickup, a backend system assigns a drone and notifies the user with an estimated delivery time. While the
delivery is in progress, the customer can track the location of the drone, with a continuously updated ETA.
This scenario involves a fairly complicated domain. Some of the business concerns include scheduling drones,
tracking packages, managing user accounts, and storing and analyzing historical data. Moreover, Fabrikam wants
to get to market quickly and then iterate quickly, adding new functionality and capabilities. The application needs
to operate at cloud scale, with a high service level objective (SLO). Fabrikam also expects that different parts of
the system will have very different requirements for data storage and querying. All of these considerations lead
Fabrikam to choose a microservices architecture for the Drone Delivery application.

NOTE
For help in choosing between a microservices architecture and other architectural styles, see the Azure Application
Architecture Guide.

Our reference implementation uses Kubernetes with Azure Kubernetes Service (AKS). However, many of the high-
level architectural decisions and challenges will apply to any container orchestrator, including Azure Service
Fabric.

Next steps
Choose a compute option
Choosing an Azure compute option for microservices

The term compute refers to the hosting model for the computing resources that your application runs on. For a
microservices architecture, two approaches are especially popular:
A service orchestrator that manages services running on dedicated nodes (VMs).
A serverless architecture using functions as a service (FaaS).
While these aren't the only options, they are both proven approaches to building microservices. An application
might include both approaches.

Service orchestrators
An orchestrator handles tasks related to deploying and managing a set of services. These tasks include placing
services on nodes, monitoring the health of services, restarting unhealthy services, load balancing network traffic
across service instances, service discovery, scaling the number of instances of a service, and applying
configuration updates. Popular orchestrators include Kubernetes, Service Fabric, DC/OS, and Docker Swarm.
On the Azure platform, consider the following options:
Azure Kubernetes Service (AKS) is a managed Kubernetes service. AKS provisions Kubernetes and exposes
the Kubernetes API endpoints, but hosts and manages the Kubernetes control plane, performing automated
upgrades, automated patching, autoscaling, and other management tasks. You can think of AKS as being
"Kubernetes APIs as a service."
Service Fabric is a distributed systems platform for packaging, deploying, and managing microservices.
Microservices can be deployed to Service Fabric as containers, as binary executables, or as Reliable Services.
Using the Reliable Services programming model, services can directly use Service Fabric programming APIs
to query the system, report health, receive notifications about configuration and code changes, and discover
other services. A key differentiation with Service Fabric is its strong focus on building stateful services using
Reliable Collections.
Other options such as Docker Enterprise Edition and Mesosphere DC/OS can run in an IaaS environment on
Azure. You can find deployment templates on Azure Marketplace.

Containers
Sometimes people talk about containers and microservices as if they were the same thing. While that's not true —
you don't need containers to build microservices — containers do have some benefits that are particularly relevant
to microservices, such as:
Portability . A container image is a standalone package that runs without needing to install libraries or
other dependencies. That makes them easy to deploy. Containers can be started and stopped quickly, so you
can spin up new instances to handle more load or to recover from node failures.
Density . Containers are lightweight compared with running a virtual machine, because they share OS
resources. That makes it possible to pack multiple containers onto a single node, which is especially useful
when the application consists of many small services.
Resource isolation . You can limit the amount of memory and CPU that is available to a container, which
can help to ensure that a runaway process doesn't exhaust the host resources. See the Bulkhead pattern for
more information.
Serverless (Functions as a Service)
With a serverless architecture, you don't manage the VMs or the virtual network infrastructure. Instead, you deploy
code and the hosting service handles putting that code onto a VM and executing it. This approach tends to favor
small granular functions that are coordinated using event-based triggers. For example, a message being placed
onto a queue might trigger a function that reads from the queue and processes the message.
Azure Functions is a serverless compute service that supports various function triggers, including HTTP requests,
Service Bus queues, and Event Hubs events. For a complete list, see Azure Functions triggers and bindings
concepts. Also consider Azure Event Grid, which is a managed event routing service in Azure.
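As a minimal sketch, an Event Hubs-triggered function in TypeScript might look like the following. It assumes the classic Node.js programming model, with a trigger binding named events declared in the function's function.json with cardinality "many"; the event shape and what the function does with each event are also assumptions.

import { AzureFunction, Context } from "@azure/functions";

// Sketch of an Event Hubs-triggered function. The "events" binding is assumed to be
// declared in function.json with cardinality "many", so the function receives a batch.
const deliveryHistoryHandler: AzureFunction = async function (
    context: Context,
    events: unknown[]
): Promise<void> {
    for (const event of events) {
        // A real function might write each event to long-term storage, as the
        // Delivery History service does; here we only log it.
        context.log(`Received event: ${JSON.stringify(event)}`);
    }
};

export default deliveryHistoryHandler;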

Orchestrator or serverless?
Here are some factors to consider when choosing between an orchestrator approach and a serverless approach.
Manageability . A serverless application is easy to manage, because the platform manages all of the compute
resources for you. While an orchestrator abstracts some aspects of managing and configuring a cluster, it does not
completely hide the underlying VMs. With an orchestrator, you will need to think about issues such as load
balancing, CPU and memory usage, and networking.
Flexibility and control . An orchestrator gives you a great deal of control over configuring and managing your
services and the cluster. The tradeoff is additional complexity. With a serverless architecture, you give up some
degree of control because these details are abstracted.
Portability . All of the orchestrators listed here (Kubernetes, DC/OS, Docker Swarm, and Service Fabric) can run
on-premises or in multiple public clouds.
Application integration . It can be challenging to build a complex application using a serverless architecture, due
to the need to coordinate, deploy, and manage many small independent functions. One option in Azure is to use
Azure Logic Apps to coordinate a set of Azure Functions. For an example of this approach, see Create a function
that integrates with Azure Logic Apps.
Cost . With an orchestrator, you pay for the VMs that are running in the cluster. With a serverless application, you
pay only for the actual compute resources consumed. In both cases, you need to factor in the cost of any additional
services, such as storage, databases, and messaging services.
Scalability . Azure Functions scales automatically to meet demand, based on the number of incoming events. With
an orchestrator, you can scale out by increasing the number of service instances running in the cluster. You can also
scale by adding additional VMs to the cluster.
Our reference implementation primarily uses Kubernetes, but we did use Azure Functions for one service, namely
the Delivery History service. Azure Functions was a good fit for this particular service, because it's an event-
driven workload. By using an Event Hubs trigger to invoke the function, the service needed a minimal amount of
code. Also, the Delivery History service is not part of the main workflow, so running it outside of the Kubernetes
cluster doesn't affect the end-to-end latency of user-initiated operations.

Next steps
Interservice communication
Designing interservice communication for
microservices

Communication between microservices must be efficient and robust. With lots of small services interacting to
complete a single transaction, this can be a challenge. In this article, we look at the tradeoffs between
asynchronous messaging versus synchronous APIs. Then we look at some of the challenges in designing resilient
interservice communication.

Challenges
Here are some of the main challenges arising from service-to-service communication. Service meshes, described
later in this article, are designed to handle many of these challenges.
Resiliency . There may be dozens or even hundreds of instances of any given microservice. An instance can fail for
any number of reasons. There can be a node-level failure, such as a hardware failure or a VM reboot. An instance
might crash, or be overwhelmed with requests and unable to process any new requests. Any of these events can
cause a network call to fail. There are two design patterns that can help make service-to-service network calls
more resilient:
Retry . A network call may fail because of a transient fault that goes away by itself. Rather than fail outright,
the caller should typically retry the operation a certain number of times, or until a configured time-out
period elapses. However, if an operation is not idempotent, retries can cause unintended side effects. The
original call might succeed, but the caller never gets a response. If the caller retries, the operation may be
invoked twice. Generally, it's not safe to retry POST or PATCH methods, because these are not guaranteed to
be idempotent. A minimal retry sketch appears after this list of challenges.
Circuit Breaker . Too many failed requests can cause a bottleneck, as pending requests accumulate in the
queue. These blocked requests might hold critical system resources such as memory, threads, database
connections, and so on, which can cause cascading failures. The Circuit Breaker pattern can prevent a
service from repeatedly trying an operation that is likely to fail.
Load balancing . When service "A" calls service "B", the request must reach a running instance of service "B". In
Kubernetes, the Service resource type provides a stable IP address for a group of pods. Network traffic to the
service's IP address gets forwarded to a pod by means of iptables rules. By default, a random pod is chosen. A
service mesh (see below) can provide more intelligent load balancing algorithms based on observed latency or
other metrics.
Distributed tracing . A single transaction may span multiple services. That can make it hard to monitor the
overall performance and health of the system. Even if every service generates logs and metrics, without some way
to tie them together, they are of limited use. The article Logging and monitoring talks more about distributed
tracing, but we mention it here as a challenge.
Service versioning . When a team deploys a new version of a service, they must avoid breaking any other
services or external clients that depend on it. In addition, you might want to run multiple versions of a service
side-by-side, and route requests to a particular version. See API Versioning for more discussion of this issue.
TLS encryption and mutual TLS authentication . For security reasons, you may want to encrypt traffic
between services with TLS, and use mutual TLS authentication to authenticate callers.
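The following is the minimal retry sketch mentioned above: a generic helper with exponential backoff, written in TypeScript. It is illustrative only; the function name, delays, and attempt counts are assumptions, and it should only wrap operations that are safe to retry, such as idempotent calls.

// Minimal retry helper with exponential backoff (illustrative, not a library API).
async function withRetry<T>(
    operation: () => Promise<T>,
    maxAttempts = 3,
    initialDelayMs = 200
): Promise<T> {
    let lastError: unknown;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await operation();
        } catch (err) {
            lastError = err;
            if (attempt === maxAttempts) {
                break;
            }
            // Exponential backoff: 200 ms, 400 ms, 800 ms, ...
            const delayMs = initialDelayMs * 2 ** (attempt - 1);
            await new Promise((resolve) => setTimeout(resolve, delayMs));
        }
    }
    throw lastError;
}

A circuit breaker builds on this idea by tracking consecutive failures and short-circuiting calls for a period of time. In practice, both behaviors are often provided by a resilience library or a service mesh rather than hand-written.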
Synchronous versus asynchronous messaging
There are two basic messaging patterns that microservices can use to communicate with other microservices.
1. Synchronous communication. In this pattern, a service calls an API that another service exposes, using a
protocol such as HTTP or gRPC. This option is a synchronous messaging pattern because the caller waits for
a response from the receiver.
2. Asynchronous message passing. In this pattern, a service sends a message without waiting for a response,
and one or more services process the message asynchronously.
It's important to distinguish between asynchronous I/O and an asynchronous protocol. Asynchronous I/O means
the calling thread is not blocked while the I/O completes. That's important for performance, but is an
implementation detail in terms of the architecture. An asynchronous protocol means the sender doesn't wait for a
response. HTTP is a synchronous protocol, even though an HTTP client may use asynchronous I/O when it sends a
request.
There are tradeoffs to each pattern. Request/response is a well-understood paradigm, so designing an API may
feel more natural than designing a messaging system. However, asynchronous messaging has some advantages
that can be useful in a microservices architecture:
Reduced coupling . The message sender does not need to know about the consumer.
Multiple subscribers . Using a pub/sub model, multiple consumers can subscribe to receive events. See
Event-driven architecture style.
Failure isolation . If the consumer fails, the sender can still send messages. The messages will be picked up
when the consumer recovers. This ability is especially useful in a microservices architecture, because each
service has its own lifecycle. A service could become unavailable or be replaced with a newer version at any
given time. Asynchronous messaging can handle intermittent downtime. Synchronous APIs, on the other
hand, require the downstream service to be available or the operation fails.
Responsiveness . An upstream service can reply faster if it does not wait on downstream services. This is
especially useful in a microservices architecture. If there is a chain of service dependencies (service A calls
B, which calls C, and so on), waiting on synchronous calls can add unacceptable amounts of latency.
Load leveling . A queue can act as a buffer to level the workload, so that receivers can process messages at
their own rate.
Workflows . Queues can be used to manage a workflow, by check-pointing the message after each step in
the workflow.
However, there are also some challenges to using asynchronous messaging effectively.
Coupling with the messaging infrastructure . Using a particular messaging infrastructure may cause
tight coupling with that infrastructure. It will be difficult to switch to another messaging infrastructure later.
Latency . End-to-end latency for an operation may become high if the message queues fill up.
Cost . At high throughputs, the monetary cost of the messaging infrastructure could be significant.
Complexity . Handling asynchronous messaging is not a trivial task. For example, you must handle
duplicated messages, either by de-duplicating or by making operations idempotent. It's also hard to
implement request-response semantics using asynchronous messaging. To send a response, you need
another queue, plus a way to correlate request and response messages.
Throughput . If messages require queue semantics, the queue can become a bottleneck in the system. Each
message requires at least one queue operation and one dequeue operation. Moreover, queue semantics
generally require some kind of locking inside the messaging infrastructure. If the queue is a managed
service, there may be additional latency, because the queue is external to the cluster's virtual network. You
can mitigate these issues by batching messages, but that complicates the code. If the messages don't
require queue semantics, you might be able to use an event stream instead of a queue. For more
information, see Event-driven architectural style.

Drone Delivery: Choosing the messaging patterns


With these considerations in mind, the development team made the following design choices for the Drone
Delivery application:
The Ingestion service exposes a public REST API that client applications use to schedule, update, or cancel
deliveries.
The Ingestion service uses Event Hubs to send asynchronous messages to the Scheduler service.
Asynchronous messages are necessary to implement the load-leveling that is required for ingestion.
The Account, Delivery, Package, Drone, and Third-party Transport services all expose internal REST APIs. The
Scheduler service calls these APIs to carry out a user request. One reason to use synchronous APIs is that
the Scheduler needs to get a response from each of the downstream services. A failure in any of the
downstream services means the entire operation failed. However, a potential issue is the amount of latency
that is introduced by calling the backend services.
If any downstream service has a nontransient failure, the entire transaction should be marked as failed. To
handle this case, the Scheduler service sends an asynchronous message to the Supervisor, so that the
Supervisor can schedule compensating transactions.
The Delivery service exposes a public API that clients can use to get the status of a delivery. In the article API
gateway, we discuss how an API gateway can hide the underlying services from the client, so the client
doesn't need to know which services expose which APIs.
While a drone is in flight, the Drone service sends events that contain the drone's current location and
status. The Delivery service listens to these events in order to track the status of a delivery.
When the status of a delivery changes, the Delivery service sends a delivery status event, such as
DeliveryCreated or DeliveryCompleted . Any service can subscribe to these events. In the current design, the
Delivery History service is the only subscriber, but there might be other subscribers later. For example, the events
might go to a real-time analytics service. And because the Scheduler doesn't have to wait for a response,
adding more subscribers doesn't affect the main workflow path.
Notice that delivery status events are derived from drone location events. For example, when a drone reaches a
delivery location and drops off a package, the Delivery service translates this into a DeliveryCompleted event. This
is an example of thinking in terms of domain models. As described earlier, Drone Management belongs in a
separate bounded context. The drone events convey the physical location of a drone. The delivery events, on the
other hand, represent changes in the status of a delivery, which is a different business entity.
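To illustrate the pub/sub side of this design, here is a sketch of how the Delivery service might publish a DeliveryCompleted event with the Azure Event Hubs client library for JavaScript (@azure/event-hubs). The connection string variable, event hub name, and payload shape are assumptions for illustration.

import { EventHubProducerClient } from "@azure/event-hubs";

// Sketch: publish a delivery status event. Names and payload shape are assumed.
async function publishDeliveryCompleted(deliveryId: string): Promise<void> {
    const producer = new EventHubProducerClient(
        process.env.EVENTHUB_CONNECTION_STRING!,   // assumed environment variable
        "delivery-status"                          // assumed event hub name
    );
    try {
        const batch = await producer.createBatch();
        batch.tryAdd({
            body: { kind: "DeliveryCompleted", deliveryId, occurredAt: new Date().toISOString() },
        });
        await producer.sendBatch(batch);
    } finally {
        await producer.close();
    }
}

Because subscribers read from the event stream independently, adding a new subscriber (such as a real-time analytics service) requires no change to this publishing code.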

Using a service mesh


A service mesh is a software layer that handles service-to-service communication. Service meshes are designed to
address many of the concerns listed in the previous section, and to move responsibility for these concerns away
from the microservices themselves and into a shared layer. The service mesh acts as a proxy that intercepts
network communication between microservices in the cluster. Currently, the service mesh concept applies mainly
to container orchestrators, rather than serverless architectures.

NOTE
Service mesh is an example of the Ambassador pattern — a helper service that sends network requests on behalf of the
application.

Right now, the main options for a service mesh in Kubernetes are linkerd and Istio. Both of these technologies are
evolving rapidly. However, some features that both linkerd and Istio have in common include:
Load balancing at the session level, based on observed latencies or number of outstanding requests. This
can improve performance over the layer-4 load balancing that is provided by Kubernetes.
Layer-7 routing based on URL path, Host header, API version, or other application-level rules.
Retry of failed requests. A service mesh understands HTTP error codes, and can automatically retry failed
requests. You can configure the maximum number of retries, along with a timeout period, in order to
bound the maximum latency.
Circuit breaking. If an instance consistently fails requests, the service mesh will temporarily mark it as
unavailable. After a backoff period, it will try the instance again. You can configure the circuit breaker based
on various criteria, such as the number of consecutive failures.
Metric collection for interservice calls, such as request volume, latency, error and success rates, and
response sizes. A service mesh also enables distributed tracing by adding correlation
information for each hop in a request.
Mutual TLS Authentication for service-to-service calls.
Do you need a service mesh? It depends. Without a service mesh, you'll need to consider each of the challenges
mentioned at the beginning of this article. You can solve problems like retry, circuit breaker, and distributed tracing
without a service mesh, but a service mesh moves these concerns out of the individual services and into a
dedicated layer. On the other hand, a service mesh adds complexity to the setup and configuration of the cluster.
There may be performance implications, because requests now get routed through the service mesh proxy, and
because extra services are now running on every node in the cluster. You should do thorough performance and
load testing before deploying a service mesh in production.

Distributed transactions
A common challenge in microservices is correctly handling transactions that span multiple services. Often in this
scenario, the success of a transaction is all or nothing — if one of the participating services fails, the entire
transaction must fail.
There are two cases to consider:
A service may experience a transient failure such as a network timeout. These errors can often be resolved
simply by retrying the call. If the operation still fails after a certain number of attempts, it's considered a
nontransient failure.
A nontransient failure is any failure that's unlikely to go away by itself. Nontransient failures include normal
error conditions, such as invalid input. They also include unhandled exceptions in application code or a
process crashing. If this type of error occurs, the entire business transaction must be marked as a failure. It
may be necessary to undo other steps in the same transaction that already succeeded.
After a nontransient failure, the current transaction might be in a partially failed state, where one or more steps
already completed successfully. For example, if the Drone service already scheduled a drone, the drone must be
canceled. In that case, the application needs to undo the steps that succeeded, by using a Compensating
Transaction. In some cases, this must be done by an external system or even by a manual process.
If the logic for compensating transactions is complex, consider creating a separate service that is responsible for
this process. In the Drone Delivery application, the Scheduler service puts failed operations onto a dedicated
queue. A separate microservice, called the Supervisor, reads from this queue and calls a cancellation API on the
services that need to compensate. This is a variation of the Scheduler Agent Supervisor pattern. The Supervisor
service might take other actions as well, such as notify the user by text or email, or send an alert to an operations
dashboard.
The Scheduler service itself might fault (for example, because a node crashes). In that case, a new instance can spin
up and take over. However, any transactions that were already in progress must be resumed.
One approach is to save a checkpoint to a durable store after each step in the workflow is completed. If an instance
of the Scheduler service crashes in the middle of a transaction, a new instance can use the checkpoint to resume
where the previous instance left off. However, writing checkpoints can create a performance overhead.
Another option is to design all operations to be idempotent. An operation is idempotent if it can be called multiple
times without producing additional side-effects after the first call. Essentially, the downstream service should
ignore duplicate calls, which means the service must be able to detect duplicate calls. It's not always
straightforward to implement idempotent methods. For more information, see Idempotent operations.
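A common technique is to give each operation a caller-supplied identifier and have the downstream service record the identifiers it has already processed. The sketch below is illustrative only; a real service would persist the processed IDs in a durable store rather than an in-memory set, and the function names are assumptions.

// Illustrative idempotent operation: duplicate calls with the same operation ID are ignored.
const processedOperationIds = new Set<string>();   // stand-in for a durable store

async function cancelDroneAssignment(operationId: string, droneId: string): Promise<void> {
    if (processedOperationIds.has(operationId)) {
        return;   // already handled; produce no additional side effects
    }
    // Perform the actual work exactly once, for example by calling the Drone service.
    console.log(`Cancelling drone assignment for drone ${droneId}`);
    processedOperationIds.add(operationId);
}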

Next steps
For microservices that talk directly to each other, it's important to create well-designed APIs.
API design
Designing APIs for microservices
11/2/2020 • 12 minutes to read • Edit Online

Good API design is important in a microservices architecture, because all data exchange between services happens
either through messages or API calls. APIs must be efficient to avoid creating chatty I/O. Because services are
designed by teams working independently, APIs must have well-defined semantics and versioning schemes, so
that updates don't break other services.

It's important to distinguish between two types of API:


Public APIs that client applications call.
Backend APIs that are used for interservice communication.
These two use cases have somewhat different requirements. A public API must be compatible with client
applications, typically browser applications or native mobile applications. Most of the time, that means the public
API will use REST over HTTP. For the backend APIs, however, you need to take network performance into account.
Depending on the granularity of your services, interservice communication can result in a lot of network traffic.
Services can quickly become I/O bound. For that reason, considerations such as serialization speed and payload
size become more important. Some popular alternatives to using REST over HTTP include gRPC, Apache Avro, and
Apache Thrift. These protocols support binary serialization and are generally more efficient than HTTP.

Considerations
Here are some things to think about when choosing how to implement an API.
REST versus RPC . Consider the tradeoffs between using a REST-style interface versus an RPC-style interface.
REST models resources, which can be a natural way to express your domain model. It defines a uniform
interface based on HTTP verbs, which encourages evolvability. It has well-defined semantics in terms of
idempotency, side effects, and response codes. And it enforces stateless communication, which improves
scalability.
RPC is more oriented around operations or commands. Because RPC interfaces look like local method calls,
it may lead you to design overly chatty APIs. However, that doesn't mean RPC must be chatty. It just means
you need to use care when designing the interface.
For a RESTful interface, the most common choice is REST over HTTP using JSON. For an RPC-style interface, there
are several popular frameworks, including gRPC, Apache Avro, and Apache Thrift.
Efficiency . Consider efficiency in terms of speed, memory, and payload size. Typically a gRPC-based interface is
faster than REST over HTTP.
Interface definition language (IDL) . An IDL is used to define the methods, parameters, and return values of an
API. An IDL can be used to generate client code, serialization code, and API documentation. IDLs can also be
consumed by API testing tools such as Postman. Frameworks such as gRPC, Avro, and Thrift define their own IDL
specifications. REST over HTTP does not have a standard IDL format, but a common choice is OpenAPI (formerly
Swagger). You can also create an HTTP REST API without using a formal definition language, but then you lose the
benefits of code generation and testing.
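For reference, here is a minimal OpenAPI 3.0 description of a single Delivery endpoint, written as a TypeScript object that serializes to the JSON form of the document. The path and response details are assumptions for illustration.

// Minimal OpenAPI 3.0 document (illustrative); JSON.stringify produces the JSON form.
const deliveryApi = {
    openapi: "3.0.1",
    info: { title: "Delivery API", version: "1.0" },
    paths: {
        "/deliveries/{id}": {
            get: {
                parameters: [
                    { name: "id", in: "path", required: true, schema: { type: "string" } },
                ],
                responses: {
                    "200": { description: "The requested delivery." },
                    "404": { description: "Delivery not found." },
                },
            },
        },
    },
};

console.log(JSON.stringify(deliveryApi, null, 2));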
Serialization . How are objects serialized over the wire? Options include text-based formats (primarily JSON) and
binary formats such as Protocol Buffers. Binary formats are generally faster than text-based formats. However,
JSON has advantages in terms of interoperability, because most languages and frameworks support JSON
serialization. Some serialization formats require a fixed schema, and some require compiling a schema definition
file. In that case, you'll need to incorporate this step into your build process.
Framework and language support . HTTP is supported in nearly every framework and language. gRPC, Avro,
and Thrift all have libraries for C++, C#, Java, and Python. Thrift and gRPC also support Go.
Compatibility and interoperability . If you choose a protocol like gRPC, you may need a protocol translation
layer between the public API and the back end. A gateway can perform that function. If you are using a service
mesh, consider which protocols are compatible with the service mesh. For example, linkerd has built-in support for
HTTP, Thrift, and gRPC.
Our baseline recommendation is to choose REST over HTTP unless you need the performance benefits of a binary
protocol. REST over HTTP requires no special libraries. It creates minimal coupling, because callers don't need a
client stub to communicate with the service. There is a rich ecosystem of tools to support schema definitions,
testing, and monitoring of RESTful HTTP endpoints. Finally, HTTP is compatible with browser clients, so you don't
need a protocol translation layer between the client and the backend.
However, if you choose REST over HTTP, you should do performance and load testing early in the development
process, to validate whether it performs well enough for your scenario.

RESTful API design


There are many resources for designing RESTful APIs. Here are some that you might find helpful:
API design
API implementation
Microsoft REST API Guidelines
Here are some specific considerations to keep in mind.
Watch out for APIs that leak internal implementation details or simply mirror an internal database schema.
The API should model the domain. It's a contract between services, and ideally should only change when
new functionality is added, not just because you refactored some code or normalized a database table.
Different types of client, such as mobile application and desktop web browser, may require different
payload sizes or interaction patterns. Consider using the Backends for Frontends pattern to create separate
backends for each client that expose an optimal interface for that client.
For operations with side effects, consider making them idempotent and implementing them as PUT
methods. That will enable safe retries and can improve resiliency. The article Interservice communication
discusses this issue in more detail.
HTTP methods can have asynchronous semantics, where the method returns a response immediately, but
the service carries out the operation asynchronously. In that case, the method should return an HTTP 202
response code, which indicates the request was accepted for processing, but the processing is not yet
completed. For more information, see Asynchronous Request-Reply pattern.
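As an illustration of the asynchronous pattern in the last point above, the sketch below uses Express (an assumption; any HTTP framework works) to return 202 Accepted immediately, along with a status URL that the client can poll. Routes and payloads are illustrative.

import express from "express";
import { randomUUID } from "crypto";

// Sketch of the Asynchronous Request-Reply pattern.
const app = express();
app.use(express.json());

app.post("/deliveries", (req, res) => {
    const deliveryId = randomUUID();
    // Start the long-running work asynchronously here, for example by enqueuing a message.
    // Respond immediately with 202 Accepted and a status resource the client can poll.
    res.status(202).location(`/deliveries/${deliveryId}/status`).json({ deliveryId });
});

app.get("/deliveries/:id/status", (req, res) => {
    // A real implementation would look up the current state of the operation.
    res.json({ deliveryId: req.params.id, status: "pending" });
});

app.listen(3000);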

Mapping REST to DDD patterns


Patterns such as entity, aggregate, and value object are designed to place certain constraints on the objects in your
domain model. In many discussions of DDD, the patterns are modeled using object-oriented (OO) language
concepts like constructors or property getters and setters. For example, value objects are supposed to be
immutable. In an OO programming language, you would enforce this by assigning the values in the constructor
and making the properties read-only:

export class Location {
    readonly latitude: number;
    readonly longitude: number;

    constructor(latitude: number, longitude: number) {
        if (latitude < -90 || latitude > 90) {
            throw new RangeError('latitude must be between -90 and 90');
        }
        if (longitude < -180 || longitude > 180) {
            throw new RangeError('longitude must be between -180 and 180');
        }
        this.latitude = latitude;
        this.longitude = longitude;
    }
}

These sorts of coding practices are particularly important when building a traditional monolithic application. With
a large code base, many subsystems might use the Location object, so it's important for the object to enforce
correct behavior.
Another example is the Repository pattern, which ensures that other parts of the application do not make direct
reads or writes to the data store.

In a microservices architecture, however, services don't share the same code base and don't share data stores.
Instead, they communicate through APIs. Consider the case where the Scheduler service requests information
about a drone from the Drone service. The Drone service has its internal model of a drone, expressed through
code. But the Scheduler doesn't see that. Instead, it gets back a representation of the drone entity — perhaps a
JSON object in an HTTP response.
The Scheduler service can't modify the Drone service's internal models, or write to the Drone service's data store.
That means the code that implements the Drone service has a smaller exposed surface area, compared with code
in a traditional monolith. If the Drone service defines a Location class, the scope of that class is limited — no other
service will directly consume the class.
For these reasons, this guidance doesn't focus much on coding practices as they relate to the tactical DDD patterns.
But it turns out that you can also model many of the DDD patterns through REST APIs.
For example:
Aggregates map naturally to resources in REST. For example, the Delivery aggregate would be exposed as a
resource by the Delivery API.
Aggregates are consistency boundaries. Operations on aggregates should never leave an aggregate in an
inconsistent state. Therefore, you should avoid creating APIs that allow a client to manipulate the internal
state of an aggregate. Instead, favor coarse-grained APIs that expose aggregates as resources.
Entities have unique identities. In REST, resources have unique identifiers in the form of URLs. Create
resource URLs that correspond to an entity's domain identity. The mapping from URL to domain identity
may be opaque to the client.
Child entities of an aggregate can be reached by navigating from the root entity. If you follow HATEOAS
principles, child entities can be reached via links in the representation of the parent entity.
Because value objects are immutable, updates are performed by replacing the entire value object. In REST,
implement updates through PUT or PATCH requests.
A repository lets clients query, add, or remove objects in a collection, abstracting the details of the
underlying data store. In REST, a collection can be a distinct resource, with methods for querying the
collection or adding new entities to the collection.
When you design your APIs, think about how they express the domain model, not just the data inside the model,
but also the business operations and the constraints on the data.

DDD CONCEPT             REST EQUIVALENT     EXAMPLE
Aggregate               Resource            { "id":1234, "status":"pending"... }
Identity                URL                 https://delivery-service/deliveries/1
Child entities          Links               { "href": "/deliveries/1/confirmation" }
Update value objects    PUT or PATCH        PUT https://delivery-service/deliveries/1/dropoff
Repository              Collection          https://delivery-service/deliveries?status=pending

API versioning
An API is a contract between a service and clients or consumers of that service. If an API changes, there is a risk of
breaking clients that depend on the API, whether those are external clients or other microservices. Therefore, it's a
good idea to minimize the number of API changes that you make. Often, changes in the underlying
implementation don't require any changes to the API. Realistically, however, at some point you will want to add
new features or new capabilities that require changing an existing API.
Whenever possible, make API changes backward compatible. For example, avoid removing a field from a model,
because that can break clients that expect the field to be there. Adding a field does not break compatibility, because
clients should ignore any fields they don't understand in a response. However, the service must handle the case
where an older client omits the new field in a request.
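As an illustration, one way to tolerate older clients is to make the new field optional in the request model and apply a documented default on the server. The type and field names below are hypothetical and only sketch the idea:

public class DeliveryRequest
{
    // Fields that all clients already send.
    public string Id { get; set; } = string.Empty;
    public string DropoffLocation { get; set; } = string.Empty;

    // Newly added field. Older clients omit it, so it must not be required;
    // a null value falls back to a documented default on the server.
    public string? PackagingInstructions { get; set; }
}

public static class DeliveryRequestDefaults
{
    public static string ResolvePackaging(DeliveryRequest request) =>
        request.PackagingInstructions ?? "standard";
}

Validation should treat the missing field as equivalent to the default rather than rejecting the request.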
Support versioning in your API contract. If you introduce a breaking API change, introduce a new API version.
Continue to support the previous version, and let clients select which version to call. There are a couple of ways to
do this. One is simply to expose both versions in the same service. Another option is to run two versions of the
service side-by-side, and route requests to one or the other version, based on HTTP routing rules.

The diagram has two parts. "Service supports two versions" shows the v1 Client and the v2 Client both pointing to
one Service. "Side-by-side deployment" shows the v1 Client pointing to a v1 Service, and the v2 Client pointing to
a v2 Service.
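As a rough sketch of the first option, a single ASP.NET Core service can expose both versions under different URL segments. The controllers and payload shapes below are illustrative and are not taken from the reference implementation:

using Microsoft.AspNetCore.Mvc;

// v1 contract: the original shape of the resource.
[ApiController]
[Route("api/v1/deliveries")]
public class DeliveriesV1Controller : ControllerBase
{
    [HttpGet("{id}")]
    public IActionResult Get(string id) =>
        Ok(new { id, status = "pending" });
}

// v2 contract: a breaking change to the payload, published side by side.
// v1 remains available until clients have migrated.
[ApiController]
[Route("api/v2/deliveries")]
public class DeliveriesV2Controller : ControllerBase
{
    [HttpGet("{id}")]
    public IActionResult Get(string id) =>
        Ok(new { id, state = "Pending", confirmationRequired = false });
}

With the side-by-side deployment option, the same effect is achieved by routing rules at the gateway rather than by routes inside a single service.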
There's a cost to supporting multiple versions, in terms of developer time, testing, and operational overhead.
Therefore, it's good to deprecate old versions as quickly as possible. For internal APIs, the team that owns the API
can work with other teams to help them migrate to the new version. This is when having a cross-team governance
process is useful. For external (public) APIs, it can be harder to deprecate an API version, especially if the API is
consumed by third parties or by native client applications.
When a service implementation changes, it's useful to tag the change with a version. The version provides
important information when troubleshooting errors. It can be very helpful for root cause analysis to know exactly
which version of the service was called. Consider using semantic versioning for service versions. Semantic
versioning uses a MAJOR.MINOR.PATCH format. However, clients should only select an API by the major version
number, or possibly the minor version if there are significant (but non-breaking) changes between minor versions.
In other words, it's reasonable for clients to select between version 1 and version 2 of an API, but not to select
version 2.1.3. If you allow that level of granularity, you risk having to support a proliferation of versions.
For further discussion of API versioning, see Versioning a RESTful web API.

Idempotent operations
An operation is idempotent if it can be called multiple times without producing additional side-effects after the
first call. Idempotency can be a useful resiliency strategy, because it allows an upstream service to safely invoke an
operation multiple times. For a discussion of this point, see Distributed transactions.
The HTTP specification states that GET, PUT, and DELETE methods must be idempotent. POST methods are not
guaranteed to be idempotent. If a POST method creates a new resource, there is generally no guarantee that this
operation is idempotent. The specification defines idempotent this way:

A request method is considered "idempotent" if the intended effect on the server of multiple identical requests
with that method is the same as the effect for a single such request. (RFC 7231)

It's important to understand the difference between PUT and POST semantics when creating a new entity. In both
cases, the client sends a representation of an entity in the request body. But the meaning of the URI is different.
For a POST method, the URI represents a parent resource of the new entity, such as a collection. For
example, to create a new delivery, the URI might be /api/deliveries . The server creates the entity and
assigns it a new URI, such as /api/deliveries/39660 . This URI is returned in the Location header of the
response. Each time the client sends a request, the server will create a new entity with a new URI.
For a PUT method, the URI identifies the entity. If there already exists an entity with that URI, the server
replaces the existing entity with the version in the request. If no entity exists with that URI, the server creates
one. For example, suppose the client sends a PUT request to api/deliveries/39660 . Assuming there is no
delivery with that URI, the server creates a new one. Now if the client sends the same request again, the
server will replace the existing entity.
Here is the Delivery service's implementation of the PUT method.
[HttpPut("{id}")]
[ProducesResponseType(typeof(Delivery), 201)]
[ProducesResponseType(typeof(void), 204)]
public async Task<IActionResult> Put([FromBody]Delivery delivery, string id)
{
    logger.LogInformation("In Put action with delivery {Id}: {@DeliveryInfo}", id, delivery.ToLogInfo());
    try
    {
        var internalDelivery = delivery.ToInternal();

        // Create the new delivery entity.
        await deliveryRepository.CreateAsync(internalDelivery);

        // Create a delivery status event.
        var deliveryStatusEvent = new DeliveryStatusEvent { DeliveryId = delivery.Id, Stage = DeliveryEventType.Created };
        await deliveryStatusEventRepository.AddAsync(deliveryStatusEvent);

        // Return HTTP 201 (Created)
        return CreatedAtRoute("GetDelivery", new { id = delivery.Id }, delivery);
    }
    catch (DuplicateResourceException)
    {
        // This method is mainly used to create deliveries. If the delivery already exists then update it.
        logger.LogInformation("Updating resource with delivery id: {DeliveryId}", id);

        var internalDelivery = delivery.ToInternal();
        await deliveryRepository.UpdateAsync(id, internalDelivery);

        // Return HTTP 204 (No Content)
        return NoContent();
    }
}

It's expected that most requests will create a new entity, so the method optimistically calls CreateAsync on the
repository object, and then handles any duplicate-resource exceptions by updating the resource instead.

Next steps
Learn about using an API gateway at the boundary between client applications and microservices.
API gateways
Using API gateways in microservices
11/2/2020 • 5 minutes to read

In a microservices architecture, a client might interact with more than one front-end service. Given this fact, how
does a client know what endpoints to call? What happens when new services are introduced, or existing services
are refactored? How do services handle SSL termination, authentication, and other concerns? An API gateway can
help to address these challenges.

What is an API gateway?


An API gateway sits between clients and services. It acts as a reverse proxy, routing requests from clients to
services. It may also perform various cross-cutting tasks such as authentication, SSL termination, and rate
limiting. If you don't deploy a gateway, clients must send requests directly to front-end services. However, there
are some potential problems with exposing services directly to clients:
It can result in complex client code. The client must keep track of multiple endpoints, and handle failures in a
resilient way.
It creates coupling between the client and the backend. The client needs to know how the individual services
are decomposed. That makes it harder to maintain the client and also harder to refactor services.
A single operation might require calls to multiple services. That can result in multiple network round trips
between the client and the server, adding significant latency.
Each public-facing service must handle concerns such as authentication, SSL, and client rate limiting.
Services must expose a client-friendly protocol such as HTTP or WebSocket. This limits the choice of
communication protocols.
Services with public endpoints are a potential attack surface, and must be hardened.
A gateway helps to address these issues by decoupling clients from services. Gateways can perform a number of
different functions, and you may not need all of them. The functions can be grouped into the following design
patterns:
Gateway Routing. Use the gateway as a reverse proxy to route requests to one or more backend services, using
layer 7 routing. The gateway provides a single endpoint for clients, and helps to decouple clients from services.
Gateway Aggregation. Use the gateway to aggregate multiple individual requests into a single request. This
pattern applies when a single operation requires calls to multiple backend services. The client sends one request
to the gateway. The gateway dispatches requests to the various backend services, and then aggregates the results
and sends them back to the client. This helps to reduce chattiness between the client and the backend.
Gateway Offloading. Use the gateway to offload functionality from individual services to the gateway, particularly
cross-cutting concerns. It can be useful to consolidate these functions into one place, rather than making every
service responsible for implementing them. This is particularly true for features that require specialized skills to
implement correctly, such as authentication and authorization.
Here are some examples of functionality that could be offloaded to a gateway:
SSL termination
Authentication
IP allow/block list
Client rate limiting (throttling)
Logging and monitoring
Response caching
Web application firewall
GZIP compression
Servicing static content

Choosing a gateway technology


Here are some options for implementing an API gateway in your application.
Reverse proxy server. Nginx and HAProxy are popular reverse proxy servers that support features such
as load balancing, SSL, and layer 7 routing. They are both free, open-source products, with paid editions
that provide additional features and support options. Nginx and HAProxy are both mature products with
rich feature sets and high performance. You can extend them with third-party modules or by writing
custom scripts in Lua. Nginx also supports a JavaScript-based scripting module referred to as 'NGINX
JavaScript'. This module was formerly named nginScript.
Service mesh ingress controller. If you are using a service mesh such as linkerd or Istio, consider the
features that are provided by the ingress controller for that service mesh. For example, the Istio ingress
controller supports layer 7 routing, HTTP redirects, retries, and other features.
Azure Application Gateway. Application Gateway is a managed load balancing service that can perform
layer-7 routing and SSL termination. It also provides a web application firewall (WAF).
Azure API Management. API Management is a turnkey solution for publishing APIs to external and internal
customers. It provides features that are useful for managing a public-facing API, including rate limiting, IP
restrictions, and authentication using Azure Active Directory or other identity providers. API Management
doesn't perform any load balancing, so it should be used in conjunction with a load balancer such as
Application Gateway or a reverse proxy. For information about using API Management with Application
Gateway, see Integrate API Management in an internal VNet with Application Gateway.
When choosing a gateway technology, consider the following:
Features. The options listed above all support layer 7 routing, but support for other features will vary.
Depending on the features that you need, you might deploy more than one gateway.
Deployment. Azure Application Gateway and API Management are managed services. Nginx and HAProxy will
typically run in containers inside the cluster, but can also be deployed to dedicated VMs outside of the cluster. This
isolates the gateway from the rest of the workload, but incurs higher management overhead.
Management. When services are updated or new services are added, the gateway routing rules may need to be
updated. Consider how this process will be managed. Similar considerations apply to managing SSL certificates,
IP allow lists, and other aspects of configuration.

Deploying Nginx or HAProxy to Kubernetes


You can deploy Nginx or HAProxy to Kubernetes as a ReplicaSet or DaemonSet that specifies the Nginx or
HAProxy container image. Use a ConfigMap to store the configuration file for the proxy, and mount the
ConfigMap as a volume. Create a service of type LoadBalancer to expose the gateway through an Azure Load
Balancer.
An alternative is to create an Ingress Controller. An Ingress Controller is a Kubernetes resource that deploys a load
balancer or reverse proxy server. Several implementations exist, including Nginx and HAProxy. A separate
resource called an Ingress defines settings for the Ingress Controller, such as routing rules and TLS certificates.
That way, you don't need to manage complex configuration files that are specific to a particular proxy server
technology.
The gateway is a potential bottleneck or single point of failure in the system, so always deploy at least two
replicas for high availability. You may need to scale out the replicas further, depending on the load.
Also consider running the gateway on a dedicated set of nodes in the cluster. Benefits to this approach include:
Isolation. All inbound traffic goes to a fixed set of nodes, which can be isolated from backend services.
Stable configuration. If the gateway is misconfigured, the entire application may become unavailable.
Performance. You may want to use a specific VM configuration for the gateway for performance reasons.

Next steps
The previous articles have looked at the interfaces between microservices or between microservices and client
applications. By design, these interfaces treat each service as an opaque box. In particular, microservices should
never expose implementation details about how they manage data. That has implications for data integrity and
data consistency, explored in the next article.
Data considerations for microservices
Data considerations for microservices
11/2/2020 • 7 minutes to read

This article describes considerations for managing data in a microservices architecture. Because every
microservice manages its own data, data integrity and data consistency are critical challenges.
A basic principle of microservices is that each service manages its own data. Two services should not share a data
store. Instead, each service is responsible for its own private data store, which other services cannot access
directly.
The reason for this rule is to avoid unintentional coupling between services, which can result if services share the
same underlying data schemas. If there is a change to the data schema, the change must be coordinated across
every service that relies on that database. By isolating each service's data store, we can limit the scope of change,
and preserve the agility of truly independent deployments. Another reason is that each microservice may have its
own data models, queries, or read/write patterns. Using a shared data store limits each team's ability to optimize
data storage for their particular service.

This approach naturally leads to polyglot persistence — the use of multiple data storage technologies within a
single application. One service might require the schema-on-read capabilities of a document database. Another
might need the referential integrity provided by an RDBMS. Each team is free to make the best choice for their
service. For more about the general principle of polyglot persistence, see Use the best data store for the job.

NOTE
It's fine for services to share the same physical database server. The problem occurs when services share the same schema,
or read and write to the same set of database tables.

Challenges
Some challenges arise from this distributed approach to managing data. First, there may be redundancy across the
data stores, with the same item of data appearing in multiple places. For example, data might be stored as part of
a transaction, then stored elsewhere for analytics, reporting, or archiving. Duplicated or partitioned data can lead
to issues of data integrity and consistency. When data relationships span multiple services, you can't use
traditional data management techniques to enforce the relationships.
Traditional data modeling uses the rule of "one fact in one place." Every entity appears exactly once in the schema.
Other entities may hold references to it but not duplicate it. The obvious advantage to the traditional approach is
that updates are made in a single place, which avoids problems with data consistency. In a microservices
architecture, you have to consider how updates are propagated across services, and how to manage eventual
consistency when data appears in multiple places without strong consistency.

Approaches to managing data


There is no single approach that's correct in all cases, but here are some general guidelines for managing data in a
microservices architecture.
Embrace eventual consistency where possible. Understand the places in the system where you need strong
consistency or ACID transactions, and the places where eventual consistency is acceptable.
When you need strong consistency guarantees, one service may represent the source of truth for a given
entity, which is exposed through an API. Other services might hold their own copy of the data, or a subset of
the data, that is eventually consistent with the master data but not considered the source of truth. For
example, imagine an e-commerce system with a customer order service and a recommendation service.
The recommendation service might listen to events from the order service, but if a customer requests a
refund, it is the order service, not the recommendation service, that has the complete transaction history.
For transactions, use patterns such as Scheduler Agent Supervisor and Compensating Transaction to keep
data consistent across several services. You may need to store an additional piece of data that captures the
state of a unit of work that spans multiple services, to avoid partial failure among multiple services. For
example, keep a work item on a durable queue while a multi-step transaction is in progress.
Store only the data that a service needs. A service might only need a subset of information about a domain
entity. For example, in the Shipping bounded context, we need to know which customer is associated with a
particular delivery. But we don't need the customer's billing address — that's managed by the Accounts
bounded context. Thinking carefully about the domain, and using a DDD approach, can help here.
Consider whether your services are coherent and loosely coupled. If two services are continually
exchanging information with each other, resulting in chatty APIs, you may need to redraw your service
boundaries, by merging two services or refactoring their functionality.
Use an event driven architecture style. In this architecture style, a service publishes an event when there are
changes to its public models or entities. Interested services can subscribe to these events. For example,
another service could use the events to construct a materialized view of the data that is more suitable for
querying.
A service that owns events should publish a schema that can be used to automate serializing and
deserializing the events, to avoid tight coupling between publishers and subscribers. Consider JSON
Schema or a framework like Microsoft Bond, Protobuf, or Avro. A minimal sketch of a versioned event contract follows this list.
At high scale, events can become a bottleneck on the system, so consider using aggregation or batching to
reduce the total load.
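For illustration, a publishing service might define an explicit, versioned event contract and serialize it to JSON. The type and field names below are hypothetical; in practice the contract would typically be generated from, or validated against, a published schema (JSON Schema, Avro, Protobuf, or Bond):

using System;
using System.Text.Json;

// Versioned event contract. Subscribers consume the published schema,
// not this class, which keeps publishers and subscribers loosely coupled.
public record DeliveryStatusChanged(
    string DeliveryId,
    string Status,
    DateTimeOffset OccurredAt,
    int SchemaVersion = 1);

public static class EventSerialization
{
    public static string ToJson(DeliveryStatusChanged evt) =>
        JsonSerializer.Serialize(evt);
}

Incrementing SchemaVersion only for breaking changes lets subscribers handle old and new event shapes during a migration.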

Example: Choosing data stores for the Drone Delivery application


The previous articles in this series discuss a drone delivery service as a running example. You can read more about
the scenario and the corresponding reference implementation here.
To recap, this application defines several microservices for scheduling deliveries by drone. When a user schedules
a new delivery, the client request includes information about the delivery, such as pickup and dropoff locations,
and about the package, such as size and weight. This information defines a unit of work.
The various backend services care about different portions of the information in the request, and also have
different read and write profiles.
Delivery service
The Delivery service stores information about every delivery that is currently scheduled or in progress. It listens
for events from the drones, and tracks the status of deliveries that are in progress. It also sends domain events
with delivery status updates.
It's expected that users will frequently check the status of a delivery while they are waiting for their package.
Therefore, the Delivery service requires a data store that emphasizes throughput (read and write) over long-term
storage. Also, the Delivery service does not perform any complex queries or analysis, it simply fetches the latest
status for a given delivery. The Delivery service team chose Azure Cache for Redis for its high read-write
performance. The information stored in Redis is relatively short-lived. Once a delivery is complete, the Delivery
History service is the system of record.
Delivery History service
The Delivery History service listens for delivery status events from the Delivery service. It stores this data in long-
term storage. There are two different use-cases for this historical data, which have different data storage
requirements.
The first scenario is aggregating the data for the purpose of data analytics, in order to optimize the business or
improve the quality of the service. Note that the Delivery History service doesn't perform the actual analysis of the
data. It's only responsible for the ingestion and storage. For this scenario, the storage must be optimized for data
analysis over a large set of data, using a schema-on-read approach to accommodate a variety of data sources.
Azure Data Lake Store is a good fit for this scenario. Data Lake Store is an Apache Hadoop file system compatible
with Hadoop Distributed File System (HDFS), and is tuned for performance for data analytics scenarios.
The other scenario is enabling users to look up the history of a delivery after the delivery is completed. Azure Data
Lake is not optimized for this scenario. For optimal performance, Microsoft recommends storing time-series data
in Data Lake in folders partitioned by date. (See Tuning Azure Data Lake Store for performance). However, that
structure is not optimal for looking up individual records by ID. Unless you also know the timestamp, a lookup by
ID requires scanning the entire collection. Therefore, the Delivery History service also stores a subset of the
historical data in Cosmos DB for quicker lookup. The records don't need to stay in Cosmos DB indefinitely. Older
deliveries can be archived — say, after a month. This could be done by running an occasional batch process.
Package service
The Package service stores information about all of the packages. The storage requirements for the Package are:
Long-term storage.
Able to handle a high volume of packages, requiring high write throughput.
Support simple queries by package ID. No complex joins or requirements for referential integrity.
Because the package data is not relational, a document-oriented database is appropriate, and Cosmos DB can
achieve high throughput by using sharded collections. The team that works on the Package service is familiar with
the MEAN stack (MongoDB, Express.js, AngularJS, and Node.js), so they select the MongoDB API for Cosmos DB.
That lets them leverage their existing experience with MongoDB, while getting the benefits of Cosmos DB, which is
a managed Azure service.
Design patterns for microservices
11/2/2020 • 2 minutes to read

The goal of microservices is to increase the velocity of application releases, by decomposing the application into
small autonomous services that can be deployed independently. A microservices architecture also brings some
challenges. The design patterns shown here can help mitigate these challenges.

Ambassador can be used to offload common client connectivity tasks such as monitoring, logging, routing, and
security (such as TLS) in a language agnostic way. Ambassador services are often deployed as a sidecar (see
below).
Anti-corruption layer implements a façade between new and legacy applications, to ensure that the design of a
new application is not limited by dependencies on legacy systems.
Backends for Frontends creates separate backend services for different types of clients, such as desktop and
mobile. That way, a single backend service doesn't need to handle the conflicting requirements of various client
types. This pattern can help keep each microservice simple, by separating client-specific concerns.
Bulkhead isolates critical resources, such as connection pool, memory, and CPU, for each workload or service. By
using bulkheads, a single workload (or service) can't consume all of the resources, starving others. This pattern
increases the resiliency of the system by preventing cascading failures caused by one service.
Gateway Aggregation aggregates requests to multiple individual microservices into a single request, reducing
chattiness between consumers and services.
Gateway Offloading enables each microservice to offload shared service functionality, such as the use of SSL
certificates, to an API gateway.
Gateway Routing routes requests to multiple microservices using a single endpoint, so that consumers don't
need to manage many separate endpoints.
Sidecar deploys helper components of an application as a separate container or process to provide isolation and
encapsulation.
Strangler supports incremental refactoring of an application, by gradually replacing specific pieces of functionality
with new services.
For the complete catalog of cloud design patterns on the Azure Architecture Center, see Cloud Design Patterns.
Monitoring a microservices architecture in Azure
Kubernetes Service (AKS)
11/2/2020 • 14 minutes to read

This article describes best practices for monitoring a microservices application that runs on Azure Kubernetes
Service (AKS).
In any complex application, at some point something will go wrong. In a microservices application, you need to
track what's happening across dozens or even hundreds of services. To make sense of what's happening, you must
collect telemetry from the application. Telemetry can be divided into logs and metrics.
Logs are text-based records of events that occur while the application is running. They include things like
application logs (trace statements) or web server logs. Logs are primarily useful for forensics and root cause
analysis.
Metrics are numerical values that can be analyzed. You can use them to observe the system in real time (or close
to real time), or to analyze performance trends over time. To understand the system holistically, you must collect
metrics at various levels of the architecture, from the physical infrastructure to the application, including:
Node-level metrics, including CPU, memory, network, disk, and file system usage. System metrics help
you to understand resource allocation for each node in the cluster, and troubleshoot outliers.
Container metrics. For containerized applications, you need to collect metrics at the container level, not
just at the VM level.
Application metrics. This includes any metrics that are relevant to understanding the behavior of a service.
Examples include the number of queued inbound HTTP requests, request latency, or message queue length.
Applications can also create custom metrics that are specific to the domain, such as the number of business
transactions processed per minute.
Dependent service metrics. Services may call external services or endpoints, such as managed PaaS
services or SaaS services. Third-party services may or may not provide any metrics. If not, you'll have to
rely on your own application metrics to track statistics for latency and error rate.

Monitoring cluster status


Use Azure Monitor to monitor the overall health of your clusters. The following screenshot shows a cluster with
critical errors in user-deployed pods.

From here, you can drill in further to find the issue. For example, if the pod status is ImagePullBackoff , it means
that Kubernetes could not pull the container image from the registry. This could be caused by an invalid container
tag or an authentication error trying to pull from the registry.
Note that a crashing container will put the container state into State = Waiting, with Reason =
CrashLoopBackOff. For a typical scenario where a pod is part of a replica set and the retry policy is Always, this
won't show as an error in the cluster status. However, you can run queries or set up alerts for this condition. For
more information, see Understand AKS cluster performance with Azure Monitor for containers.

Metrics
We recommend using Azure Monitor to collect and view metrics for your AKS clusters and any other dependent
Azure services.
For cluster and container metrics, enable Azure Monitor for containers. When this feature is enabled, Azure
Monitor collects memory and processor metrics from controllers, nodes, and containers via the Kubernetes
metrics API. For more information about the metrics that are available through Azure Monitor for
containers, see Understand AKS cluster performance with Azure Monitor for containers.
Use Application Insights to collect application metrics. Application Insights is an extensible Application
Performance Management (APM) service. To use it, you install an instrumentation package in your
application. This package monitors the app and sends telemetry data to the Application Insights service. It
can also pull telemetry data from the host environment. The data is then sent to Azure Monitor. Application
Insights also provides built-in correlation and dependency tracking (see Distributed tracing, below).
Application Insights has a maximum throughput measured in events/second, and it throttles if the data rate
exceeds the limit. For details, see Application Insights limits. Create different Application Insights instances per
environment, so that dev/test environments don't compete against the production telemetry for quota.
A single operation may generate several telemetry events, so if the application experiences a high volume of
traffic, it is likely to get throttled. To mitigate this problem, you can perform sampling to reduce the telemetry
traffic. The tradeoff is that your metrics will be less precise. For more information, see Sampling in Application
Insights. You can also reduce the data volume by pre-aggregating metrics — that is, calculating statistical values
such as average and standard deviation, and sending those values instead of the raw telemetry. The following blog
post describes an approach to using Application Insights at scale: Azure Monitoring and Analytics at Scale.
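For example, the Application Insights SDK can pre-aggregate locally through its GetMetric API, which accumulates values in memory and periodically sends a summary instead of one telemetry item per measurement. The metric name and value below are illustrative:

using Microsoft.ApplicationInsights;

public class QueueMetrics
{
    private readonly TelemetryClient telemetryClient;

    public QueueMetrics(TelemetryClient telemetryClient) =>
        this.telemetryClient = telemetryClient;

    public void RecordQueueLength(int queueLength)
    {
        // GetMetric aggregates values in memory (count, sum, min, max) and emits
        // a summarized metric at a fixed interval, rather than sending one
        // telemetry item per TrackValue call.
        telemetryClient.GetMetric("DeliveryRequestQueueLength").TrackValue(queueLength);
    }
}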
If your data rate is high enough to trigger throttling, and sampling or aggregation are not acceptable, consider
exporting metrics to a time-series database such as Prometheus or InfluxDB running in the cluster.
InfluxDB is a push-based system; an agent needs to push the metrics. You can use the TICK stack to set up
monitoring of Kubernetes and push metrics to InfluxDB using Telegraf, an agent for collecting and
reporting metrics. InfluxDB can be used for irregular events and string data types.
Prometheus is a pull-based system. It periodically scrapes metrics from configured locations. Prometheus
can scrape metrics generated by cAdvisor or kube-state-metrics. kube-state-metrics is a service that collects
metrics from the Kubernetes API server and makes them available to Prometheus (or a scraper that is
compatible with a Prometheus client endpoint). For system metrics, use Node exporter, which is a
Prometheus exporter for system metrics. Prometheus supports floating point data, but not string data, so it
is appropriate for system metrics but not logs. Kubernetes Metrics Server is a cluster-wide aggregator of
resource usage data.

Logging
Here are some of the general challenges of logging in a microservices application:
Understanding the end-to-end processing of a client request, where multiple services might be invoked to
handle a single request.
Consolidating logs from multiple services into a single aggregated view.
Parsing logs that come from multiple sources, which use their own logging schemas or have no particular
schema. Logs may be generated by third-party components that you don't control.
Microservices architectures often generate a larger volume of logs than traditional monoliths, because there
are more services, network calls, and steps in a transaction. That means logging itself can be a performance or
resource bottleneck for the application.
There are some additional challenges for a Kubernetes-based architecture:
Containers can move around and be rescheduled.
Kubernetes has a networking abstraction that uses virtual IP addresses and port mappings.
In Kubernetes, the standard approach to logging is for a container to write logs to stdout and stderr. The container
engine redirects these streams to a logging driver. For ease of querying, and to prevent possible loss of log data if
a node crashes, the usual approach is to collect the logs from each node and send them to a central storage
location.
Azure Monitor integrates with AKS to support this approach. Azure Monitor collects container logs and sends
them to a Log Analytics workspace. From there, you can use the Kusto query language to write queries across the
aggregated logs. For example, here is a Kusto query to show the container logs for a specified pod:

let ContainerIdList = KubePodInventory
| where ClusterName =~ '<cluster-name>'
| where Name =~ '<pod-name>'
| distinct ContainerID;
ContainerLog
| where ContainerID in (ContainerIdList)

Azure Monitor is a managed service, and configuring an AKS cluster to use Azure Monitor is a simple
configuration switch in the CLI or Resource Manager template. (For more information, see How to enable Azure
Monitor for containers.) Another advantage of using Azure Monitoring is that it consolidates your AKS logs with
other Azure platform logs, providing a unified monitoring experience.
Azure Monitor is billed per gigabyte (GB) of data ingested into the service (see Azure Monitor pricing). At very
high volumes, cost may become a consideration. There are many open-source alternatives available for the
Kubernetes ecosystem. For example, many organizations use Fluentd with Elasticsearch. Fluentd is an open-
source data collector, and Elasticsearch is a document database that is optimized for search. A challenge with these options is
that they require additional configuration and management of the cluster. For a production workload, you may
need to experiment with configuration settings. You'll also need to monitor the performance of the logging
infrastructure.
Application Insights
For richer log data, we recommend instrumenting your code with Application Insights. This requires adding an
Application Insights package to your code and configuring your code to send logging statements to Application
Insights. The details depend on the platform, such as .NET, Java, or Node.js. The Application Insights package sends
telemetry data to Azure Monitor.
If you are using .NET Core, we recommend also using the Application Insights for Kubernetes library. This library
enriches Application Insights traces with additional information such as the container, node, pod, labels, and replica
set.
Advantages of this approach include:
Application Insights logs HTTP requests, including latency and result code.
Distributed tracing is enabled by default.
Traces include an operation ID, so you can match all traces for a particular operation.
Traces generated by Application Insights often have additional contextual information. For example, ASP.NET
traces are decorated with the action name and a category such as ControllerActionInvoker, which give you
insight into the ASP.NET request pipeline.
Application Insights collects performance metrics for performance troubleshooting and optimization.
Considerations:
Application Insights throttles the telemetry if the data rate exceeds a maximum limit; for details, see Application
Insights limits. A single operation may generate several telemetry events, so if the application experiences a
high volume of traffic, it is likely to get throttled.
Because Application Insights batches data, it's possible to lose a batch if a process crashes with an unhandled
exception.
Application Insights is billed based on data volume. For more information, see Manage pricing and data volume
in Application Insights.

Structured logging
To make logs easier to parse, use structured logging where possible. Structured logging is an approach in which the
application writes logs in a structured format, such as JSON, rather than outputting unstructured text strings. There
are many structured logging libraries available. For example, here is a logging statement that uses the Serilog
library for .NET Core:

public async Task<IActionResult> Put([FromBody]Delivery delivery, string id)
{
    logger.LogInformation("In Put action with delivery {Id}: {@DeliveryInfo}", id, delivery.ToLogInfo());

    ...
}

Here, the call to LogInformation includes an Id parameter and DeliveryInfo parameter. With structured logging,
these values are not interpolated into the message string. Instead, the log output will look something like this:

{"@t":"2019-06-13T00:57:09.9932697Z","@mt":"In Put action with delivery {Id}: {@DeliveryInfo}","Id":"36585f2d-


c1fa-4a3d-9e06-a7f40b7d04ef","DeliveryInfo":{...

This is a JSON string, where the "@t" field is a timestamp, "@mt" is the message string, and the remaining
key/value pairs are the parameters. Outputting JSON format makes it easier to query the data in a structured way.
For example, the following Log Analytics query, written in the Kusto query language, searches for instances of this
particular message from all containers named fabrikam-delivery :

traces
| where customDimensions.["Kubernetes.Container.Name"] == "fabrikam-delivery"
| where customDimensions.["{OriginalFormat}"] == "In Put action with delivery {Id}: {@DeliveryInfo}"
| project message, customDimensions["Id"], customDimensions["@DeliveryInfo"]

Viewing the result in the Azure portal shows that DeliveryInfo is a structured record that contains the serialized
representation of the DeliveryInfo model. Here is the JSON from this example:

{
"Id": "36585f2d-c1fa-4a3d-9e06-a7f40b7d04ef",
"Owner": {
"UserId": "user id for logging",
"AccountId": "52dadf0c-0067-43e7-af76-86e32b48bc5e"
},
"Pickup": {
"Altitude": 0.29295161612934972,
"Latitude": 0.26815900219052985,
"Longitude": 0.79841844309047727
},
"Dropoff": {
"Altitude": 0.31507750848078986,
"Latitude": 0.753494655598651,
"Longitude": 0.89352830773849423
},
"Deadline": "string",
"Expedited": true,
"ConfirmationRequired": 0,
"DroneId": "AssignedDroneId01ba4d0b-c01a-4369-ba75-51bde0e76cc9"
}

The previous code snippet used the Serilog library, but structured logging libraries are available for other
languages as well. For example, here's an example using the SLF4J library for Java:

MDC.put("DeliveryId", deliveryId);

log.info("In schedule delivery action with delivery request {}", externalDelivery.toString());

Distributed tracing
A significant challenge of microservices is to understand the flow of events across services. A single transaction
may involve calls to multiple services. To reconstruct the entire sequence of steps, each service should propagate a
correlation ID that acts as a unique identifier for that operation. The correlation ID enables distributed tracing
across services.
The first service that receives a client request should generate the correlation ID. If the service makes an HTTP call
to another service, it puts the correlation ID in a request header. If the service sends an asynchronous message, it
puts the correlation ID into the message. Downstream services continue to propagate the correlation ID, so that it
flows through the entire system. In addition, all code that writes application metrics or log events should include
the correlation ID.
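As a hedged sketch of propagation for outgoing HTTP calls, a DelegatingHandler can copy the current correlation ID onto every request. The header name and the delegate used to obtain the current ID are illustrative; Application Insights and the W3C standard define their own headers and handle this automatically when they are in use:

using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public class CorrelationIdHandler : DelegatingHandler
{
    private const string HeaderName = "x-correlation-id";   // illustrative header name
    private readonly Func<string> getCurrentCorrelationId;

    public CorrelationIdHandler(Func<string> getCurrentCorrelationId) =>
        this.getCurrentCorrelationId = getCurrentCorrelationId;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Propagate the correlation ID from the current operation so that
        // downstream services can join their logs and metrics to it.
        if (!request.Headers.Contains(HeaderName))
        {
            request.Headers.Add(HeaderName, getCurrentCorrelationId());
        }

        return base.SendAsync(request, cancellationToken);
    }
}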
When service calls are correlated, you can calculate operational metrics such as the end-to-end latency for a
complete transaction, the number of successful transactions per second, and the percentage of failed transactions.
Including correlation IDs in application logs makes it possible to perform root cause analysis. If an operation fails,
you can find the log statements for all of the service calls that were part of the same operation.
We recommend using Application Insights for distributed tracing. The Application Insights SDK automatically
injects correlation context into HTTP headers, and includes the correlation ID in Application Insights logs. Some
services may still need to explicitly propagate the correlation headers, depending on the frameworks and libraries
being used. For more information, see Telemetry correlation in Application Insights.
Some additional considerations when implementing distributed tracing:
There is now a standard HTTP header for correlation IDs: a W3C proposal was recently accepted as an
official recommendation. If you don't adopt that standard, your team should standardize on a custom header
value. The choice may be decided by your logging framework, such as Application Insights, or by your choice of service mesh.
For asynchronous messages, if your messaging infrastructure supports adding metadata to messages, you
should include the correlation ID as metadata. Otherwise, include it as part of the message schema. For
example, see Distributed tracing and correlation through Service Bus messaging.
Rather than a single opaque identifier, you might send a correlation context that includes richer information,
such as caller-callee relationships.
If you are using Istio or linkerd as a service mesh, these technologies automatically generate correlation
headers when HTTP calls are routed through the service mesh proxies. Services should forward the relevant
headers.
Example of distributed tracing
This example follows a distributed transaction through a set of microservices. The example is taken from a
reference implementation described here.

In this scenario, the distributed transaction has the following steps:


1. The Ingestion service puts a message on a Service Bus queue.
2. The Workflow service pulls the message from the queue.
3. The Workflow service calls three backend services to process the request (Drone Scheduler, Package, and
Delivery).
The following screenshot shows the application map for the Drone Delivery application. This map shows calls to
the public API endpoint that result in a workflow that involves five microservices.
The arrows from fabrikam-workflow and fabrikam-ingestion to a Service Bus queue show where the messages
are sent and received. You can't tell from the diagram which service is sending messages and which is receiving —
the arrows just show that both services are calling Service Bus — but this information is available in the details:

Because every call includes an operation ID, you can also view the end-to-end steps in a single transaction,
including timing information and the HTTP calls at each step. Here is the visualization of one such transaction:
This visualization shows the steps from the ingestion service to the queue, from the queue to the workflow service,
and from the workflow service to the other backend services. The last step is the Workflow service marking the
Service Bus message as completed.
Now here is an example when calls to a backend service were failing:

This shows that a large fraction (36%) of calls to the Drone Scheduler service failed during the period being
queried. In the end-to-end transaction view, it shows that an exception occurs when sending an HTTP PUT request
to the service.
Further drilling in, the exception turns out to be a socket exception, "No such device or address."

Fabrikam.Workflow.Service.Services.BackendServiceCallFailedException: No such device or address --->
System.Net.Http.HttpRequestException: No such device or address --->
System.Net.Sockets.SocketException: No such device or address

This is a hint that the backend service is not reachable. At this point, you might use kubectl to view the deployment
configuration. In this example, it turned out the service hostname was not resolving, due to an error in the
Kubernetes configuration files. The article Debug Services in the Kubernetes documentation has tips for
diagnosing this sort of error.
Here are some common causes of errors:
Code bugs. These might manifest as:
Exceptions. Look in the Application Insights logs to view the exception details.
Process crashing. Look at container and pod status, and view container logs or Application Insights
traces.
HTTP 5xx errors
Resource exhaustion:
Look for throttling (HTTP 429) or request timeouts.
Examine container metrics for CPU, memory, and disk
Look at the configurations for container and pod resource limits.
Service discovery. Examine the Kubernetes service configuration and port mappings.
API mismatch. Look for HTTP 400 errors. If APIs are versioned, look at which version is being called.
Error pulling a container image. Look at the pod specification. Also make sure the cluster is authorized to pull
from the container registry.
RBAC issues.

Next steps
Learn more about features in Azure Monitor that support monitoring of applications on AKS:
Azure Monitor for containers overview
Understand AKS cluster performance with Azure Monitor for containers
For more information about using metrics for performance tuning, see Performance tuning a distributed
application.
CI/CD for microservices architectures
11/2/2020 • 8 minutes to read

Faster release cycles are one of the major advantages of microservices architectures. But without a good CI/CD
process, you won't achieve the agility that microservices promise. This article describes the challenges and
recommends some approaches to the problem.

What is CI/CD?
When we talk about CI/CD, we're really talking about several related processes: Continuous integration, continuous
delivery, and continuous deployment.
Continuous integration. Code changes are frequently merged into the main branch. Automated build
and test processes ensure that code in the main branch is always production-quality.
Continuous delivery. Any code changes that pass the CI process are automatically published to a
production-like environment. Deployment into the live production environment may require manual
approval, but is otherwise automated. The goal is that your code should always be ready to deploy into
production.
Continuous deployment. Code changes that pass the previous two steps are automatically deployed into
production.
Here are some goals of a robust CI/CD process for a microservices architecture:
Each team can build and deploy the services that it owns independently, without affecting or disrupting
other teams.
Before a new version of a service is deployed to production, it gets deployed to dev/test/QA environments
for validation. Quality gates are enforced at each stage.
A new version of a service can be deployed side by side with the previous version.
Sufficient access control policies are in place.
For containerized workloads, you can trust the container images that are deployed to production.

Why a robust CI/CD pipeline matters


In a traditional monolithic application, there is a single build pipeline whose output is the application executable.
All development work feeds into this pipeline. If a high-priority bug is found, a fix must be integrated, tested, and
published, which can delay the release of new features. You can mitigate these problems by having well-factored
modules and using feature branches to minimize the impact of code changes. But as the application grows more
complex, and more features are added, the release process for a monolith tends to become more brittle and likely
to break.
Following the microservices philosophy, there should never be a long release train where every team has to get in
line. The team that builds service "A" can release an update at any time, without waiting for changes in service "B"
to be merged, tested, and deployed.
To achieve a high release velocity, your release pipeline must be automated and highly reliable to minimize risk. If
you release to production one or more times daily, regressions or service disruptions must be rare. At the same
time, if a bad update does get deployed, you must have a reliable way to quickly roll back or roll forward to a
previous version of a service.

Challenges
Many small independent code bases. Each team is responsible for building its own service, with its own
build pipeline. In some organizations, teams may use separate code repositories. Separate repositories can
lead to a situation where the knowledge of how to build the system is spread across teams, and nobody in
the organization knows how to deploy the entire application. For example, what happens in a disaster
recovery scenario, if you need to quickly deploy to a new cluster?
Mitigation: Have a unified and automated pipeline to build and deploy services, so that this knowledge is
not "hidden" within each team.
Multiple languages and frameworks. With each team using its own mix of technologies, it can be
difficult to create a single build process that works across the organization. The build process must be
flexible enough that every team can adapt it for their choice of language or framework.
Mitigation: Containerize the build process for each service. That way, the build system just needs to be
able to run the containers.
Integration and load testing. With teams releasing updates at their own pace, it can be challenging to
design robust end-to-end testing, especially when services have dependencies on other services. Moreover,
running a full production cluster can be expensive, so it's unlikely that every team will run its own full
cluster at production scales, just for testing.
Release management. Every team should be able to deploy an update to production. That doesn't mean
that every team member has permissions to do so. But having a centralized Release Manager role can
reduce the velocity of deployments.
Mitigation: The more that your CI/CD process is automated and reliable, the less there should be a need
for a central authority. That said, you might have different policies for releasing major feature updates
versus minor bug fixes. Being decentralized doesn't mean zero governance.
Service updates. When you update a service to a new version, it shouldn't break other services that
depend on it.
Mitigation: Use deployment techniques such as blue-green or canary release for non-breaking changes.
For breaking API changes, deploy the new version side by side with the previous version. That way, services
that consume the previous API can be updated and tested for the new API. See Updating services, below.

Monorepo vs. multi-repo


Before creating a CI/CD workflow, you must know how the code base will be structured and managed.
Do teams work in separate repositories or in a monorepo (single repository)?
What is your branching strategy?
Who can push code to production? Is there a release manager role?
The monorepo approach has been gaining favor, but there are advantages and disadvantages to both.

Monorepo advantages: code sharing; easier to standardize code and tooling; easier to refactor code; discoverability (a single view of the code).
Monorepo challenges: changes to shared code can affect multiple microservices; greater potential for merge conflicts; tooling must scale to a large code base; access control; a more complex deployment process.
Multiple repos advantages: clear ownership per team; potentially fewer merge conflicts; helps to enforce decoupling of microservices.
Multiple repos challenges: harder to share code; harder to enforce coding standards; dependency management; a diffuse code base with poor discoverability; lack of shared infrastructure.

Updating services
There are various strategies for updating a service that's already in production. Here we discuss three common
options: Rolling update, blue-green deployment, and canary release.
Rolling updates
In a rolling update, you deploy new instances of a service, and the new instances start receiving requests right
away. As the new instances come up, the previous instances are removed.
Example. In Kubernetes, rolling updates are the default behavior when you update the pod spec for a
Deployment. The Deployment controller creates a new ReplicaSet for the updated pods. Then it scales up the new
ReplicaSet while scaling down the old one, to maintain the desired replica count. It doesn't delete old pods until the
new ones are ready. Kubernetes keeps a history of the update, so you can roll back an update if needed.
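For illustration, here is a minimal Deployment sketch (the service name, image, and replica count are placeholders) that tunes the rolling update behavior with maxSurge and maxUnavailable:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: delivery            # hypothetical service name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: delivery
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1           # create at most one extra pod during the update
      maxUnavailable: 0     # never drop below the desired replica count
  template:
    metadata:
      labels:
        app: delivery
    spec:
      containers:
      - name: delivery
        image: myregistry.azurecr.io/delivery:v1.0.2   # changing this tag triggers a rolling update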
Example. Azure Service Fabric uses the rolling update strategy by default. This strategy is best suited for
deploying a version of a service with new features without changing existing APIs. Service Fabric starts an upgrade
deployment by updating the application type to a subset of the nodes or an update domain. It then rolls forward to
the next update domain until all domains are upgraded. If an upgrade domain fails to update, the application type
rolls back to the previous version across all domains. Be aware that an application type with multiple services is
prone to this kind of rollback if all of its services are updated as part of one upgrade deployment: if one service
fails to update, the entire application is rolled back to the previous version and the other services are not updated.
One challenge of rolling updates is that during the update process, a mix of old and new versions are running and
receiving traffic. During this period, any request could get routed to either of the two versions.
For breaking API changes, a good practice is to support both versions side by side, until all clients of the previous
version are updated. See API versioning.
Blue -green deployment
In a blue-green deployment, you deploy the new version alongside the previous version. After you validate the
new version, you switch all traffic at once from the previous version to the new version. After the switch, you
monitor the application for any problems. If something goes wrong, you can swap back to the old version.
Assuming there are no problems, you can delete the old version.
With a more traditional monolithic or N-tier application, blue-green deployment generally meant provisioning two
identical environments. You would deploy the new version to a staging environment, then redirect client traffic to
the staging environment — for example, by swapping VIP addresses. In a microservices architecture, updates
happen at the microservice level, so you would typically deploy the update into the same environment and use a
service discovery mechanism to swap.
Example . In Kubernetes, you don't need to provision a separate cluster to do blue-green deployments. Instead,
you can take advantage of selectors. Create a new Deployment resource with a new pod spec and a different set of
labels. Create this deployment, without deleting the previous deployment or modifying the service that points to it.
Once the new pods are running, you can update the service's selector to match the new deployment.
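As a minimal sketch (the service name and version labels are hypothetical), the Service initially selects the pods of the current deployment, and the switch is a single selector update once the new pods are validated:

apiVersion: v1
kind: Service
metadata:
  name: delivery
spec:
  selector:
    app: delivery
    version: v1             # currently routes all traffic to the existing (blue) pods
  ports:
  - port: 80
    targetPort: 8080

# After the v2 (green) Deployment is running and validated, switch all traffic at once:
#   kubectl patch service delivery -p '{"spec":{"selector":{"app":"delivery","version":"v2"}}}'
# Rolling back is the same command with the selector pointed back at v1.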
One drawback of blue-green deployment is that during the update, you are running twice as many pods for the
service (current and next). If the pods require a lot of CPU or memory resources, you may need to scale out the
cluster temporarily to handle the resource consumption.
Canary release
In a canary release, you roll out an updated version to a small number of clients. Then you monitor the behavior of
the new service before rolling it out to all clients. This lets you do a slow rollout in a controlled fashion, observe
real data, and spot problems before all customers are affected.
A canary release is more complex to manage than either blue-green or rolling update, because you must
dynamically route requests to different versions of the service.
Example . In Kubernetes, you can configure a Service to span two replica sets (one for each version) and adjust the
replica counts manually. However, this approach is rather coarse-grained, because of the way Kubernetes load
balances across pods. For example, if you have a total of 10 replicas, you can only shift traffic in 10% increments. If
you are using a service mesh, you can use the service mesh routing rules to implement a more sophisticated
canary release strategy.
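For example, assuming a Service that selects app: delivery across both versions and Deployments named delivery-v1 and delivery-v2, a coarse-grained canary by replica count might look like this:

# Shift roughly 10% of traffic to the canary: 9 old pods, 1 new pod
kubectl scale deployment/delivery-v1 --replicas=9
kubectl scale deployment/delivery-v2 --replicas=1

# If the canary looks healthy, promote it and retire the old version
kubectl scale deployment/delivery-v2 --replicas=10
kubectl scale deployment/delivery-v1 --replicas=0

A service mesh lets you split traffic by percentage instead of by pod count, independently of the number of replicas.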

Next steps
Learn specific CI/CD practices for microservices running on Kubernetes.
CI/CD for microservices on Kubernetes
Building a CI/CD pipeline for microservices on
Kubernetes

It can be challenging to create a reliable CI/CD process for a microservices architecture. Individual teams must be
able to release services quickly and reliably, without disrupting other teams or destabilizing the application as a
whole.
This article describes an example CI/CD pipeline for deploying microservices to Azure Kubernetes Service (AKS).
Every team and project is different, so don't take this article as a set of hard-and-fast rules. Instead, it's meant to be
a starting point for designing your own CI/CD process.
The example pipeline described here was created for a microservices reference implementation called the Drone
Delivery application, which you can find on GitHub. The application scenario is described here.
The goals of the pipeline can be summarized as follows:
Teams can build and deploy their services independently.
Code changes that pass the CI process are automatically deployed to a production-like environment.
Quality gates are enforced at each stage of the pipeline.
A new version of a service can be deployed side by side with the previous version.
For more background, see CI/CD for microservices architectures.

Assumptions
For purposes of this example, here are some assumptions about the development team and the code base:
The code repository is a monorepo, with folders organized by microservice.
The team's branching strategy is based on trunk-based development.
The team uses release branches to manage releases. Separate releases are created for each microservice.
The CI/CD process uses Azure Pipelines to build, test, and deploy the microservices to AKS.
The container images for each microservice are stored in Azure Container Registry.
The team uses Helm charts to package each microservice.
These assumptions drive many of the specific details of the CI/CD pipeline. However, the basic approach described
here can be adapted for other processes, tools, and services, such as Jenkins or Docker Hub.

Validation builds
Suppose that a developer is working on a microservice called the Delivery Service. While developing a new
feature, the developer checks code into a feature branch. By convention, feature branches are named feature/* .
The build definition file includes a trigger that filters by the branch name and the source path:

trigger:
batch: true
branches:
include:
# for new release to production: release flow strategy
- release/delivery/v*
- refs/release/delivery/v*
- master
- feature/delivery/*
- topic/delivery/*
paths:
include:
- /src/shipping/delivery/

See the source file.


Using this approach, each team can have its own build pipeline. Only code that is checked into the
/src/shipping/delivery folder triggers a build of the Delivery Service. Pushing commits to a branch that matches
the filter triggers a CI build. At this point in the workflow, the CI build runs some minimal code verification:
1. Build the code.
2. Run unit tests.
The goal is to keep build times short, so the developer can get quick feedback. Once the feature is ready to merge
into master, the developer opens a PR. This triggers another CI build that performs some additional checks:
1. Build the code.
2. Run unit tests.
3. Build the runtime container image.
4. Run vulnerability scans on the image.

NOTE
In Azure DevOps Repos, you can define policies to protect branches. For example, the policy could require a successful CI
build plus a sign-off from an approver in order to merge into master.

Full CI/CD build


At some point, the team is ready to deploy a new version of the Delivery service. The release manager creates a
branch from master with this naming pattern: release/<microservice name>/<semver> . For example,
release/delivery/v1.0.2 .
Creation of this branch triggers a full CI build that runs all of the previous steps plus:
1. Push the container image to Azure Container Registry. The image is tagged with the version number taken
from the branch name.
2. Run helm package to package the Helm chart for the service. The chart is also tagged with a version number.
3. Push the Helm package to Container Registry.
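As a rough sketch of steps 2 and 3 (the chart path, version, and registry name below are placeholders), the equivalent CLI commands are:

helm package charts/delivery --version v1.0.2 --app-version v1.0.2
az acr helm push delivery-v1.0.2.tgz --name myregistry

In the pipeline itself, the version is taken from the release branch name, as shown in the Azure Pipelines tasks later in this article.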
Assuming this build succeeds, it triggers a deployment (CD) process using an Azure Pipelines release pipeline. This
pipeline has the following steps:
1. Deploy the Helm chart to a QA environment.
2. An approver signs off before the package moves to production. See Release deployment control using
approvals.
3. Retag the Docker image for the production namespace in Azure Container Registry. For example, if the current
tag is myrepo.azurecr.io/delivery:v1.0.2 , the production tag is myrepo.azurecr.io/prod/delivery:v1.0.2 .
4. Deploy the Helm chart to the production environment.
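One possible way to implement the retagging step (step 3), assuming a registry named myrepo, is to import the image into the prod namespace without pulling it locally:

az acr import --name myrepo \
  --source myrepo.azurecr.io/delivery:v1.0.2 \
  --image prod/delivery:v1.0.2

You could also pull, retag, and push the image with the Docker CLI; the effect is the same.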
Even in a monorepo, these tasks can be scoped to individual microservices, so that teams can deploy with high
velocity. The process has some manual steps: Approving PRs, creating release branches, and approving
deployments into the production cluster. These steps are manual by policy — they could be automated if the
organization prefers.

Isolation of environments
You will have multiple environments where you deploy services, including environments for development, smoke
testing, integration testing, load testing, and finally production. These environments need some level of isolation.
In Kubernetes, you have a choice between physical isolation and logical isolation. Physical isolation means
deploying to separate clusters. Logical isolation uses namespaces and policies, as described earlier.
Our recommendation is to create a dedicated production cluster along with a separate cluster for your dev/test
environments. Use logical isolation to separate environments within the dev/test cluster. Services deployed to the
dev/test cluster should never have access to data stores that hold business data.
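As a minimal sketch of logical isolation (the namespace name and limits are illustrative), each environment in the dev/test cluster can get its own namespace and resource quota:

apiVersion: v1
kind: Namespace
metadata:
  name: integration-test
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: integration-test-quota
  namespace: integration-test
spec:
  hard:
    requests.cpu: "8"        # cap the total CPU requested in this environment
    requests.memory: 16Gi
    pods: "50"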

Build process
When possible, package your build process into a Docker container. That allows you to build your code artifacts
using Docker, without needing to configure the build environment on each build machine. A containerized build
process makes it easy to scale out the CI pipeline by adding new build agents. Also, any developer on the team can
build the code simply by running the build container.
By using multi-stage builds in Docker, you can define the build environment and the runtime image in a single
Dockerfile. For example, here's a Dockerfile that builds an ASP.NET Core application:
FROM mcr.microsoft.com/dotnet/core/runtime:3.1 AS base
WORKDIR /app

FROM mcr.microsoft.com/dotnet/core/sdk:3.1 AS build


WORKDIR /src/Fabrikam.Workflow.Service

COPY Fabrikam.Workflow.Service/Fabrikam.Workflow.Service.csproj .
RUN dotnet restore Fabrikam.Workflow.Service.csproj

COPY Fabrikam.Workflow.Service/. .
RUN dotnet build Fabrikam.Workflow.Service.csproj -c Release -o /app

FROM build AS publish


RUN dotnet publish Fabrikam.Workflow.Service.csproj -c Release -o /app

FROM base AS final


WORKDIR /app
COPY --from=publish /app .
ENTRYPOINT ["dotnet", "Fabrikam.Workflow.Service.dll"]

See the source file.


This Dockerfile defines several build stages. Notice that the stage named base uses the ASP.NET Core runtime,
while the stage named build uses the full ASP.NET Core SDK. The build stage is used to build the ASP.NET Core
project. But the final runtime container is built from base , which contains just the runtime and is significantly
smaller than the full SDK image.
Building a test runner
Another good practice is to run unit tests in the container. For example, here is part of a Docker file that builds a
test runner:

FROM build AS testrunner


WORKDIR /src/tests

COPY Fabrikam.Workflow.Service.Tests/*.csproj .
RUN dotnet restore Fabrikam.Workflow.Service.Tests.csproj

COPY Fabrikam.Workflow.Service.Tests/. .
ENTRYPOINT ["dotnet", "test", "--logger:trx"]

A developer can use this Docker file to run the tests locally:

docker build . -t delivery-test:1 --target=testrunner


docker run -p 8080:8080 delivery-test:1

The CI pipeline should also run the tests as part of the build verification step.
Note that this file uses the Docker ENTRYPOINT command to run the tests, not the Docker RUN command.
If you use the RUN command, the tests run every time you build the image. By using ENTRYPOINT , the tests are
opt-in. They run only when you explicitly target the testrunner stage.
A failing test doesn't cause the Docker build command to fail. That way you can distinguish container build
failures from test failures.
Test results can be saved to a mounted volume.
Container best practices
Here are some other best practices to consider for containers:
Define organization-wide conventions for container tags, versioning, and the naming of resources
deployed to the cluster (pods, services, and so on). That makes it easier to diagnose deployment issues.
During the development and test cycle, the CI/CD process will build many container images. Only some of
those images are candidates for release, and then only some of those release candidates will get promoted
to production. Have a clear versioning strategy, so that you know which images are currently deployed to
production, and can roll back to a previous version if necessary.
Always deploy specific container version tags, not latest .
Use namespaces in Azure Container Registry to isolate images that are approved for production from
images that are still being tested. Don't move an image into the production namespace until you're ready to
deploy it into production. If you combine this practice with semantic versioning of container images, it can
reduce the chance of accidentally deploying a version that wasn't approved for release.
Follow the principle of least privilege by running containers as a nonprivileged user. In Kubernetes, you can
create a pod security policy that prevents containers from running as root. See Prevent Pods From Running
With Root Privileges.
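In addition to a cluster-wide policy, you can state the requirement directly in the pod spec. Here is a minimal sketch (the names and user ID are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: delivery
spec:
  securityContext:
    runAsNonRoot: true           # reject images that try to run as root
    runAsUser: 1000              # run as an unprivileged user ID
  containers:
  - name: delivery
    image: myregistry.azurecr.io/delivery:v1.0.2
    securityContext:
      allowPrivilegeEscalation: false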

Helm charts
Consider using Helm to manage building and deploying services. Here are some of the features of Helm that help
with CI/CD:
Often a single microservice is defined by multiple Kubernetes objects. Helm allows these objects to be
packaged into a single Helm chart.
A chart can be deployed with a single Helm command, rather than a series of kubectl commands.
Charts are explicitly versioned and use semantic versioning. Helm lets you release a version, view the release
history, and roll back to a previous version.
Helm charts use templates to avoid duplicating information, such as labels and selectors, across many files.
Helm can manage dependencies between charts.
Charts can be stored in a Helm repository, such as Azure Container Registry, and integrated into the build
pipeline.
For more information about using Container Registry as a Helm repository, see Use Azure Container Registry as a
Helm repository for your application charts.

IMPORTANT
This feature is currently in preview. Previews are made available to you on the condition that you agree to the supplemental
terms of use. Some aspects of this feature may change prior to general availability (GA).

A single microservice may involve multiple Kubernetes configuration files. Updating a service can mean touching
all of these files to update selectors, labels, and image tags. Helm treats these as a single package called a chart
and allows you to easily update the YAML files by using variables. Helm uses a template language (based on Go
templates) to let you write parameterized YAML configuration files.
For example, here's part of a YAML file that defines a deployment:
apiVersion: apps/v1beta2
kind: Deployment
metadata:
name: {{ include "package.fullname" . | replace "." "" }}
labels:
app.kubernetes.io/name: {{ include "package.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
annotations:
kubernetes.io/change-cause: {{ .Values.reason }}

...

spec:
containers:
- name: &package-container_name fabrikam-package
image: {{ .Values.dockerregistry }}/{{ .Values.image.repository }}:{{ .Values.image.tag }}
imagePullPolicy: {{ .Values.image.pullPolicy }}
env:
- name: LOG_LEVEL
value: {{ .Values.log.level }}

See the source file.


You can see that the deployment name, labels, and container spec all use template parameters, which are provided
at deployment time. For example, from the command line:

helm install $HELM_CHARTS/package/ \


--set image.tag=0.1.0 \
--set image.repository=package \
--set dockerregistry=$ACR_SERVER \
--namespace backend \
--name package-v0.1.0

Although your CI/CD pipeline could install a chart directly to Kubernetes, we recommend creating a chart archive
(.tgz file) and pushing the chart to a Helm repository such as Azure Container Registry. For more information, see
Package Docker-based apps in Helm charts in Azure Pipelines.
Consider deploying Helm to its own namespace and using role-based access control (RBAC) to restrict which
namespaces it can deploy to. For more information, see Role-based Access Control in the Helm documentation.
Revisions
Helm charts always have a version number, which must use semantic versioning. A chart can also have an
appVersion. This field is optional, and doesn't have to be related to the chart version. Some teams might want to
manage application versions separately from updates to the charts. But a simpler approach is to use one version number,
so there's a 1:1 relation between chart version and application version. That way, you can store one chart per
release and easily deploy the desired release:

helm install <package-chart-name> --version <desiredVersion>

Another good practice is to provide a change-cause annotation in the deployment template:


apiVersion: apps/v1beta2
kind: Deployment
metadata:
name: {{ include "delivery.fullname" . | replace "." "" }}
labels:
...
annotations:
kubernetes.io/change-cause: {{ .Values.reason }}

This lets you view the change-cause field for each revision, using the kubectl rollout history command. In the
previous example, the change-cause is provided as a Helm chart parameter.

$ kubectl rollout history deployments/delivery-v010 -n backend


deployment.extensions/delivery-v010
REVISION CHANGE-CAUSE
1 Initial deployment

You can also use the helm list command to view the revision history:

~$ helm list
NAME             REVISION   UPDATED                    STATUS     CHART             APP VERSION   NAMESPACE
delivery-v0.1.0  1          Sun Apr 7 00:25:30 2019    DEPLOYED   delivery-v0.1.0   v0.1.0        backend

TIP
Use the --history-max flag when initializing Helm. This setting limits the number of revisions that Tiller saves in its history.
Tiller stores revision history in configmaps. If you're releasing updates frequently, the configmaps can grow very large unless
you limit the history size.
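For example, with Helm 2 (which this article assumes, because it uses Tiller), the flag is passed when initializing Helm:

helm init --history-max 200   # keep at most 200 revisions per release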

Azure DevOps Pipeline


In Azure Pipelines, pipelines are divided into build pipelines and release pipelines. The build pipeline runs the CI
process and creates build artifacts. For a microservices architecture on Kubernetes, these artifacts are the
container images and Helm charts that define each microservice. The release pipeline runs the CD process that
deploys a microservice into a cluster.
Based on the CI flow described earlier in this article, a build pipeline might consist of the following tasks:
1. Build the test runner container.

- task: Docker@1
inputs:
azureSubscriptionEndpoint: $(AzureSubscription)
azureContainerRegistry: $(AzureContainerRegistry)
arguments: '--pull --target testrunner'
dockerFile: $(System.DefaultWorkingDirectory)/$(dockerFileName)
imageName: '$(imageName)-test'

2. Run the tests, by invoking docker run against the test runner container.
- task: Docker@1
inputs:
azureSubscriptionEndpoint: $(AzureSubscription)
azureContainerRegistry: $(AzureContainerRegistry)
command: 'run'
containerName: testrunner
volumes: '$(System.DefaultWorkingDirectory)/TestResults:/app/tests/TestResults'
imageName: '$(imageName)-test'
runInBackground: false

3. Publish the test results. See Build an image.

- task: PublishTestResults@2
inputs:
testResultsFormat: 'VSTest'
testResultsFiles: 'TestResults/*.trx'
searchFolder: '$(System.DefaultWorkingDirectory)'
publishRunAttachments: true

4. Build the runtime container.

- task: Docker@1
inputs:
azureSubscriptionEndpoint: $(AzureSubscription)
azureContainerRegistry: $(AzureContainerRegistry)
dockerFile: $(System.DefaultWorkingDirectory)/$(dockerFileName)
includeLatestTag: false
imageName: '$(imageName)'

5. Push the container image to Azure Container Registry (or another container registry).

- task: Docker@1
inputs:
azureSubscriptionEndpoint: $(AzureSubscription)
azureContainerRegistry: $(AzureContainerRegistry)
command: 'Push an image'
imageName: '$(imageName)'
includeSourceTags: false

6. Package the Helm chart.

- task: HelmDeploy@0
inputs:
command: package
chartPath: $(chartPath)
chartVersion: $(Build.SourceBranchName)
arguments: '--app-version $(Build.SourceBranchName)'

7. Push the Helm package to Azure Container Registry (or other Helm repository).

- task: AzureCLI@1
inputs:
azureSubscription: $(AzureSubscription)
scriptLocation: inlineScript
inlineScript: |
az acr helm push $(System.ArtifactsDirectory)/$(repositoryName)-$(Build.SourceBranchName).tgz --name $(AzureContainerRegistry);
See the source file.
The output from the CI pipeline is a production-ready container image and an updated Helm chart for the
microservice. At this point, the release pipeline can take over. It performs the following steps:
Deploy to dev/QA/staging environments.
Wait for an approver to approve or reject the deployment.
Retag the container image for release.
Push the release tag to the container registry.
Upgrade the Helm chart in the production cluster.
For more information about creating a release pipeline, see Release pipelines, draft releases, and release options.
The following diagram shows the end-to-end CI/CD process described in this article:

Next steps
This article was based on a reference implementation that you can find on GitHub.
Monoliths to microservices using domain-driven
design

This article describes how to use domain-driven design (DDD) to migrate a monolithic application to microservices.
A monolithic application is typically an application system in which all of the relevant modules are packaged
together as a single deployable unit of execution. For example, it might be a Java Web Application (WAR) running
on Tomcat or an ASP.NET application running on IIS. A typical monolithic application uses a layered design, with
separate layers for UI, application logic, and data access.

These systems start small but tend to grow over time to meet business needs. At some point, as new features are
added, a monolithic application can begin to suffer from the following problems:
The individual parts of the system cannot be scaled independently, because they are tightly coupled.
It is hard to maintain the code, because of tight coupling and hidden dependencies.
Testing becomes harder, increasing the probability of introducing vulnerabilities.
These problems can become an obstacle to future growth and stability. Teams become wary of making changes,
especially if the original developers are no longer working on the project and design documents are sparse or
outdated.
Despite these limitations, a monolithic design can make sense as a starting point for an application. Monoliths are
often the quickest path to building a proof-of-concept or minimal viable product. In the early phases of
development, monoliths tend to be:
Easier to build, because there is a single shared code base.
Easier to debug, because the code runs within a single process and memory space.
Easier to reason about, because there are fewer moving parts.
As the application grows in complexity, however, these advantages can disappear. Large monoliths often become
progressively harder to build, debug, and reason about. At some point, the problems outweigh the benefits. This is
the point when it can make sense to migrate the application to a microservices architecture. Unlike monoliths,
microservices are typically decentralized, loosely coupled units of execution. The following diagram shows a typical
microservices architecture:
Migrating a monolith to a microservices architecture requires significant time and investment to avoid failures or overruns. To
ensure that any migration is successful, it's good to understand both the benefits and also challenges that
microservices bring. The benefits include:
Services can evolve independently based on user needs.
Services can scale independently to meet user demand.
Over time, development cycles become faster as features can be released to market quicker.
Services are isolated and are more tolerant of failure.
A single service that fails will not bring down the entire application.
Testing becomes more coherent and consistent, using behavior-driven development.
For more information about the benefits and challenges of microservices, see Microservices architecture style.

Apply domain-driven design


Any migration strategy should allow teams to incrementally refactor the application into smaller services, while still
providing continuity of service to end users. Here's the general approach:
Stop adding functionality to the monolith.
Split the front end from the back end.
Decompose and decouple the monolith into a series of microservices.
To help facilitate this decomposition, a viable software development approach is to apply the principles of domain-
driven design (DDD).
Domain Driven Design (DDD) is a software development approach first introduced by Eric Evans. DDD requires a
good understanding of the domain for which the application will be written. The necessary domain knowledge to
create the application resides within the people who understand it — the domain experts.
The DDD approach can be applied retroactively to an existing application, as a way to begin decomposing the
application.
1. Start with a ubiquitous language, a common vocabulary that is shared between all stakeholders.
2. Identify the relevant modules in the monolithic application and apply the common vocabulary to them.
3. Define the domain models of the monolithic application. The domain model is an abstract model of the
business domain.
4. Define bounded contexts for the models. A bounded context is the boundary within a domain where a
particular domain model applies. Apply explicit boundaries with clearly defined models and responsibilities.
The bounded contexts identified in step 4 are candidates for refactoring into smaller microservices. The following
diagram shows the existing monolith with the bounded contexts overlaid:

For more information about using a DDD approach for microservices architectures, see Using domain analysis to
model microservices.

Use glue code (anti-corruption layer)


While this investigative work is carried out to inventory the monolithic application, new functionality can be added
by applying the principles of DDD as separate services. "Glue code" allows the monolithic application to proxy calls
to the new service to obtain new functionality.

The glue code (adapter pattern) effectively acts as an anti-corruption layer, ensuring that the new service is not
polluted by data models required by the monolithic application. The glue code helps to mediate interactions
between the two and ensures that only data required by the new service is passed to enable compatibility.
Through the process of refactoring, teams can inventory the monolithic application and identify candidates for
microservices refactoring while also establishing new functionality with new services.
For more information about anti-corruption layers, see Anti-Corruption Layer pattern.

Create a presentation layer


The next step is to separate the presentation layer from the backend layer. In a traditional n-tier application, the
application (business) layer tends to be the components that are core to the application and have domain logic
within them. These coarse-grained APIs interact with the data access layer to retrieve persisted data from within a
database. These APIs establish a natural boundary to the presentation tier, and help to decouple the presentation
tier into a separate application space.
The following diagram shows the presentation layer (UI) split out from the application logic and data access layers.

This diagram also introduces another layer, the API gateway, that sits between the presentation layer and the
application logic. The API gateway is a façade layer that provides a consistent and uniform interface for the
presentation layer to interact with, while allowing downstream services to evolve independently, without affecting
the application. The API Gateway may use a technology such as Azure API Management, and allows the application
to interact in a RESTful manner.
The presentation tier can be developed in any language or framework that the team has expertise in, such as a
single page application or an MVC application. These applications interact with the microservices via the gateway,
using standard HTTP calls. For more information about API Gateways, see Using API gateways in microservices.

Start to retire the monolith


At this stage, the team can begin peeling away the monolithic application and slowly extract the services that have
been established by their bounded contexts into their own set of microservices. The microservices can expose a
RESTful interface for the application layer to interact with, through the API gateway, with glue code in place to
communicate with the monolith in specific circumstances.
As you continue to peel away the monolith, eventually there will come the point when it no longer needs to exist,
and the microservices have been successfully extracted from the monolith. At this point, the anti-corruption layer
(glue code) can safely be removed.
This approach is an example of the Strangler pattern and allows for a controlled decomposition of a monolith into a
set of microservices. Over time, as existing functionality is moved into microservices, the monolith will shrink in
size and complexity, to the point that it no longer exists.

Next steps
When the application has been decomposed into constituent microservices, it becomes possible to use modern
orchestration tools such as Azure DevOps to manage the lifecycle of each service. For more information, see CI/CD
for microservices architectures.
Modernize enterprise applications with Azure Service
Fabric

This article provides guidelines for moving Windows applications to an Azure compute platform without rewriting.
This migration uses container support in Azure Service Fabric.
A typical approach for migrating existing workloads to the cloud is the lift-and-shift strategy. In IaaS virtual machine
(VM) migrations, you provision VMs with network and storage components and deploy the existing applications
onto those VMs. Unfortunately, lift-and-shift often results in overprovisioning and overpaying for compute
resources. Another approach is to move to PaaS platforms or refactor code into microservices and run in newer
serverless platforms. But those options typically involve changing existing code.
Containers and container orchestration offer improvements. Containerizing an existing application enables it to run
on a cluster with other applications. It provides tight control over resources, scaling, and shared monitoring and
DevOps.
Optimizing and provisioning the right amount of compute resources for containerization isn't trivial. Service
Fabric's orchestration allows an organization to migrate Windows and Linux applications to a runtime platform
without changing code and to scale the needs of the application without overprovisioning VMs. The result is better
density, better hardware use, simplified operations, and overall lower cloud-compute costs.
An enterprise can use Service Fabric as a platform to run a large set of existing Windows-based web applications
with improved density, monitoring, consistency, and DevOps, all within a secure extended private network in the
cloud. The principle is to use Docker and Service Fabric's containerization support that packages and hosts existing
web applications on a shared cluster with shared monitoring and operations, in order to maximize cloud compute
resources for the ideal performance-to-cost ratio.
This article describes the processes, capabilities, and Service Fabric features that enable containerizing in an optimal
environment for a large enterprise. The guidance is scoped to web applications and Windows containers. Before
reading this article, get familiar with core Windows container and Service Fabric concepts. For more information,
see:
Create your first Service Fabric container application on Windows
Service Fabric terminology overview
Service Fabric best practices overview

Resources
Sample: Modernization templates and scripts.
The repo has these resources:
An example Azure Resource Manager template to bring up an Azure Service Fabric cluster.
A reverse-proxy solution for brokering web requests into the Service Fabric cluster to the destination containers.
Sample Service Fabric application configuration and scripts that show the use of placement, resource
constraints, and autoscaling.
Sample scripts and Docker files that build and package an existing web application.
Customize the templates in this repo for your cluster. The templates implement the best practices described in this
article.
Evaluate requirements
Before containerizing existing applications, evaluate requirements. Select applications that are right for migration,
choose the right developer workstation, and determine network requirements.
Application selection
First, determine which applications are best suited for a containerized platform, as opposed to full virtual machines
or a pure PaaS environment. With Service Fabric, a shared application platform lets containerized applications share
Windows Server hosts: each Service Fabric host can run multiple different applications in isolated Windows containers.
Consider creating a set of criteria to determine such applications. Here are some example criteria of containerized
Windows applications in Service Fabric.
HTTP/HTTPS web and application tiers without database dependency.
Stateless web applications.
Built with .NET Framework versions 3.5 and later.
Do not have hardware dependency or access device drivers.
Applications can run on Windows Server 2016 and later versions.
All dependencies can be containerized, such as most .NET assemblies, WCF, and COM+.

NOTE
Dependencies that cannot be containerized include MSMQ (currently supported only in preview releases of Windows
Server Core after version 1709).

Applications can compile and build in Visual Studio.


For the web applications, databases, and other required servers (such as Active Directory) exist outside the Service
Fabric cluster in IaaS VMs, PaaS, or on-premises.
Developer workstation requirements
From an application development perspective, determine the workstation requirements.
Docker for Windows is required for developers to containerize and test their applications prior to deployment.
Visual Studio Docker support is required. Standardize on the latest version of Visual Studio for the best Docker
compatibility.
If workstations don't have enough hardware resources to meet those requirements, use Azure compute
resources for speed and productivity gains. One option is Azure DevTest Labs. Docker for Windows
and Visual Studio 2017 require a minimum of 8 GB of memory.
Networking requirements
Service Fabric orchestration provides a platform for hosting, deploying, scaling, and operating applications at
enterprise scale. Most large enterprises that use Azure:
Extend their corporate network with a private address space to an Azure subscription, and use either ExpressRoute
or a Site-to-Site VPN to provide secure on-premises connectivity.
Want to control inbound and outbound network traffic through third-party firewall appliances and/or Azure
Network Security Group rules.
Want tight control over the address space requirements and subnets.
Service Fabric is suitable as a containerization platform. It plugs into an existing cloud infrastructure and doesn't
require open public ingress endpoints. You just need to carve out the necessary address space for Service Fabric's
IP address requirements. For details, see the Service Fabric Networking section in this article.

Containerize existing Windows applications


After you've determined the applications that meet the selection criteria, containerize them into Docker images. The
result is containerized .NET web application running in IIS where all tiers run in one container.

NOTE
You can use multiple containers; one per tier.

Here are the basic steps for containerizing an application.


1. Open the project in Visual Studio.
2. Make sure the project compiles and runs locally on the developer workstation.
3. Add a Dockerfile to the project. This Dockerfile example shows a basic .NET MVC application.

FROM mcr.microsoft.com/dotnet/framework/aspnet:4.8
ADD PublishOutput/ /inetpub/wwwroot

# add a certificate and configure SSL


SHELL ["powershell","-command"]
RUN New-Item -ItemType Directory C:\install
ADD installwebsite.ps1 c:\install\
ADD wildcard.pfx c:\install\
# Powershell script to configure SSL on the website
RUN c:\install\configurewebsite

# plugin into SF healthcheck ensuring the container website is running


HEALTHCHECK --interval=30s --timeout=30s --start-period=60s --retries=3 CMD curl -f http://localhost/ ||
exit 1

4. Test locally by using Docker For Windows. The application must successfully run in a Docker container by
using the Visual Studio debug experience. For more information, see Deploy a .NET app using Docker
Compose.
5. Build (if needed), tag, and push the tested image to a Docker registry, like the Azure Container Registry
service. This example uses an existing Azure Container Registry named MyAcr and Docker build/tag/push to
build/deploy appA to the registry.

docker login myacr.azurecr.io -u myacr -p <pwd>


docker build -t appa .
docker tag appa myacr.azurecr.io/appa:1.0
docker push myacr.azurecr.io/appa:1.0

The image is tagged with a version number that Service Fabric references when it deploys and versions the
container. Azure DevOps encapsulates and executes the manual Docker build/tag/push process. DevOps details are
described in the DevOps and CI/CD section.

NOTE
In the preceding example, the base image is "mcr.microsoft.com/dotnet/framework/aspnet:4.8" from DockerHub.

Here are some considerations about the base images:


The base image could be a locked-down custom enterprise image that enforces enterprise requirements. For a
shared application, isolation boundaries can be created through credentials or by using separate registry. It's
recommended that enterprise-supported docker images be kept separately and stored in an isolated container
registry.
Avoid storing the registry login credentials in configuration files. Instead, use role-based access control (RBAC)
and Azure Active Directory service principals with Azure Container Registry. Provide read-only access to
registries depending on your enterprise requirements.
For information about running an IIS ASP.net MVC application in a Windows container, see Migrating ASP.NET MVC
Applications to Windows Containers.

Service Fabric cluster configuration for enterprise deployments


To deploy a Service Fabric cluster, start with the sample Azure Resource Manager template in this GitHub Repo and
customize it to fit your requirements. You also deploy a cluster through the Azure portal, but that option should be
used for development/test provisioning.
Service Fabric node types
A Service Fabric cluster contains one or more node types. A node type typically maps to an Azure virtual machine
scale set, which is a set of one or more VM instances with the same properties. The cluster must have at least one
node type, the primary node type, which runs the Service Fabric system services. Do not run your application
containers on the primary node type, because the containers can compete with the system services for resources.
Consider designing a cluster with two or more node types, depending on the types of applications. For example, if you have
two node types, run HTTP web and application tier containers on a non-primary node type.
This example infrastructure uses two Service Fabric node types: application and primary. You can scale in or scale
out the scale set individually. It doesn't require extensive planning and testing up front to determine the correct
application node type size (scale set instance count), because the actual size can grow and reduce in real time
driven by monitoring and metrics.

In the preceding image, the application node type has three nodes that run three containerized applications A, B,
and C. The containers can scale on demand based on CPU and memory usage. The hosts can scale out as more
applications are deployed to the cluster. Azure Load Balancer has a public IP. The primary nodes are in an internal
subnet. It has a load balancer that isn't publicly exposed.
Placement constraints
Use placement constraints that target the non-primary node type to reserve the primary node type for system
services.
This approach enables you to configure each scale set with a separate virtual network subnet. A unique subnet for
each node type uses network security group rules for inbound/outbound access to and from the subnet and
controls network flow. You can configure the primary node type with a private load balancer so that external traffic
can't access Service Fabric management and application deployment. If the application node type wants to expose
application endpoints publicly, configure it with a separate load balancer with security configuration. For an
example template that uses Windows Service with NSG rule and multiple node types, see 7 Node, 3 node type
secure Windows Service Fabric Cluster with NSG.
A scale set associated with a node type can reliably scale out to 100 VM instances by using a single placement
group as applications are added to the cluster. The primary node type often doesn't need as many instances and
can run with 5-7 nodes, depending on durability and reliability requirements.
Service Fabric networking
Service Fabric supports two networking modes for containerized applications: nat and Open. For large
enterprise clusters that host multiple applications, use the Open mode. For more information, see Container
Networking and Constraints.
nat
By default, the cluster brings up containers by using a NAT-bridge mode to the host VM. The NAT bridge
routes requests over a defined port to the container. With this mode, only one IP address is needed per host
VM for the host's primary NIC.
To route traffic to each application container, a unique port is exposed through the load balancer. However,
that port is exposed to end users. If you don't want the port exposed, provide a URL rewrite mechanism.
Rewrite the application domain name with a unique application port. Traffic is routed to the load balancer
that front ends the cluster. One option for the rewrite mechanism is Azure Application Gateway.
Another benefit of this approach is simple load balancing with Azure Load Balancer. The load balancer's
probe mechanism balances traffic across the VM instances that are running the application's containers.
Open
The Open mode assigns an IP address to each running container on the host VM from the cluster's virtual
network subnet. Each host is pre-allocated with a set of IP addresses. Each container on the host is assigned
an IP from the virtual network range. You can configure this mode in the cluster Azure Resource Manager
template during cluster creation. The example infrastructure demonstrates the Open mode.
Benefits of the Open mode:
Makes connecting to application containers simple.
Provides application traceability; that is, the assigned enterprise-friendly IP address is constant for the life of
the container.
Is efficient with Windows containers.
There are downsides:
The number of IP addresses must be set aside during cluster creation and that number is fixed. For example, if
10 IP addresses are assigned to each host, the host supports up to 9 application containers; one IP is reserved
for the host VM's primary NIC, the remaining addresses for each container. The number of IP addresses to
configure for each host is determined based on the hardware size for the node type, and the maximum number
of containers on each host. Both factors depend on application size and needs. If you need more containers in
the cluster, add more application node type VM instances. That is because you can't change the number of IP
addresses per instance without rebuilding the cluster.
You need a reverse proxy to route traffic to the correct destination container. The Service Fabric DNS Service can
be used by the reverse proxy to look up the application container and rewrite the HTTP request to this container.
That solution is more complex to implement than the nat mode.
For more information, see Service Fabric container networking modes.
Service Fabric runtime and Windows versions
Choose the version of Windows Server you want to run as the infrastructure host OS. The choice depends on the
testing and readiness at the time of first deployment. Supported versions are Windows Server with Containers
1607, 1709, 1803.
Also choose a version of the Service Fabric runtime to start the cluster. For a new containerization initiative, use the
latest version (Service Fabric 6.4 or later).
Other considerations:
The host OS running Service Fabric can be built by using locked down enterprise images with antivirus and
malware enforcement.
The host OS for all the Service Fabric cluster VMs can be domain joined. Domain joining isn't a requirement and
can add complexity to any Azure virtual machine. However, there are benefits. For example, if an application
requires Windows-integrated security to connect to domain-joined resources, then the domain service account
is typically used at the process level. The account is used to execute the application instead of connection string
credentials and secrets. Windows containers do not currently support direct domain joining but can be
configured to use group Managed Service Accounts (gMSA). The gMSA capability is supported by Service
Fabric for Windows-integrated security requirements. Each containerized application in the cluster can have its
own gMSA. gMSA requires the Service Fabric host VMs that run containers to be Active Directory domain
joined.
DNS service
Service Fabric has an internal DNS service that maps a containerized application name to its location in the cluster.
For the Open mode, the application's IP address is used instead. This service is enabled on the cluster. If you name
each service with a DNS name, traffic is routed to the application by using a reverse proxy. For information about
reverse proxy, see the Reverse proxy for inbound traffic section.
Monitoring and diagnostics
The Service Fabric Log Analytics workspace and Service Fabric solution provide detailed information about cluster
events. For information about setting it up, see Configure Azure Monitor logs to collect cluster events.
Install the monitoring agent for Azure Log Analytics in the Service Fabric cluster. You can then use the container
monitoring solution and view the running containers in the cluster.
Use Docker performance statistics to monitor memory and CPU use for each container.
Install the Azure Diagnostics extension that collects diagnostic data on a deployed application. For more
information, see What is Azure Diagnostics extension.
Unused container images
Docker images are downloaded to each Service Fabric host and can consume space on the host disk. To free up disk
space, consider image pruning to remove images that are no longer referenced and used by running containers.
Configure this option in the Host section of the cluster manifest.
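As a sketch, image pruning can be enabled through the cluster's fabricSettings in the Azure Resource Manager template; the section and parameter names below reflect the Hosting settings and should be verified against your Service Fabric version:

"fabricSettings": [
  {
    "name": "Hosting",
    "parameters": [
      { "name": "PruneContainerImages", "value": "true" }
    ]
  }
]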
Secrets and certificates management with Key Vault
Avoid hardcoded secrets in the template. Modify the Azure Resource Manager template to add certificates (cluster
certificate) and secrets (host OS admin password) from Azure Key Vault during cluster deployment. To safely store
and retrieve secrets:
1. Create a Key Vault in a separate resource group in the same region as the cluster.
2. Protect the Key Vault with RBAC.
3. Load the cluster certificate and cluster password in Key Vault.
4. Reference this external Key Vault from the Azure Resource Manager template.
Cluster capacity planning
For an enterprise production cluster, start with 5-7 primary nodes, depending on the intended size of the overall
cluster. For more information, see Service Fabric cluster capacity planning considerations.
For a large cluster that hosts stateless web applications, the typical hardware size for the primary node type can be
Standard_D2s_v3. For the primary node type, a 5-node cluster with the Silver durability tier should scale to
containers across 50 application VM instances. Consider doing simulation tests and add more node types as
partitioning needs become apparent. Also perform tests to determine the VM size for application node type and the
number of running containers on each VM instance. Typically, 10 or more containers run on each VM, but that
number is dependent on these factors:
Resource needs, such as CPU and memory use, of the containerized applications. Web applications tend to be
more CPU intensive than memory intensive.
The compute size and available resources of the chosen node VM for the application node type.
The number of IP addresses configured per node. Each container requires an IP with Open mode. Given 15
containers on a VM, 16 IP addresses must be allocated for each VM in the application node type.
Resource needs of the underlying OS. The recommendation is to leave 25% of the resources on the VM for OS
system processes.
Here are some recommendations:
Choose recent Windows Server builds, because older builds of Windows Server (1607) use larger container image sizes compared to newer versions (1709 and later).
Select the most appropriate VM compute size and initial application node type VM count. Standard_D8s_v3,
Standard_D16s_v3, and Standard_D32s_v3 are example compute sizes that are known to work well for this tier for
running containerized web applications.
Use premium storage for the production cluster because container images must be read quickly from disk. Attach
one data disk to each VM and move the Docker EE installation to that disk. That way there is enough log space for
each running container. A standard VM with the default 127-GB OS disk fills up because of container logs. A 500 GB
to 1 TB data disk is recommended per VM for the application node type scale set.
Availability and resiliency planning
Service Fabric spreads VM instances across fault and update domains to ensure maximum availability and
resilience. For an enterprise-scale production cluster, as a minimum target the Silver tier (5-node cluster) for
reliability and durability when hosting containerized stateless web applications.
Scale planning
There are two aspects to consider:
Scale in or scale out the scale set associated with the application node type by adding or removing VM
instances. VMs can be added and removed automatically through autoscaling or manually through scripts. For
an enterprise cluster, do not add or remove nodes frequently. It is important that you disable nodes in Service
Fabric before deleting them. Do not delete seed nodes from your cluster. Monitor the cluster VMs with alerts to
trigger when VM resources exceed threshold values.
Scale the application's container count across the scale set instances in the application node type.
For the application node type, start with the minimal required nodes to support the containerized applications and
ensure high availability. Have extra nodes in the cluster. A node can be removed from the cluster for maintenance
and its running containers can be temporarily moved to other nodes. The cluster can grow statically by using the
Add-AzureRmServiceFabricNode cmdlet, or dynamically by using the scale set autoscale capability.
Starting with Service Fabric 6.2, application containers can autoscale individually through configuration. The
configuration is based on Docker statistics for CPU and memory use on the host. Together, those capabilities can
optimize the compute cost.
Rather than overprovisioning a VM to run an application, deploy a container to the cluster, monitor it, and then
scale out with additional containers, as necessary. Choosing the right size is easier because Docker statistics are
used to determine the number of containers.
In the example infrastructure, application A has two containers deployed across the cluster that divide load. This
approach allows the application and container combination to be adjusted later for optimization.

NOTE
Docker statistics showing individual container resource utilization is sent to Log Analytics and can be analyzed in Azure
Monitor.

Service Fabric offers constant monitoring and health checks across the cluster. If a node is unhealthy, applications
on that node automatically move to a healthy node and the bad node stops receiving requests. Regardless of the
number of containers hosting an application, Service Fabric ensures that the application is healthy and running.
For an application that is infrequently used and can be offline, run it in the cluster with just one container instance
(such as application B and C). Service Fabric makes sure that the application is up and healthy during upgrades or
when the container needs to move to a new VM. Health checking can reduce cost compared to hosting that
application on two redundant and overprovisioned VMs in the traditional IaaS model.

Container networking and constraints


Use the Open mode for hosting containerized web applications in the cluster. After deployment, the application is
immediately discoverable through the Service Fabric DNS service. The DNS service is a name-value lookup
between a configured service name and the resultant IP address of a container hosting the application.
To route web requests to an application, use an ingress reverse proxy. If application containers listen on different
ports (AppA port 8080, AppB on 8081), the default host NAT bridge works without issues. Azure Load Balancer
probes route the traffic appropriately. However, if you want incoming traffic over SSL/443 routed to one port for all
applications, use a reverse proxy to route traffic appropriately.
Reverse proxy for inbound traffic
Service Fabric has a built-in reverse proxy, but it is limited in its feature set. Therefore, deploy a different reverse
proxy. An option is the IIS Application Request Routing (ARR) extension for IIS hosted web applications. The ARR
can be deployed to a container and configured to take inbound requests and route them to the appropriate
application container. In this example, the ARR uses a NAT bridge over port 80/443, accepts all inbound web traffic,
inspects the traffic, looks up the destination container using Service Fabric DNS service, and rewrites the request to
the destination container. The traffic can be secured with SSL to the destination container. Follow the IIS Application
Request Routing sample for building an ARR reverse proxy. For information, see Using the Application Request
Routing Module.
Here is the network flow for the example infrastructure.
The key aspect of the ingress reverse proxy is inspecting inbound traffic and rewriting that traffic to the destination
container.
For example, application A is registered with the Service Fabric DNS service with the domain name:
appA.container.myorg.com. External users access the application with https://appA.myorg.com . Use public or
organizational DNS and register appA.myorg.com to point to the public IP for the application node type.
1. Requests for appA.myorg.com are routed to the Service Fabric cluster and handed off to the ARR container
listening on port 443. Service Fabric and Azure Load Balancer set that configuration value when the ARR
container is deployed.
2. When ARR gets the request, it has a condition to look for any request with the pattern='..*', and its action
rewrites the request to https://{C:1}.container.{C:2}.{C:3}/{REQUEST_URI}. Because the ARR is running in the
cluster, the Service Fabric DNS service is invoked. The service returns the destination container IP address.
3. The request is routed to the destination container. Certificates can be used for the initial request to ARR and the
rewrite to the destination container.
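For illustration, here is a minimal sketch of a URL Rewrite rule that an ARR container might use to implement that condition and action. The rule name and host pattern are assumptions, and the fragment assumes the IIS URL Rewrite module is installed and ARR proxy functionality is enabled; it is not taken from the sample.

<!-- Illustrative web.config fragment for the ARR reverse proxy container (assumptions, not the sample's config). -->
<system.webServer>
  <rewrite>
    <rules>
      <rule name="RouteToContainer" stopProcessing="true">
        <!-- Match any request, then capture the host header parts.
             For example, appA.myorg.com gives C:1=appA, C:2=myorg, C:3=com. -->
        <match url=".*" />
        <conditions>
          <add input="{HTTP_HOST}" pattern="^([^.]+)\.([^.]+)\.([^.]+)$" />
        </conditions>
        <!-- {REQUEST_URI} already contains the leading slash and any query string. -->
        <action type="Rewrite" url="https://{C:1}.container.{C:2}.{C:3}{REQUEST_URI}" />
      </rule>
    </rules>
  </rewrite>
</system.webServer>

When the rewritten host name is resolved inside the cluster, the Service Fabric DNS service returns the destination container's IP address, as described in step 2.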
Here is an example ApplicationManifest.xml for Container A in the example infrastructure.
<?xml version="1.0" encoding="utf-8"?>
<ApplicationManifest ApplicationTypeName="sfapp02Type"
                     ApplicationTypeVersion="1.0.0"
                     xmlns="http://schemas.microsoft.com/2011/01/fabric"
                     xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Parameters>
    <Parameter Name="appAsvc_InstanceCount" DefaultValue="2" />
  </Parameters>
  <ServiceManifestImport>
    <ServiceManifestRef ServiceManifestName="appAsvcPkg" ServiceManifestVersion="1.0.0" />
    <ConfigOverrides />
    <Policies>
      <ContainerHostPolicies CodePackageRef="Code" Isolation="default">
        <!--SecurityOption Value="credentialspec=file://container_gmsa1.json"/-->
        <RepositoryCredentials AccountName="myacr" Password=" " PasswordEncrypted="false"/>
        <NetworkConfig NetworkType="Open"/>
      </ContainerHostPolicies>
      <ServicePackageResourceGovernancePolicy CpuCores="1" MemoryInMB="1024" />
    </Policies>
  </ServiceManifestImport>
  <DefaultServices>
    <Service Name="appAsvc" ServicePackageActivationMode="ExclusiveProcess"
             ServiceDnsName="appA.container.myorg.com">
      <StatelessService ServiceTypeName="appAsvcType" InstanceCount="[appAsvc_InstanceCount]">
        <SingletonPartition />
        <PlacementConstraints>(NodeTypeName==applicationNT)</PlacementConstraints>
      </StatelessService>
    </Service>
  </DefaultServices>
</ApplicationManifest>

Uses the Open mode for the containers.


Registers the application domain name appA.container.myorg.com with the Service Fabric DNS service.
Optionally configures the container to use an Active Directory gMSA (commented).
Uses placement constraints to deploy the container to the application node type, named applicationNT. It
instructs Service Fabric to run the container in the correct node type in the secured network subnet.
Optionally applies resource constraints. Each container is resource governed to use 1 vCPU and 1 GB of
memory on the VM host. Setting a resource governance policy is recommended because Service Fabric uses the
policy to distribute containers across the cluster, instead of the default even distribution of containers.
Here is the example ServiceManifest.xml for the containerized application appA.
<?xml version="1.0" encoding="utf-8"?>
<ServiceManifest Name="appAsvcPkg"
                 Version="1.0.0"
                 xmlns="http://schemas.microsoft.com/2011/01/fabric"
                 xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <ServiceTypes>
    <StatelessServiceType ServiceTypeName="appAsvcType" UseImplicitHost="true" />
  </ServiceTypes>
  <CodePackage Name="Code" Version="1.0.0">
    <EntryPoint>
      <ContainerHost>
        <ImageName>myacr.azurecr.io/appa:1.0</ImageName>
      </ContainerHost>
    </EntryPoint>
  </CodePackage>
  <ConfigPackage Name="Config" Version="1.0.0" />
  <Resources>
    <Endpoints>
      <Endpoint Name="appAsvcTypeEndpoint1" UriScheme="http" Port="80" Protocol="http" CodePackageRef="Code"/>
      <Endpoint Name="appAsvcTypeEndpoint2" UriScheme="https" Port="443" Protocol="http" CodePackageRef="Code"/>
    </Endpoints>
  </Resources>
</ServiceManifest>

The application is configured to listen on ports 80 and 443. By using the Open mode and reverse proxy, all
applications can share the same ports.
The application is containerized to an image named: myacr.azurecr.io/appa:1.0. Service Fabric invokes the Docker
daemon to pull down the image when the application is deployed. Service Fabric handles all interactions with
Docker.
The reverse proxy container uses similar manifests but isn't configured to use the Open mode. You can update
containers by versioning the Docker image, then redeploying the versioned Service Fabric package with the
Start-ServiceFabricApplicationUpgrade cmdlet.
For information about manifests, see Service Fabric application and service manifests.
Environmental configuration
Do not hardcode configuration values in the container image. Instead, use environment variables to pass values to a
container. A DevOps pipeline can build a Docker image, test it in a test environment, promote it to staging (or pre-
production), and then promote it to production. Do not rebuild a Docker image for each environment.
Docker can pass environment variables directly to a Docker container when running a container. In this example,
Docker passes the eShopTitle variable to the eshopweb container:

docker run -p 80:80 -d --name eshoptest -e eShopTitle=SomeName eshopweb:1.0

In a Service Fabric cluster, Service Fabric controls Docker execution and lists environment variables in the
ServiceManifest. Those variables are passed automatically when Service Fabric runs the container. You can override
the variables in ApplicationManifest.xml by using the EnvironmentOverrides element, which can be parameterized
and built from Visual Studio publish profiles for each environment.
For more information, see How to specify environment variables for services in Service Fabric.
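As a minimal sketch (the package, image, and parameter names here are illustrative, not taken from the eShop sample), the eShopTitle variable might be declared in the service manifest and overridden in the application manifest like this:

<!-- ServiceManifest.xml (illustrative): declare the variable on the code package. -->
<CodePackage Name="Code" Version="1.0.0">
  <EntryPoint>
    <ContainerHost>
      <ImageName>myacr.azurecr.io/eshopweb:1.0</ImageName>
    </ContainerHost>
  </EntryPoint>
  <EnvironmentVariables>
    <EnvironmentVariable Name="eShopTitle" Value="" />
  </EnvironmentVariables>
</CodePackage>

<!-- ApplicationManifest.xml (illustrative): override the variable per environment,
     here from an application parameter that a publish profile can set. -->
<ServiceManifestImport>
  <ServiceManifestRef ServiceManifestName="eshopwebPkg" ServiceManifestVersion="1.0.0" />
  <EnvironmentOverrides CodePackageRef="Code">
    <EnvironmentVariable Name="eShopTitle" Value="[eShopTitle]" />
  </EnvironmentOverrides>
</ServiceManifestImport>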

Security considerations
Here are some articles about container security:
Azure Service Fabric security
Service Fabric application and service security
Set up an encryption certificate and encrypt secrets on Windows clusters

Logging and monitoring


Monitoring and logging are critical to operational success and are achieved through integration with Azure Monitor
and Log Analytics.
Monitor the Service Fabric cluster and each running container by using the scale set extension agent for Log
Analytics and its associated Container Monitoring Solution. Make sure that you configure and install the extension
agent and solution during cluster creation. Docker statistics for container CPU and memory utilization are sent to
Log Analytics and can be queried for proactive monitoring and alerting. Set up proactive alerts through Azure
Monitor. Here are metrics that you should monitor.
High CPU and memory utilization of a container.
Container count hitting a per VM threshold.
HTTP error codes and count.
Docker Enterprise Edition (Docker EE) up or down on the host VM.
Response time of a service.
Host VM-based alerts on CPU, memory, disk, file consumption.
The following Service Fabric virtual machine scale set extensions are installed on a typical node and produce the logging data.

ServiceFabricNode. Links the node to a storage account (support log) for tracing support. This log is used when
a ticket is opened.
IaaSDiagnostics. Collects platform events, such as ServiceFabricSystemEventTable, and stores that data in a blob
storage account (app log). The account is consumed in Log Analytics.
MicrosoftMonitoringAgent. Contains all the performance data such as Docker statistics. The data (such as
ContainerInventory and ContainerLog) is sent to Log Analytics.
Application log
If your containerized application runs in a shared cluster, you can get logs such as IIS and custom logs from the
container into Log Analytics. This option is recommended because of speed, scalability, and the ability to handle
large amounts of unstructured data.
Set up log rotation through Docker to keep the logs size manageable. For more information, see Rotating Docker
Logs - Keeping your overlay folder small.
Here are two approaches for getting application logs into Log Analytics.
Use the existing Container Monitoring Solution installed in Log Analytics. The solution automatically sends
data from the container log directories on the VM (C:\ProgramData\docker\windowsfilter*) to the
configured Log Analytics workspace. Each container creates a directory underneath the \WindowsFilter path,
and the contents are streamed to Log Analytics from MicrosoftMonitoringAgent on the VM. This way you
can send application logs to a shared directory (or directories) in the container and relay the logs to
the monitored container log folders by using Docker logs.
1. Write a process script that runs in the container periodically and analyzes log files.
2. Monitor the log file changes in the shared folder and write the log changes to the command window
where Docker Logs can capture the information outside the container.
Each container records any output sent to its command line. Access that output outside the container with the
following command, which the Container Monitoring Solution runs automatically.

docker logs <ContainerID>

Move Docker to an attached VM data disk with enough storage to make sure the OS drive doesn't fill up with
container logs.
The automatic Container Monitoring Solution sends all logs to a single Log Analytics workspace. Different
containerized applications running on the same host send application logs to that shared workspace. If you
need to isolate logs such that each containerized application sends the log to a separate workspace, supply a
custom solution. That content is outside the scope of this article.
Mount external storage to each running container by using a file management service such as Azure Files.
The container logs are sent to the external storage location and don't take up disk space on the host VM.
You don't need a data disk attached to each VM to hold Docker logs, or to move Docker Enterprise to the data
disk.
Create a job to monitor the Azure Files location and send logs to the appropriate Log Analytics workspace
for each installed application. The job doesn't need to run in the container. It just observes the Azure Files
location.

DevOps and CI/CD


Application containerization ensures consistency. It makes sure all Service Fabric-hosted applications use the latest
approved corporate image and provides an automatable image updating process that is consistent through
DevOps. Azure Pipelines provides the automation process. For more information, see Tutorial: Deploy an
application with CI/CD to a Service Fabric cluster.
An enterprise may want to control the base container images in a centralized registry. The preceding workflow
shows one image registry. There could be multiple registries that are used to share central IT-built enterprise
images with application teams. One way to centralize control is for the central IT registry to allow application
teams with read-only access to the enterprise base image repository. Application teams each have their own
container registry with their Docker files and build off the central IT base image repository.
There are various third-party image scanning tools that can plug into this process on Docker push/pulls from
the Azure Container Registry. Those solutions are available in Azure Marketplace and referenced in the Azure
portal Container Registry blade. For example, Aqua and TwistLock. After the source code is pushed to a git-
based repository, set up CI/CD by creating an Azure DevOps build definition, selecting the source repository,
and choosing the Azure Service Fabric Application and Docker Support template.

The template sets up the build process and tasks for CI/CD by building and containerizing the application, pushing
the container image to a Docker registry (Azure Container Registry is the default), and deploying the Service Fabric
application with the containerized services to the cluster. Each application code change creates a version of the code
and an updated containerized image. Service Fabric's rolling upgrade feature deploys service upgrades gracefully.
Here is an example of a build starting the full DevOps process on an Azure-provided hosted build agent. Some
enterprises may require the build agents to run internally, within their corporate network or a private Azure virtual
network. Set up a Windows build agent VM and instruct Azure DevOps to use the private VM for building and
deploying code. For information about using custom build agents, see Self-hosted Windows agents.
Conclusion
Here is the summary of best practices:
Before containerizing existing applications, select applications that are suitable for this migration, choose the
right developer workstation, and determine network requirements.
Do not run your application container on the primary node type. Instead, configure the cluster with two or more
node types and run the application tier containers on a non-primary node type. Use placement constraints that
target the non-primary node type to reserve the primary node type for system services.
Use the Open networking mode.
Use an ingress reverse proxy, such as the IIS Application Request Routing. The reverse proxy inspects inbound
traffic and rewrites the traffic to the destination container.
Do not hardcode configuration values in the container image. Instead, use environment variables to pass values to a
container.
Monitor the application, platform events, and infrastructure metrics by using the IaaSDiagnostics and
MicrosoftMonitoringAgent extensions. View the logs in Application Insights and Log Analytics.
Use the latest approved corporate image and provide an automatable image updating process that is consistent
through DevOps.

Next steps
Get the latest version of the tools you need for containerizing, such as Visual Studio and Docker for Windows.
Customize these templates to meet your requirements. Sample: Modernization templates and scripts.
Migrate an Azure Cloud Services application to
Azure Service Fabric
11/2/2020 • 18 minutes to read

Sample code
This article describes migrating an application from Azure Cloud Services to Azure Service Fabric. It focuses on
architectural decisions and recommended practices.
For this project, we started with a Cloud Services application called Surveys and ported it to Service Fabric. The
goal was to migrate the application with as few changes as possible. Later in the article, we show how to optimize
the application for Service Fabric.
Before reading this article, it will be useful to understand the basics of Service Fabric. See Overview of Azure
Service Fabric.

About the Surveys application


A fictional company named Tailspin created an application called the Surveys app that allows customers to create
surveys. After a customer signs up for the application, users can create and publish surveys, and collect the results
for analysis. The application includes a public website where people can take a survey. Read more about the
original Tailspin scenario here.
Now Tailspin wants to move the Surveys application to a microservices architecture, using Service Fabric running
on Azure. Because the application is already deployed as a Cloud Services application, Tailspin adopts a multi-phase
approach:
1. Port the cloud services to Service Fabric, while minimizing changes to the application.
2. Optimize the application for Service Fabric, by moving to a microservices architecture.
In a real-world project, it's likely that both stages would overlap. While porting to Service Fabric, you would also
start to rearchitect the application into micro-services. Later you might refine the architecture further, perhaps
dividing coarse-grained services into smaller services.
The application code is available on GitHub. This repo contains both the Cloud Services application and the Service
Fabric version.

Why Service Fabric?


Service Fabric is a good fit for this project, because most of the features needed in a distributed system are built
into Service Fabric, including:
Cluster management. Service Fabric automatically handles node failover, health monitoring, and other cluster
management functions.
Horizontal scaling. When you add nodes to a Service Fabric cluster, the application automatically scales, as
services are distributed across the new nodes.
Service discovery. Service Fabric provides a discovery service that can resolve the endpoint for a named
service.
Stateless and stateful services. Stateful services use reliable collections, which can take the place of a cache
or queue, and can be partitioned.
Application lifecycle management. Services can be upgraded independently and without application
downtime.
Service orchestration across a cluster of machines.
Higher density for optimizing resource consumption. A single node can host multiple services.
Service Fabric is used by various Microsoft services, including Azure SQL Database, Cosmos DB, Azure Event Hubs,
and others, making it a proven platform for building distributed cloud applications.

Comparing Cloud Services with Service Fabric


The following table summarizes some of the important differences between Cloud Services and Service Fabric
applications. For a more in-depth discussion, see Learn about the differences between Cloud Services and Service
Fabric before migrating applications.

AREA | CLOUD SERVICES | SERVICE FABRIC
Application composition | Roles | Services
Density | One role instance per VM | Multiple services in a single node
Minimum number of nodes | 2 per role | 5 per cluster, for production deployments
State management | Stateless | Stateless or stateful*
Hosting | Azure | Cloud or on-premises
Web hosting | IIS** | Self-hosting
Deployment model | Classic deployment model | Resource Manager
Packaging | Cloud service package files (.cspkg) | Application and service packages
Application update | VIP swap or rolling update | Rolling update
Autoscaling | Built-in service | Virtual machine scale sets for auto scale out
Debugging | Local emulator | Local cluster

* Stateful services use reliable collections to store state across replicas, so that all reads are local to the nodes in the
cluster. Writes are replicated across nodes for reliability. Stateless services can have external state, using a database
or other external storage.
** Worker roles can also self-host ASP.NET Web API using OWIN.

The Surveys application on Cloud Services


The following diagram shows the architecture of the Surveys application running on Cloud Services.
The application consists of two web roles and a worker role.
The Tailspin.Web web role hosts an ASP.NET website that Tailspin customers use to create and manage
surveys. Customers also use this website to sign up for the application and manage their subscriptions.
Finally, Tailspin administrators can use it to see the list of tenants and manage tenant data.
The Tailspin.Web.Survey.Public web role hosts an ASP.NET website where people can take the surveys
that Tailspin customers publish.
The Tailspin.Workers.Survey worker role does background processing. The web roles put work items
onto a queue, and the worker role processes the items. Two background tasks are defined: Exporting survey
answers to Azure SQL Database, and calculating statistics for survey answers.
In addition to Cloud Services, the Surveys application uses some other Azure services:
Azure Storage to store surveys, surveys answers, and tenant information.
Azure Cache for Redis to cache some of the data that is stored in Azure Storage, for faster read access.
Azure Active Directory (Azure AD) to authenticate customers and Tailspin administrators.
Azure SQL Database to store the survey answers for analysis.

Moving to Service Fabric


As mentioned, the goal of this phase was migrating to Service Fabric with the minimum necessary changes. To that
end, we created stateless services corresponding to each cloud service role in the original application:
Intentionally, this architecture is similar to the original application. However, the diagram hides some important
differences. In the rest of this article, we'll explore those differences.

Converting the cloud service roles to services


For the initial migration, Tailspin followed the steps outlined in Guide to converting Web and Worker Roles to
Service Fabric stateless services.
Creating the web front-end services
In Service Fabric, a service runs inside a process created by the Service Fabric runtime. For a web front end, that
means the service is not running inside IIS. Instead, the service must host a web server. This approach is called self-
hosting, because the code that runs inside the process acts as the web server host.
The original Surveys application uses ASP.NET MVC. Because ASP.NET MVC cannot be self-hosted in Service Fabric,
Tailspin considered the following migration options:
Port the web roles to ASP.NET Core, which can be self-hosted.
Convert the web site into a single-page application (SPA) that calls a web API implemented using ASP.NET Web
API. This would have required a complete redesign of the web front end.
Keep the existing ASP.NET MVC code and deploy IIS in a Windows Server container to Service Fabric. This
approach would require little or no code change.
The first option, porting to ASP.NET Core, allowed Tailspin to take advantage of the latest features in ASP.NET Core.
To do the conversion, Tailspin followed the steps described in Migrating From ASP.NET MVC to ASP.NET Core MVC.

NOTE
When using ASP.NET Core with Kestrel, you should place a reverse proxy in front of Kestrel to handle traffic from the Internet,
for security reasons. For more information, see Kestrel web server implementation in ASP.NET Core. The section Deploying
the application describes a recommended Azure deployment.

HTTP listeners
In Cloud Services, a web or worker role exposes an HTTP endpoint by declaring it in the service definition file. A
web role must have at least one endpoint.
<!-- Cloud service endpoint -->
<Endpoints>
<InputEndpoint name="HttpIn" protocol="http" port="80" />
</Endpoints>

Similarly, Service Fabric endpoints are declared in a service manifest:

<!-- Service Fabric endpoint -->


<Endpoints>
<Endpoint Protocol="http" Name="ServiceEndpoint" Type="Input" Port="8002" />
</Endpoints>

Unlike a cloud service role, Service Fabric services can be colocated within the same node. Therefore, every service
must listen on a distinct port. Later in this article, we'll discuss how client requests on port 80 or port 443 get
routed to the correct port for the service.
A service must explicitly create listeners for each endpoint. The reason is that Service Fabric is agnostic about
communication stacks. For more information, see Build a web service front end for your application using ASP.NET
Core.

Packaging and configuration


A cloud service contains the following configuration and package files:

FILE | DESCRIPTION
Service definition (.csdef) | Settings used by Azure to configure the cloud service. Defines the roles, endpoints, startup tasks, and the names of configuration settings.
Service configuration (.cscfg) | Per-deployment settings, including the number of role instances, endpoint port numbers, and the values of configuration settings.
Service package (.cspkg) | Contains the application code and configurations, and the service definition file.

There is one .csdef file for the entire application. You can have multiple .cscfg files for different environments, such
as local, test, or production. When the service is running, you can update the .cscfg but not the .csdef. For more
information, see What is the Cloud Service model and how do I package it?
Service Fabric has a similar division between a service definition and service settings, but the structure is more
granular. To understand Service Fabric's configuration model, it helps to understand how a Service Fabric
application is packaged. Here is the structure:

Application package
- Service packages
- Code package
- Configuration package
- Data package (optional)

The application package is what you deploy. It contains one or more service packages. A service package contains
code, configuration, and data packages. The code package contains the binaries for the services, and the
configuration package contains configuration settings. This model allows you to upgrade individual services
without redeploying the entire application. It also lets you update just the configuration settings, without
redeploying the code or restarting the service.
A Service Fabric application contains the following configuration files:

FILE | LOCATION | DESCRIPTION
ApplicationManifest.xml | Application package | Defines the services that compose the application.
ServiceManifest.xml | Service package | Describes one or more services.
Settings.xml | Configuration package | Contains configuration settings for the services defined in the service package.

For more information, see Model an application in Service Fabric.


To support different configuration settings for multiple environments, use the following approach, described in
Manage application parameters for multiple environments:
1. Define the setting in the Setting.xml file for the service.
2. In the application manifest, define an override for the setting.
3. Put environment-specific settings into application parameter files, as shown in the sketch below.
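Here is a minimal sketch of those three pieces. The section, parameter, and package names are illustrative, not taken from the Surveys code, and the application parameter must also be declared in the application manifest's Parameters element.

<!-- 1. Settings.xml in the service's configuration package (illustrative names). -->
<Settings xmlns="http://schemas.microsoft.com/2011/01/fabric">
  <Section Name="SurveyStorage">
    <Parameter Name="StorageAccountName" Value="" />
  </Section>
</Settings>

<!-- 2. Override in ApplicationManifest.xml, bound to an application parameter. -->
<ServiceManifestImport>
  <ServiceManifestRef ServiceManifestName="TailspinWebPkg" ServiceManifestVersion="1.0.0" />
  <ConfigOverrides>
    <ConfigOverride Name="Config">
      <Settings>
        <Section Name="SurveyStorage">
          <Parameter Name="StorageAccountName" Value="[StorageAccountName]" />
        </Section>
      </Settings>
    </ConfigOverride>
  </ConfigOverrides>
</ServiceManifestImport>

<!-- 3. Environment-specific application parameter file, for example Cloud.xml. -->
<Application xmlns="http://schemas.microsoft.com/2011/01/fabric" Name="fabric:/Surveys">
  <Parameters>
    <Parameter Name="StorageAccountName" Value="surveysprodstorage" />
  </Parameters>
</Application>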

Deploying the application


Whereas Azure Cloud Services is a managed service, Service Fabric is a runtime. You can create Service Fabric
clusters in many environments, including Azure and on premises. The following diagram shows a recommended
deployment for Azure:

The Service Fabric cluster is deployed to a virtual machine scale set. Scale sets are an Azure Compute resource that
can be used to deploy and manage a set of identical VMs.
As mentioned, it's recommended to place the Kestrel web server behind a reverse proxy for security reasons. This
diagram shows Azure Application Gateway, which is an Azure service that offers various layer 7 load-balancing
capabilities. It acts as a reverse-proxy service, terminating the client connection and forwarding requests to back-
end endpoints. You might use a different reverse proxy solution, such as nginx.
Layer 7 routing
In the original Surveys application, one web role listened on port 80, and the other web role listened on port 443.
PUBLIC SITE | SURVEY MANAGEMENT SITE
http://tailspin.cloudapp.net | https://tailspin.cloudapp.net

Another option is to use layer 7 routing. In this approach, different URL paths get routed to different port numbers
on the back end. For example, the public site might use URL paths starting with /public/ .
Options for layer 7 routing include:
Use Application Gateway.
Use a network virtual appliance (NVA), such as nginx.
Write a custom gateway as a stateless service.
Consider this approach if you have two or more services with public HTTP endpoints, but want them to appear as
one site with a single domain name.

One approach that we don't recommend is allowing external clients to send requests through the Service
Fabric reverse proxy. Although this is possible, the reverse proxy is intended for service-to-service
communication. Opening it to external clients exposes any service running in the cluster that has an HTTP
endpoint.

Node types and placement constraints


In the deployment shown above, all the services run on all the nodes. However, you can also group services, so that
certain services run only on particular nodes within the cluster. Reasons to use this approach include:
Run some services on different VM types. For example, some services might be compute-intensive or require
GPUs. You can have a mix of VM types in your Service Fabric cluster.
Isolate front-end services from back-end services, for security reasons. All the front-end services will run on one
set of nodes, and the back-end services will run on different nodes in the same cluster.
Different scale requirements. Some services might need to run on more nodes than other services. For example,
if you define front-end nodes and back-end nodes, each set can be scaled independently.
The following diagram shows a cluster that separates front-end and back-end services:

To implement this approach:


1. When you create the cluster, define two or more node types.
2. For each service, use placement constraints to assign the service to a node type.
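For example, the placement constraint in step 2 might look like the following fragment in the application manifest's default services. The service name, service type, and node type name are illustrative, not taken from the Surveys code.

<!-- Illustrative fragment from ApplicationManifest.xml: pin a front-end service to the "frontend" node type. -->
<DefaultServices>
  <Service Name="TailspinWeb">
    <StatelessService ServiceTypeName="TailspinWebType" InstanceCount="-1">
      <SingletonPartition />
      <PlacementConstraints>(NodeTypeName==frontend)</PlacementConstraints>
    </StatelessService>
  </Service>
</DefaultServices>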
When you deploy to Azure, each node type is deployed to a separate virtual machine scale set. The Service Fabric
cluster spans all node types. For more information, see The relationship between Service Fabric node types and
virtual machine scale sets.

If a cluster has multiple node types, one node type is designated as the primary node type. Service Fabric
runtime services, such as the Cluster Management Service, run on the primary node type. Provision at least 5
nodes for the primary node type in a production environment. The other node type should have at least 2
nodes.

Configuring and managing the cluster


Clusters must be secured to prevent unauthorized users from connecting to your cluster. It is recommended to use
Azure AD to authenticate clients, and X.509 certificates for node-to-node security. For more information, see
Service Fabric cluster security scenarios.
To configure a public HTTPS endpoint, see Specify resources in a service manifest.
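As a minimal sketch (the endpoint, package, and certificate names and the thumbprint are placeholders), an HTTPS endpoint is declared in the service manifest and bound to a certificate in the application manifest:

<!-- ServiceManifest.xml (illustrative): declare an HTTPS input endpoint. -->
<Resources>
  <Endpoints>
    <Endpoint Name="ServiceEndpoint" Protocol="https" Type="Input" Port="443" />
  </Endpoints>
</Resources>

<!-- ApplicationManifest.xml (illustrative): bind the endpoint to a certificate by thumbprint. -->
<ServiceManifestImport>
  <ServiceManifestRef ServiceManifestName="TailspinWebPkg" ServiceManifestVersion="1.0.0" />
  <Policies>
    <EndpointBindingPolicy EndpointRef="ServiceEndpoint" CertificateRef="TailspinSslCert" />
  </Policies>
</ServiceManifestImport>
<Certificates>
  <!-- Placeholder thumbprint; replace with the real certificate thumbprint. -->
  <EndpointCertificate Name="TailspinSslCert" X509FindValue="0000000000000000000000000000000000000000" />
</Certificates>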
You can scale out the application by adding VMs to the cluster. Virtual machine scale sets support autoscaling using
autoscale rules based on performance counters. For more information, see Scale a Service Fabric cluster in or out
using autoscale rules.
While the cluster is running, collect logs from all the nodes in a central location. For more information, see Collect
logs by using Azure Diagnostics.

Refactor the application


After the application is ported to Service Fabric, the next step is to refactor it to a more granular architecture.
Tailspin's motivation for refactoring is to make it easier to develop, build, and deploy the Surveys application. By
decomposing the existing web and worker roles to a more granular architecture, Tailspin wants to remove the
existing tightly coupled communication and data dependencies between these roles.
Tailspin sees other benefits in moving the Surveys application to a more granular architecture:
Each service can be packaged into independent projects with a scope small enough to be managed by a small
team.
Each service can be independently versioned and deployed.
Each service can be implemented using the best technology for that service. For example, a Service Fabric cluster
can include services built using different versions of the .NET Framework, Java, or other languages such as C or
C++.
Each service can be independently scaled to respond to increases and decreases in load.
The source code for the refactored version of the app is available on GitHub.

Design considerations
The following diagram shows the architecture of the Surveys application refactored to a more granular
architecture:
Tailspin.Web is a stateless service self-hosting an ASP.NET MVC application that Tailspin customers visit to create
surveys and view survey results. This service shares most of its code with the Tailspin.Web service from the ported
Service Fabric application. As mentioned earlier, this service uses ASP.NET Core and switches from using Kestrel as
the web front end to implementing a WebListener.
Tailspin.Web.Survey.Public is a stateless service also self-hosting an ASP.NET MVC site. Users visit this site to
select surveys from a list and then fill them out. This service shares most of its code with the
Tailspin.Web.Survey.Public service from the ported Service Fabric application. This service also uses ASP.NET Core
and switches from using Kestrel as the web front end to implementing a WebListener.
Tailspin.SurveyResponseService is a stateful service that stores survey answers in Azure Blob Storage. It also
merges answers into the survey analysis data. The service is implemented as a stateful service because it uses a
ReliableConcurrentQueue to process survey answers in batches. This functionality was originally implemented in
the Tailspin.AnswerAnalysisService service in the ported Service Fabric application.
Tailspin.SurveyManagementService is a stateless service that stores and retrieves surveys and survey
questions. The service uses Azure Blob storage. This functionality was also originally implemented in the data
access components of the Tailspin.Web and Tailspin.Web.Survey.Public services in the ported Service Fabric
application. Tailspin refactored the original functionality into this service to allow it to scale independently.
Tailspin.SurveyAnswerService is a stateless service that retrieves survey answers and survey analysis. The
service also uses Azure Blob storage. This functionality was also originally implemented in the data access
components of the Tailspin.Web service in the ported Service Fabric application. Tailspin refactored the original
functionality into this service because it expects less load and wants to use fewer instances to conserve resources.
Tailspin.SurveyAnalysisService is a stateless service that persists survey answer summary data in a Redis cache
for quick retrieval. This service is called by the Tailspin.SurveyResponseService each time a survey is answered and
the new survey answer data is merged in the summary data. This service includes the functionality originally
implemented in the Tailspin.AnswerAnalysisService service from the ported Service Fabric application.

Stateless versus stateful services


Azure Service Fabric supports the following programming models:
The guest executable model allows any executable to be packaged as a service and deployed to a Service Fabric
cluster. Service Fabric orchestrates and manages execution of the guest executable.
The container model allows for deployment of services in container images. Service Fabric supports creation
and management of containers on top of Linux kernel containers as well as Windows Server containers.
The reliable services programming model allows for the creation of stateless or stateful services that integrate
with all Service Fabric platform features. Stateful services allow for replicated state to be stored in the Service
Fabric cluster. Stateless services do not.
The reliable actors programming model allows for the creation of services that implement the virtual actor
pattern.
All the services in the Surveys application are stateless reliable services, except for the
Tailspin.SurveyResponseService service. This service implements a ReliableConcurrentQueue to process survey
answers when they are received. Responses in the ReliableConcurrentQueue are saved into Azure Blob Storage and
passed to the Tailspin.SurveyAnalysisService for analysis. Tailspin chooses a ReliableConcurrentQueue because
responses do not require strict first-in-first-out (FIFO) ordering provided by a queue such as Azure Service Bus. A
ReliableConcurrentQueue is also designed to deliver high throughput and low latency for queue and dequeue
operations.
Operations to persist dequeued items from a ReliableConcurrentQueue should ideally be idempotent. If an
exception is thrown during the processing of an item from the queue, the same item may be processed more than
once. In the Surveys application, the operation to merge survey answers to the Tailspin.SurveyAnalysisService is
not idempotent because Tailspin decided that the survey analysis data is only a current snapshot of the analysis
data and does not need to be consistent. The survey answers saved to Azure Blob Storage are eventually consistent,
so the survey final analysis can always be recalculated correctly from this data.

Communication framework
Each service in the Surveys application communicates using a RESTful web API. RESTful APIs offer the following
benefits:
Ease of use: each service is built using ASP.NET Core MVC, which natively supports the creation of Web APIs.
Security: While each service does not require SSL, Tailspin could require each service to do so.
Versioning: clients can be written and tested against a specific version of a web API.
Services in the Survey application use the reverse proxy implemented by Service Fabric. Reverse proxy is a service
that runs on each node in the Service Fabric cluster and provides endpoint resolution, automatic retry, and handles
other types of connection failures. To use the reverse proxy, each RESTful API call to a specific service is made using
a predefined reverse proxy port. For example, if the reverse proxy port has been set to 19081, a call to the
Tailspin.SurveyAnswerService can be made as follows:

static SurveyAnswerService()
{
    httpClient = new HttpClient
    {
        BaseAddress = new Uri("http://localhost:19081/Tailspin/SurveyAnswerService/")
    };
}

To enable reverse proxy, specify a reverse proxy port during creation of the Service Fabric cluster. For more
information, see reverse proxy in Azure Service Fabric.

Performance considerations
Tailspin created the ASP.NET Core services for Tailspin.Web and Tailspin.Web.Surveys.Public using Visual Studio
templates. By default, these templates include logging to the console. Logging to the console may be done during
development and debugging, but all logging to the console should be removed when the application is deployed to
production.
NOTE
For more information about setting up monitoring and diagnostics for Service Fabric applications running in production, see
monitoring and diagnostics for Azure Service Fabric.

For example, the following lines in startup.cs for each of the web front end services should be commented out:

// This method gets called by the runtime. Use this method to configure the HTTP request pipeline.
public void Configure(IApplicationBuilder app, IHostingEnvironment env, ILoggerFactory loggerFactory)
{
    //loggerFactory.AddConsole(Configuration.GetSection("Logging"));
    //loggerFactory.AddDebug();

    app.UseMvc();
}

NOTE
These lines may be conditionally excluded when Visual Studio is set to release when publishing.

Finally, when Tailspin deploys the Tailspin application to production, they switch Visual Studio to release mode.

Deployment considerations
The refactored Surveys application is composed of five stateless services and one stateful service, so cluster
planning is limited to determining the correct VM size and number of nodes. In the ApplicationManifest.xml file that
describes the application, Tailspin sets the InstanceCount attribute of the StatelessService tag to -1 for each of the
services. A value of -1 directs Service Fabric to create an instance of the service on each node in the cluster.
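For example, a default service entry with InstanceCount set to -1 might look like the following sketch (the service and type names are illustrative, not taken from the Surveys code):

<!-- Illustrative default service entry: InstanceCount="-1" creates one instance per node. -->
<Service Name="SurveyManagementService">
  <StatelessService ServiceTypeName="SurveyManagementServiceType" InstanceCount="-1">
    <SingletonPartition />
  </StatelessService>
</Service>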

NOTE
Stateful services require the additional step of planning the correct number of partitions and replicas for their data.

Tailspin deploys the cluster using the Azure portal. The Service Fabric Cluster resource type deploys all of the
necessary infrastructure, including virtual machine scale sets and a load balancer. The recommended VM sizes are
displayed in the Azure portal during the provisioning process for the Service Fabric cluster. Because the VMs are
deployed in a virtual machine scale set, they can be both scaled up and out as user load increases.

Next steps
The Surveys application code is available on GitHub.
If you are just getting started with Azure Service Fabric, first set up your development environment then download
the latest Azure SDK and the Azure Service Fabric SDK. The SDK includes the OneBox cluster manager so you can
deploy and test the Surveys application locally with full F5 debugging.
Serverless Functions overview
11/2/2020 • 2 minutes to read

Serverless architecture evolves cloud platforms toward pure cloud-native code by abstracting code from the
infrastructure that it needs to run. Azure Functions is a serverless compute option that supports functions, small
pieces of code that do single things.
Benefits of using serverless architectures with Functions applications include:
The Azure infrastructure automatically provides all the updated servers that applications need to keep running
at scale.
Compute resources allocate dynamically, and instantly autoscale to meet elastic demands. Serverless doesn't
mean "no server," but "less server," because servers run only as needed.
Micro-billing saves costs by charging only for the compute resources and duration the code uses to execute.
Function bindings streamline integration by providing declarative access to a wide variety of Azure and third-
party services.
Functions are event-driven. An external event like an HTTP web request, message, schedule, or change in data
triggers the function code. A Functions application doesn't code the trigger, only the response to the trigger. With a
lower barrier to entry, developers can focus on business logic, rather than writing code to handle infrastructure
concerns like messaging.
Azure Functions is a managed service in Azure and Azure Stack. The open source Functions runtime works in many
environments, including Kubernetes, Azure IoT Edge, on-premises, and other clouds.
Serverless and Functions require new ways of thinking and new approaches to building applications. They aren't
the right solutions for every problem. For examples of serverless Functions scenarios, see Reference architectures.

Implementation steps
Successful implementation of serverless technologies with Azure Functions requires the following actions:
Decide and plan
Architects and technical decision makers (TDMs) perform application assessment, conduct or attend
technical workshops and trainings, run proof of concept (PoC) or pilot projects, and conduct architectural
designs sessions as necessary.
Develop and deploy apps
Developers implement serverless Functions app development patterns and practices, configure DevOps
pipelines, and employ site reliability engineering (SRE) best practices.
Manage operations
IT professionals identify hosting configurations, future-proof scalability by automating infrastructure
provisioning, and maintain availability by planning for business continuity and disaster recovery.
Secure apps
Security professionals handle Azure Functions security essentials, secure the hosting setup, and provide
application security guidance.
Related resources
To learn more about serverless technology, see the Azure serverless documentation.
To learn more about Azure Functions, see the Azure Functions documentation.
For help with choosing a compute technology, see Choose an Azure compute service for your application.
Serverless Functions reference architectures
11/2/2020 • 4 minutes to read

A reference architecture is a template of required components and the technical requirements to implement them.
A reference architecture isn't custom-built for a customer solution, but is a high-level scenario based on extensive
experience. Before designing a serverless solution, use a reference architecture to visualize an ideal technical
architecture, then blend and integrate it into your environment.

Common serverless architecture patterns


Common serverless architecture patterns include:
Serverless APIs, mobile and web backends.
Event and stream processing, Internet of Things (IoT) data processing, big data and machine learning pipelines.
Integration and enterprise service bus to connect line-of-business systems, publish and subscribe (Pub/Sub) to
business events.
Automation and digital transformation and process automation.
Middleware, software-as-a-Service (SaaS) like Dynamics, and big data projects.

Web application backends


Retail scenario: Pick up online orders from a queue,
process them, and store the resulting data in a database

Mobile application backends


Financial services scenario: Colleagues use mobile banking
to reimburse each other for lunch. Whoever paid for lunch
requests payment through a mobile app, triggering a
notification on colleagues' phones.

IoT-connected backends
Manufacturing scenario: A manufacturing company uses
IoT to monitor its machines. Functions detects anomalous
data and triggers a message to the service department
when repair is required.

Conversational bot processing


Hospitality scenario: Customers ask for available vacation
accommodations on their smartphones. A serverless bot
deciphers requests and returns vacation options.
Real-time file processing
Healthcare scenario: The solution securely uploads patient
records as PDF files. The solution then decomposes the
data, processes it using OCR detection, and adds it to a
database for easy queries.

Real-time stream processing


Independent software vendor (ISV) scenario: A massive
cloud app collects huge amounts of telemetry data. The
app processes that data in near real-time and stores it in
a database for use in an analytics dashboard.

Scheduled task automation


Financial services scenario: The app analyzes a customer
database for duplicate entries every 15 minutes, to avoid
sending out multiple communications to the same
customers.

Extending SaaS applications


Professional services scenario: A SaaS solution provides
extensibility through webhooks, which Functions can
implement to automate certain workflows.

Featured serverless reference architectures


The following featured serverless reference architectures walk through specific scenarios. See the linked articles
for architectural diagrams and details.
Serverless microservices
The serverless microservices reference architecture walks you through designing, developing, and delivering the
Rideshare application by Relecloud, a fictitious company. You get hands-on instructions for configuring and
deploying all the architectural components, with helpful information about each component.
Serverless web application and event processing with Azure Functions
This two-part solution describes a hypothetical drone delivery system. Drones send in-flight status to the cloud,
which stores these messages for later use. A web application allows users to retrieve the messages to get the latest
device status.
You can download the code for this solution from GitHub.
The article Code walkthrough: Serverless application with Azure Functions walks you through the code and the
design processes.
Event-based cloud automation
Automating workflows and repetitive tasks on the cloud can dramatically improve a DevOps team's productivity. A
serverless model is best suited for event-driven automation scenarios. This event-based automation reference
architecture illustrates two cloud automation scenarios: cost center tagging and throttling response.
Multicloud with Serverless Framework
The Serverless Framework architecture describes how the Microsoft Commercial Software Engineering (CSE) team
partnered with a global retailer to deploy a highly-available serverless solution across both Azure and Amazon
Web Services (AWS) cloud platforms, using the Serverless Framework.

More serverless Functions reference architectures


The following sections list other serverless and Azure Functions-related reference architectures and scenarios.
General
Serverless application architectures using Event Grid
Serverless apps using Cosmos DB
Serverless event processing using Azure Functions
Serverless web application on Azure
Serverless Asynchronous Multiplayer Reference Architecture
Instant Broadcasting on Serverless Architecture
Building a telehealth system on Azure
Custom Data Sovereignty & Data Gravity Requirements
Sharing location in real time using low-cost serverless Azure services
Web and mobile backend
An e-commerce front end
Architect scalable e-commerce web app
Improve scalability in an Azure web application
Uploading and CDN-preloading static content with Azure Functions
Cross Cloud Scaling Architecture
Social App for Mobile and Web with Authentication
AI + Machine Learning
Image classification for insurance claims
Personalized Offers
Personalized marketing solutions
Speech transcription with Azure Cognitive Services
Training a Model with AzureML and Azure Functions
Customer Reviews App with Cognitive Services
Enterprise-grade conversational bot
AI at the Edge
Mass ingestion and analysis of news feeds on Azure
HIPAA and HITRUST compliant health data AI
Intelligent Experiences On Containers (AKS, Functions, Keda)
Data and analytics
Application integration using Event Grid
Mass ingestion and analysis of news feeds
Tier Applications & Data for Analytics
Operational analysis and driving process efficiency
IoT
Azure IoT reference (SQL DB)
Azure IoT reference (Cosmos DB)
IoT using Cosmos DB
Facilities management powered by mixed reality and IoT
Complementary Code Pattern for Azure IoT Edge Modules & Cloud Applications
Gaming
Custom Game Server Scaling
Non-real Time Dashboard
In-editor Debugging Telemetry
Multiplayer Serverless Matchmaker
Advanced leaderboard for large scale
Relational Leaderboard
Content Moderation
Text Translation
Text to Speech
Gaming using Cosmos DB
Automation
Smart scaling for Azure Scale Set with Azure Functions
Plan for serverless Functions
11/2/2020 • 2 minutes to read

To plan for moving an application to a serverless Azure Functions architecture, a technical decision maker (TDM) or
architect:
Verifies the application's characteristics and business requirements.
Determines the application's suitability for serverless Azure Functions.
Transforms business requirements into functional and other requirements.
Planning activities also include assessing technical team readiness, providing or attending workshops and training,
and conducting architectural design reviews, proofs of concept, pilots, and technical implementations.
TDMs and architects may perform one or more of the following activities:
Execute an application assessment. Evaluate the main aspects of the application to determine how
complex and risky it is to rearchitect through application modernization, or rebuild a new cloud-native
application. See Application assessment.
Attend or promote technical workshops and training. Host a Serverless workshop or CloudHack, or
enjoy many other training and learning opportunities for serverless technologies, Azure Functions, app
modernization, and cloud-native apps. See Technical workshops and training.
Identify and execute a Proof of Concept (PoC) or pilot, or technical implementation. Deliver a
PoC, pilot, or technical implementation to provide evidence that serverless Azure Functions can solve a
team's business problems. Showing teams how to modernize or build new cloud-native applications to their
specifications can accelerate deployment to production. See PoC or pilot.
Conduct architectural design sessions. An architectural design session (ADS) is an in-depth discussion
on how a new solution will blend into the environment. ADSs validate business requirements and transform
them to functional and other requirements.

Next steps
For example scenarios that use serverless architectures with Azure Functions, see Serverless reference
architectures.
To move forward with serverless Azure Functions implementation, see the following resources:
Application development and deployment
Azure Functions app operations
Azure Functions app security
Application assessment
11/2/2020 • 6 minutes to read

Cloud rationalization is the process of evaluating applications to determine the best way to migrate or modernize
them for the cloud.
Rationalization methods include:
Rehost. Also known as a lift and shift migration, rehost moves a current application to the cloud with minimal
change.
Refactor. Slightly refactoring an application to fit platform-as-a-service (PaaS)-based options can reduce
operational costs.
Rearchitect. Rearchitect aging applications that aren't compatible with cloud components, or cloud-compatible
applications that would realize cost and operational efficiencies by rearchitecting into a cloud-native solution.
Rebuild. If the changes or costs to carry an application forward are too great, consider creating a new cloud-
native code base. Rebuild is especially appropriate for applications that previously met business needs, but are
now unsupported or misaligned with current business processes.
Before you decide on an appropriate strategy, analyze the current application to determine the risk and complexity
of each method. Consider application lifecycle, technology, infrastructure, performance, and operations and
monitoring. For multitier architectures, evaluate the presentation tier, service tier, integrations tier, and data tier.
The following checklists evaluate an application to determine the complexity and risk of rearchitecting or
rebuilding.

Complexity and risk


Each of the following factors adds to complexity, risk, or both.
Architecture
Define the high-level architecture, such as web application, web services, data storage, or caching.

FACTOR COMPLEXITY RISK
Application components don't translate directly to Azure. ✔ ✔
The application needs code changes to run in Azure. ✔ ✔
The application needs major, complex code changes to run in Azure. ✔ ✔

Business drivers
Older applications might require extensive changes to get to the cloud.

FACTOR COMPLEXITY RISK
This application has been around for more than three years. ✔
This application is business critical. ✔
There are technology blockers for migration. ✔
There are business blockers for migration. ✔
This application has compliance requirements. ✔
The application is subject to country-specific data requirements. ✔
The application is publicly accessible. ✔ ✔

Technology
FACTOR COMPLEXITY RISK
This is not a web-based application, and isn't hosted on a web server. ✔
The app isn't hosted in Windows IIS ✔
The app isn't hosted on Linux ✔
The application is hosted in a web farm, and requires multiple servers to host the web components. ✔ ✔
The application requires third-party software to be installed on the servers. ✔ ✔
The application is hosted in a single datacenter, and operations are performed in a single location. ✔
The application accesses the server's registry. ✔
The application sends emails, and needs access to an SMTP server. ✔
This isn't a .NET application. ✔
The application uses SQL Server as its data store. ✔
The application stores data on local disks, and needs access to the disks to operate properly. ✔
The application uses Windows Services to process asynchronous operations, or needs external services to process data or operations. ✔

Deployment
When assessing deployment requirements, consider:
Number of daily users
Average number of concurrent users
Expected traffic
Bandwidth in Gbps
Requests per second
Amount of memory needed
You can reduce deployment risk by storing code under source control in a version control system such as Git,
Azure DevOps Server, or SVN.

FACTOR COMPLEXITY RISK
Leveraging existing code and data is a #1 priority. ✔ ✔
The application code isn't under source control. ✔
There's no automated build process like Azure DevOps Server or Jenkins. ✔
There's no automated release process to deploy the application. ✔ ✔
The application has a Service Level Agreement (SLA) that dictates the amount of expected downtime. ✔
The application experiences peak or variable usage times or loads. ✔
The web application saves its session state in process, rather than an external data store. ✔

Operations
FACTOR COMPLEXITY RISK
The application doesn't have a well-established instrumentation strategy or standard instrumentation framework. ✔
The application doesn't use monitoring tools, and the operations team doesn't monitor the app's performance. ✔
The application has measured SLA in place, and the operations team monitors the application's performance. ✔
The application writes to a log store, event log, log file, log database, or Application Insights. ✔
The application doesn't write to a log store, event log, log file, log database, or Application Insights. ✔
The application isn't part of the organization's disaster recovery plan. ✔

Security
FACTOR COMPLEXITY RISK
The application uses Active Directory to authenticate users. ✔ ✔
The organization hasn't yet configured Azure Active Directory (Azure AD), or hasn't configured Azure AD Connect to synchronize on-premises AD with Azure AD. ✔
The application requires access to on-premises resources, which will require VPN connectivity from Azure. ✔
The organization hasn't yet configured a VPN connection between Azure and their on-premises environment. ✔ ✔
The application requires a SSL certificate to run. ✔ ✔

Results
Count your application's Complexity and Risk checkmarks.
The expected level of complexity to migrate or modernize the application to Azure is: Total Complexity/25.
The expected risk involved is: Total Risk/19.
For both complexity and risk, a score of <0.3 = low, <0.7 = medium, >0.7 = high.

Refactor, rearchitect, or rebuild


To rationalize whether to rehost, refactor, rearchitect, or rebuild your application, consider the following points.
Many of these factors also contribute to complexity and risk.
Determine whether the application components can translate directly to Azure. If so, you don't need code changes
to move the application to Azure, and could use rehost or refactor strategies. If not, you need to rewrite code, so
you need to rearchitect or rebuild.
If the app does need code changes, determine the complexity and extent of the needed changes. Minor changes
might allow for rearchitecting, while major changes may require rebuilding.
Rehost or refactor
If leveraging existing code and data is a top priority, consider a refactor strategy rather than rearchitecting
or rebuilding.
If you have pressing timelines like datacenter shutdown or contract expiration, end-of-life licensing, or
mergers or acquisitions, the fastest way to get the application to Azure might be to rehost, followed by
refactoring to take advantage of cloud capabilities.
Rearchitect or rebuild
If there are applications serving similar needs in your portfolio, this might be an opportunity to rearchitect
or rebuild the entire solution.
If you want to implement multi-tier or microservices architecture for a monolithic app, you must rearchitect
or rebuild the app. If you don't mind retaining the monolithic structure, you might be able to rehost or
refactor.
Rearchitect or rebuild the app to take advantage of cloud capabilities if you plan to update the app more
often than yearly, if the app has peak or variable usage times, or if you expect the app to handle high traffic.
To decide between rearchitecting or rebuilding, assess the following factors. The largest scoring result indicates
your best strategy.

FACTOR   REARCHITECT   REBUILD

There are other applications serving similar needs in your portfolio. ✔ ✔
The application needs minor code changes to run in Azure. ✔
The application needs major, complex code changes to run in Azure. ✔
It's important to leverage existing code. ✔
You want to move a monolithic application to multi-tier architecture. ✔
You want to move a monolithic application to a microservices architecture. ✔ ✔
You expect this app to add breakthrough capabilities like AI, IoT, or bots. ✔
Among functionality, cost, infrastructure, and processes, functionality is the least efficient aspect of this application. ✔
The application requires third-party software installed on the servers. ✔
The application accesses the server's registry. ✔
The application sends emails and needs access to an SMTP server. ✔
The application uses SQL Server as its data store. ✔
The application stores data on local disks, and needs access to the disks to run properly. ✔
The application uses Windows services to process asynchronous operations, or needs external services to process data or operations. ✔
A web application saves its session state in process, rather than to an external data store. ✔
The app has peak and variable usage times and loads. ✔ ✔
You expect the application to handle high traffic. ✔ ✔
Technical workshops and training

The workshops, classes, and learning materials in this article provide technical training for serverless architectures
with Azure Functions. These resources help you and your team or customers understand and implement
application modernization and cloud-native apps.

Technical workshops
The Microsoft Cloud Workshop (MCW) program provides workshops you can host to foster cloud learning and
adoption. Each workshop includes presentation decks, trainer and student guides, and hands-on lab guides.
Contribute your own content and feedback to add to a robust database of training guides for deploying advanced
Azure workloads on the Microsoft Cloud Platform.
Workshops related to application development workloads include:
Serverless APIs in Azure. A set of entry-level exercises that cover the basics of building and managing
serverless APIs in Microsoft Azure, with Azure Functions, Azure API Management, and Azure Application
Insights.
Serverless architecture. Implement a series of Azure Functions that independently scale and break down
business logic to discrete components, allowing customers to pay only for the services they use.
App modernization. Design a modernization plan to move services from on-premises to the cloud by
leveraging cloud, web, and mobile services, secured by Azure Active Directory.
Modern cloud apps. Deploy, configure, and implement an end-to-end secure and Payment Card Industry (PCI)
compliant solution for e-commerce, based on Azure App Services, Azure Active Directory, and Azure DevOps.
Cloud-native applications. Using DevOps best practices, build a proof of concept (PoC) to transform a platform-
as-a-service (PaaS) application to a container-based application with multi-tenant web app hosting.
Continuous delivery in Azure DevOps. Set up and configure continuous delivery (CD) in Azure to reduce
manual errors, using Azure Resource Manager templates, Azure DevOps, and Git repositories for source
control.

Instructor-led training
Course AZ-204: Developing solutions for Microsoft Azure teaches developers how to create end-to-end solutions
in Microsoft Azure. Students learn how to implement Azure compute solutions, create Azure Functions, implement
and manage web apps, develop solutions utilizing Azure Storage, implement authentication and authorization, and
secure their solutions by using Azure Key Vault and managed identities. Students also learn to connect to and
consume Azure and third-party services, and include event- and message-based models in their solutions. The
course also covers monitoring, troubleshooting, and optimizing Azure solutions.

Serverless OpenHack
The Serverless OpenHack simulates a real-world scenario where a company wants to utilize serverless services to
build and release an API to integrate into their distributor's application. This OpenHack lets attendees quickly build
and deploy Azure serverless solutions with cutting-edge compute services like Azure Functions, Logic Apps, Event
Grid, Service Bus, Event Hubs, and Cosmos DB. The OpenHack also covers related technologies like API
Management, Azure DevOps or GitHub, Application Insights, Dynamics 365/Microsoft 365, and Cognitive APIs.
OpenHack attendees build serverless functions, web APIs, and a CI/CD pipeline to support them. They implement
further serverless technologies to integrate line of business (LOB) app workflows, process user and data telemetry,
and create key progress indicator (KPI)-aligned reports. By the end of the OpenHack, attendees have built out a full
serverless technical solution that can create workflows between systems and handle events, files, and data
ingestion.
Microsoft customer projects inspired these OpenHack challenges:
Configure the developer environment.
Create your first serverless function and workflow.
Build APIs to support business needs.
Deploy a management layer for APIs and monitoring usage.
Build a LOB workflow process.
Process large amounts of unstructured file data.
Process large amounts of incoming event data.
Implement a publisher/subscriber messaging pattern and virtual network integration.
Conduct sentiment analysis.
Perform data aggregation, analysis, and reporting.
To attend an OpenHack, register at https://openhack.microsoft.com. Enterprises with many engineers can also
work with Microsoft to organize a dedicated Serverless OpenHack.

Microsoft Learn
Microsoft Learn is a free, online training platform that provides interactive learning for Microsoft products. The
goal is to improve proficiency with fun, guided, hands-on content that's specific to your role and goals. Learning
paths are collections of modules that are organized around specific roles like developer, architect, or system admin,
or technologies like Azure Web Apps, Azure Functions, or Azure SQL DB. Learning paths provide understanding of
different aspects of the technology or role.
Learning paths about serverless apps and Azure Functions include:
Create serverless applications. Learn how to leverage functions to execute server-side logic and build serverless
architectures.
Architect message brokering and serverless applications in Azure. Learn how to create reliable messaging for
your applications, and how to take advantage of serverless application services in Azure.
Search all Functions-related learning paths.

Hands-on labs and how-to guides


Build a serverless web app, from the Build 2018 conference
Build a Serverless IoT Solution with Python Azure Functions and SignalR

Next steps
Execute an application assessment
Identify and execute a PoC or Pilot project
Proof of Concept or pilot

When driving a technical and security decision for your company or customer, a Proof of Concept (PoC) or pilot is
an opportunity to deliver evidence that the proposed solution solves the business problems. The PoC or pilot
increases the likelihood of a successful adoption.
A PoC:
Demonstrates that a business model or idea is feasible and will work to solve the business problem
Usually involves one to three features or capabilities
Can be in one or multiple technologies
Is usually geared toward a particular scenario, and proves what the customer needs to know to make the
technical or security decision
Is used only as a demonstration and won't go into production
Is IT-driven and enablement-driven
A pilot:
Is a test run or trial of a proposed action or product
Lasts longer than a PoC, often weeks or months
Has a higher return on investment (ROI) than a PoC
Is built in a pre-production or trial environment, with the intent that it will then go into production
Is adoption-driven and consumption-driven

PoC and pilot best practices


Be aware of compliance issues when working in a customer's environment, and make sure your actions are always
legal and compliant.
Touching or altering the customer's environment usually requires a contract, and may involve a partner or
Microsoft services. Without a contract, your company may be liable for issues or damages.
Governance may require Legal department approval. Your company may not be able to give intellectual
property (IP) away for free. You may need a legal contract or contracts to specify whether your company or the
customer pays for the IP.
Get disclosure guidance when dealing with non-disclosure agreements (NDAs), product roadmaps, NDA
features, or anything not released to the general public.
In a pilot, don't use a trial Microsoft Developer Network (MSDN) environment, or any environment that you
own.
Use properly licensed software, and ask the opportunity owner to make sure software licensing is handled
correctly.
The customer, partner, or your company may pay for the PoC or pilot. Depending on the size of the contract, the
ROI, and the cost of sale, one group may cover it all, or a combination of all three parties may cover the cost.
Ensure that your company or customer has some investment in the PoC or pilot. If they don't, this can be a red flag
signaling that your company or customer doesn't yet see value in the solution.

PoC and pilot process


The Technical Decision Maker (TDM) is responsible for driving an adoption decision and for ensuring that the
right partners and resources are involved in a PoC or pilot. As a TDM, make sure you're aware of the partners in
your product and service area or region, and of their key service offerings around your product service area.
Planning
Consider the following health questions:
Do you have a good technical plan, including key decision makers and Microsoft potential?
Can you deliver the needed assurance without a PoC?
Should you switch to a pilot?
What are the detailed scope and decision criteria your team or customer agreed to?
If you meet the criteria, will your company or customer buy or deploy the solution?
Do the following tasks:
Analyze risk.
Evaluate the setting.
Do the preparation.
Consider workloads and human resources.
Present PoC or pilot health status.
Fulfill technical prerequisites.
Define the go/no go decision.
Create a final project plan specification.
Execution
For the execution phase:
Determine who kicks off the presentation.
Schedule the meeting in the morning, if possible.
Prepare the demos and slides.
Conduct a dry run to refine the presentation.
Get feedback.
Involve your company or customer team.
Complete the win/lose statement.
Debriefing
During the debriefing phase, consider:
Whether criteria were met or not met
Stakeholder investment
Initiating deployment
Finding a partner and training
Lessons learned
Corrections or extensions of PoC or pilot guidance
Archiving of valuable deliverables

Change management
Change management uses tested methods and techniques to avoid errors and minimize impact when
administering change.
Ideally, a pilot includes a cross-section of users, to address any potential issues or problems that arise. Users may
be comfortable and familiar with their old technology, and have difficulty moving into new technical solutions.
Change management keeps this in mind, and helps the user understand the reasons behind the change and the
impact the change will make.
This understanding is part of a pilot, and addresses everyone who has a stake in the project. A pilot is better than a
PoC, because the customer is more involved, so they're more likely to implement the change.
The pilot includes a detailed follow up through surveys or focus groups. The feedback can prove and improve the
change.

Next steps
Execute an application assessment
Promote a technical workshop or training
Code a technical implementation with the team or customer

Related resources
Prosci® change management training
Application development and deployment

To develop and deploy serverless applications with Azure Functions, examine patterns and practices, configure
DevOps pipelines, and implement site reliability engineering (SRE) best practices.
For detailed information about serverless architectures and Azure Functions, see:
Serverless apps: Architecture, patterns, and Azure implementation
Azure Serverless Computing Cookbook
Example serverless reference architectures

Planning
To plan app development and deployment:
1. Prepare development environment and set up workflow.
2. Structure projects to support Azure Functions app development.
3. Identify app triggers, bindings, and configuration requirements.
Understand event-driven architecture
Every function in a serverless Functions project runs in response to a trigger event. For more information about
event-driven architectures, see:
Event-driven architecture style.
Event-driven design patterns to enhance existing applications using Azure Functions
Prepare development environment
Set up your development workflow and environment with the tools to create Functions. For details about
development tools and Functions code project structure, see:
Code and test Azure Functions locally
Develop Azure Functions by using Visual Studio Code
Develop Azure Functions using Visual Studio
Work with Azure Functions Core Tools
Folder structure

Development
Decide on the development language to use. Azure Functions supports C#, F#, PowerShell, JavaScript, TypeScript,
Java, and Python. All of a project's Functions must be in the same language. For more information, see Supported
languages in Azure Functions.
Define triggers and bindings
A trigger invokes a function, and every function must have exactly one trigger. A binding declaratively
connects another resource to the function. For more information about Functions triggers and bindings, see:
Azure Functions triggers and bindings concepts
Execute an Azure Function with triggers
Chain Azure Functions together using input and output bindings
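For illustration only, the following minimal sketch shows how a trigger and an output binding are declared through attributes. It isn't part of any reference solution: the function name, queue name, and query parameter are hypothetical, and the queue binding uses the default AzureWebJobsStorage connection.

using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;

namespace ExampleFunctionApp
{
    public static class EnqueueOrderFunction
    {
        // The HTTP trigger invokes the function; the Queue attribute declares an output
        // binding that writes a message to a storage queue named "orders".
        [FunctionName("EnqueueOrder")]
        public static IActionResult Run(
            [HttpTrigger(AuthorizationLevel.Function, "post", Route = null)] HttpRequest req,
            [Queue("orders")] out string orderMessage,
            ILogger log)
        {
            string item = req.Query["item"];
            orderMessage = string.IsNullOrEmpty(item) ? "unknown-item" : item;

            log.LogInformation("Queued order message for item {Item}", orderMessage);
            return new AcceptedResult();
        }
    }
}

Because the trigger and binding are declared entirely through attributes, the function body contains only business logic.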
Create the Functions application
Functions follow the single responsibility principle: do only one thing. For more information about Functions
development, see:
Azure Functions developers guide
Create serverless applications
Strategies for testing your code in Azure Functions
Functions best practices
Use Durable Functions for stateful workflows
Durable Functions in Azure Functions let you define stateful workflows in a serverless environment by writing
orchestrator functions, and stateful entities by writing entity functions. Durable Functions manage state,
checkpoints, and restarts, allowing you to focus on business logic. For more information, see What are Durable
Functions.
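As a rough sketch of this pattern, assuming the Durable Functions extension (Microsoft.Azure.WebJobs.Extensions.DurableTask) and purely illustrative function names, an orchestrator function calls activity functions while the framework checkpoints progress between calls:

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

namespace ExampleFunctionApp
{
    public static class GreetingOrchestration
    {
        // Orchestrator function: defines the stateful workflow. Durable Functions checkpoints
        // progress, so the orchestrator can replay safely after a restart.
        [FunctionName("GreetingOrchestrator")]
        public static async Task<List<string>> RunOrchestrator(
            [OrchestrationTrigger] IDurableOrchestrationContext context)
        {
            var results = new List<string>
            {
                await context.CallActivityAsync<string>("SayHello", "Tokyo"),
                await context.CallActivityAsync<string>("SayHello", "Seattle")
            };
            return results;
        }

        // Activity function: performs a single unit of work called by the orchestrator.
        [FunctionName("SayHello")]
        public static string SayHello([ActivityTrigger] string city)
        {
            return $"Hello, {city}!";
        }
    }
}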
Understand and address cold starts
If the number of serverless host instances scales down to zero, the next request has the added latency of restarting
the Function app, called a cold start. To minimize the performance impact of cold starts, reduce dependencies that
the Functions app needs to load on startup, and use as few large, synchronous calls and operations as possible.
For more information about autoscaling and cold starts, see Serverless Functions operations.
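One common way to reduce startup work is to create expensive objects lazily and reuse them across invocations. The following sketch assumes the Microsoft.Azure.Cosmos SDK and a hypothetical CosmosDBConnection app setting:

using System;
using Microsoft.Azure.Cosmos;

namespace ExampleFunctionApp
{
    public static class SharedClients
    {
        // The client is created on first use, not at host startup, and is then reused by every
        // invocation, so the initialization cost isn't repeated for each request after a cold start.
        private static readonly Lazy<CosmosClient> lazyCosmosClient = new Lazy<CosmosClient>(() =>
            new CosmosClient(Environment.GetEnvironmentVariable("CosmosDBConnection")));

        public static CosmosClient CosmosClient => lazyCosmosClient.Value;
    }
}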
Manage application secrets
For security, don't store credentials in application code. To use Azure Key Vault with Azure Functions to store and
retrieve keys and credentials, see Use Key Vault references for App Service and Azure Functions.
For more information about serverless Functions application security, see Serverless Functions security.
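As a minimal sketch (the function name and the ServiceApiKey setting are hypothetical), a function can read a secret from an app setting, which can in turn be backed by a Key Vault reference, so the raw value never appears in code or source control:

using System;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;

namespace ExampleFunctionApp
{
    public static class SecretUsageFunction
    {
        [FunctionName("SecretUsageFunction")]
        public static IActionResult Run(
            [HttpTrigger(AuthorizationLevel.Function, "get", Route = null)] HttpRequest req,
            ILogger log)
        {
            // "ServiceApiKey" is a hypothetical app setting. In production it can resolve to a
            // Key Vault secret through a Key Vault reference, so the code never stores the secret.
            string apiKey = Environment.GetEnvironmentVariable("ServiceApiKey");

            if (string.IsNullOrEmpty(apiKey))
            {
                log.LogWarning("The ServiceApiKey app setting is not configured.");
                return new StatusCodeResult(500);
            }

            // Use the key to call a downstream service here.
            return new OkResult();
        }
    }
}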

Deployment
To prepare serverless Functions application for production, make sure you can:
Fulfill application resource requirements.
Monitor all aspects of the application.
Diagnose and troubleshoot application issues.
Deploy new application versions without affecting production systems.
Define deployment technology
Decide on deployment technology, and organize scheduled releases. For more information about how Functions
app deployment enables reliable, zero-downtime upgrades, see Deployment technologies in Azure Functions.
Avoid using too many resource connections
Functions in a Functions app share resources, including HTTP connections, database connections, and connections
to services such as Azure Storage. When many functions run concurrently, it's possible to run out of available
connections. For more information, see Manage connections in Azure Functions.
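A common way to avoid connection exhaustion is to create clients once and share them across invocations, as in this minimal sketch (the function name and target URL are hypothetical):

using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

namespace ExampleFunctionApp
{
    public static class ProxyFunction
    {
        // A single static HttpClient is shared by all invocations in the host, instead of
        // creating (and exhausting) new connections on every invocation.
        private static readonly HttpClient httpClient = new HttpClient();

        [FunctionName("ProxyFunction")]
        public static async Task<IActionResult> Run(
            [HttpTrigger(AuthorizationLevel.Function, "get", Route = null)] HttpRequest req)
        {
            // The URL is illustrative only.
            string body = await httpClient.GetStringAsync("https://example.com/health");
            return new OkObjectResult(body);
        }
    }
}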
Configure logging, alerting, and application monitoring
Application Insights in Azure Monitor collects log, performance, and error data. Application Insights automatically
detects performance anomalies, and includes powerful analytics tools to help diagnose issues and understand
function usage.
For more information about application monitoring and logging, see:
Monitor Azure Functions
Monitoring Azure Functions with Azure Monitor Logs
Application Insights for Azure Functions supported features
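For example, a function can combine structured ILogger entries with custom Application Insights metrics, as in this sketch (the class name, queue name, and metric name are hypothetical; the Functions runtime registers TelemetryClient for injection when Application Insights is enabled):

using Microsoft.ApplicationInsights;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

namespace ExampleFunctionApp
{
    public class OrderMetricsFunction
    {
        private readonly TelemetryClient telemetryClient;

        public OrderMetricsFunction(TelemetryClient telemetryClient)
        {
            this.telemetryClient = telemetryClient;
        }

        [FunctionName("OrderMetricsFunction")]
        public void Run([QueueTrigger("orders")] string order, ILogger log)
        {
            // Structured log entry, queryable in Azure Monitor Logs.
            log.LogInformation("Processing order message {Order}", order);

            // Custom metric that appears in Application Insights.
            telemetryClient.GetMetric("OrdersProcessed").TrackValue(1);
        }
    }
}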
Diagnose and troubleshoot issues
Learn how to effectively use diagnostics for troubleshooting in proactive and problem-first scenarios. For more
information, see:
Keep your Azure App Service and Azure Functions apps healthy and happy
Troubleshoot error: "Azure Functions Runtime is unreachable"
Deploy applications using an automated pipeline and DevOps
Full automation of all steps from code commit to production deployment lets teams focus on building code, and
removes the overhead and potential human error of manual steps. Deploying new code is quicker and less risky,
helping teams become more agile, more productive, and more confident about their code.
For more information about DevOps and continuous deployment (CD), see:
Continuous deployment for Azure Functions
Continuous delivery by using Azure DevOps
Continuous delivery by using GitHub Action

Optimization
Once the application is in production, prepare for scaling and implement site reliability engineering (SRE).
Ensure optimal scalability
For information about factors that impact Functions app scalability, see:
Scalability best practices
Performance and scale in Durable Functions
Implement SRE practices
Site Reliability Engineering (SRE) is a proven approach to maintaining crucial system and application reliability,
while iterating at the speed the marketplace demands. For more information, see:
Introduction to Site Reliability Engineering (SRE)
DevOps at Microsoft: Game streaming SRE

Next steps
For hands-on serverless Functions app development and deployment walkthroughs, see:
Serverless Functions code walkthrough
CI/CD for a serverless frontend
For an engineering playbook to help teams and customers successfully implement serverless Functions projects,
see the Code-With Customer/Partner Engineering Playbook.
Code walkthrough: Serverless application with
Functions

Serverless models abstract code from the underlying compute infrastructure, allowing developers to focus on
business logic without extensive setup. Serverless code reduces costs, because you pay only for the code execution
resources and duration.
The serverless event-driven model fits situations where a certain event triggers a defined action. For example,
receiving an incoming device message triggers storage for later use, or a database update triggers some further
processing.
To help you explore Azure serverless technologies in Azure, Microsoft developed and tested a serverless
application that uses Azure Functions. This article walks through the code for the serverless Functions solution,
and describes design decisions, implementation details, and some of the "gotchas" you might encounter.

Explore the solution


The two-part solution describes a hypothetical drone delivery system. Drones send in-flight status to the cloud,
which stores these messages for later use. A web app lets users retrieve the messages to get the latest status of the
devices.
You can download the code for this solution from GitHub.
This walkthrough assumes basic familiarity with the following technologies:
Azure Functions
Azure Event Hubs
.NET Core
You don't need to be an expert in Functions or Event Hubs, but you should understand their features at a high
level. Here are some good resources to get started:
An introduction to Azure Functions
Features and terminology in Azure Event Hubs

Understand the scenario

Fabrikam manages a fleet of drones for a drone delivery service. The application consists of two main functional
areas:
Event ingestion . During flight, drones send status messages to a cloud endpoint. The application ingests
and processes these messages, and writes the results to a back-end database (Cosmos DB). The devices
send messages in protocol buffer (protobuf) format. Protobuf is an efficient, self-describing serialization
format.
These messages contain partial updates. At a fixed interval, each drone sends a "key frame" message that
contains all of the status fields. Between key frames, the status messages only include fields that changed
since the last message. This behavior is typical of many IoT devices that need to conserve bandwidth and
power.
Web app . A web application allows users to look up a device and query the device's last-known status.
Users must sign into the application and authenticate with Azure Active Directory (Azure AD). The
application only allows requests from users who are authorized to access the app.
Here's a screenshot of the web app, showing the result of a query:

Design the application


Fabrikam has decided to use Azure Functions to implement the application business logic. Azure Functions is an
example of "Functions as a Service" (FaaS). In this computing model, a function is a piece of code that is deployed
to the cloud and runs in a hosting environment. This hosting environment completely abstracts the servers that
run the code.
Why choose a serverless approach?
A serverless architecture with Functions is an example of an event-driven architecture. The function code is
triggered by some event that's external to the function: in this case, either a message from a drone, or an HTTP
request from a client application. With a function app, you don't need to write any code for the trigger. You only
write the code that runs in response to the trigger. That means you can focus on your business logic, rather than
writing a lot of code to handle infrastructure concerns like messaging.
There are also some operational advantages to using a serverless architecture:
There is no need to manage servers.
Compute resources are allocated dynamically as needed.
You are charged only for the compute resources used to execute your code.
The compute resources scale on demand based on traffic.
Architecture
The following diagram shows the high-level architecture of the application:

In one data flow, arrows show messages going from Devices to Event Hubs and triggering the function app. From
the app, one arrow shows dead-letter messages going to a storage queue, and another arrow shows writing to
Azure Cosmos DB. In another data flow, arrows show the client web app getting static files from Blob storage static
web hosting, through a CDN. Another arrow shows the client HTTP request going through API Management. From
API Management, one arrow shows the function app triggering and reading data from Azure Cosmos DB. Another
arrow shows authentication through Azure AD. A User also signs in to Azure AD.
Event ingestion:
1. Drone messages are ingested by Azure Event Hubs.
2. Event Hubs produces a stream of events that contain the message data.
3. These events trigger an Azure Functions app to process them.
4. The results are stored in Cosmos DB.
Web app:
1. Static files are served by CDN from Blob storage.
2. A user signs into the web app using Azure AD.
3. Azure API Management acts as a gateway that exposes a REST API endpoint.
4. HTTP requests from the client trigger an Azure Functions app that reads from Cosmos DB and returns the
result.
This application is based on two reference architectures, corresponding to the two functional blocks described
above:
Serverless event processing using Azure Functions
Serverless web application on Azure
You can read those articles to learn more about the high-level architecture, the Azure services that are used in the
solution, and considerations for scalability, security, and reliability.

Drone telemetry function


Let's start by looking at the function that processes drone messages from Event Hubs. The function is defined in a
class named RawTelemetryFunction :
namespace DroneTelemetryFunctionApp
{
public class RawTelemetryFunction
{
private readonly ITelemetryProcessor telemetryProcessor;
private readonly IStateChangeProcessor stateChangeProcessor;
private readonly TelemetryClient telemetryClient;

public RawTelemetryFunction(ITelemetryProcessor telemetryProcessor, IStateChangeProcessor stateChangeProcessor, TelemetryClient telemetryClient)
{
this.telemetryProcessor = telemetryProcessor;
this.stateChangeProcessor = stateChangeProcessor;
this.telemetryClient = telemetryClient;
}
}
...
}

This class has several dependencies, which are injected into the constructor using dependency injection:
The ITelemetryProcessor and IStateChangeProcessor interfaces define two helper objects. As we'll see,
these objects do most of the work.
The TelemetryClient is part of the Application Insights SDK. It is used to send custom application metrics to
Application Insights.
Later, we'll look at how to configure the dependency injection. For now, just assume these dependencies exist.

Configure the Event Hubs trigger


The logic in the function is implemented as an asynchronous method named RunAsync . Here is the method
signature:

[FunctionName("RawTelemetryFunction")]
[StorageAccount("DeadLetterStorage")]
public async Task RunAsync(
[EventHubTrigger("%EventHubName%", Connection = "EventHubConnection", ConsumerGroup
="%EventHubConsumerGroup%")]EventData[] messages,
[Queue("deadletterqueue")] IAsyncCollector<DeadLetterMessage> deadLetterMessages,
ILogger logger)
{
// implementation goes here
}

The method takes the following parameters:


messages is an array of event hub messages.
deadLetterMessages is an Azure Storage Queue, used for storing dead letter messages.
logger provides a logging interface for writing application logs. These logs are sent to Azure Monitor.

The EventHubTrigger attribute on the messages parameter configures the trigger. The properties of the attribute
specify an event hub name, a connection string, and a consumer group. (A consumer group is an isolated view of
the Event Hubs event stream. This abstraction allows for multiple consumers of the same event hub.)
Notice the percent signs (%) in some of the attribute properties. These indicate that the property specifies the
name of an app setting, and the actual value is taken from that app setting at run time. Otherwise, without percent
signs, the property gives the literal value.
The Connection property is an exception. This property always specifies an app setting name, never a literal value,
so the percent sign is not needed. The reason for this distinction is that a connection string is secret and should
never be checked into source code.
While the other two properties (event hub name and consumer group) are not sensitive data like a connection
string, it's still better to put them into app settings, rather than hard coding. That way, they can be updated without
recompiling the app.
For more information about configuring this trigger, see Azure Event Hubs bindings for Azure Functions.

Message processing logic


Here's the implementation of the RawTelemetryFunction.RunAsync method that processes a batch of messages:

[FunctionName("RawTelemetryFunction")]
[StorageAccount("DeadLetterStorage")]
public async Task RunAsync(
[EventHubTrigger("%EventHubName%", Connection = "EventHubConnection", ConsumerGroup
="%EventHubConsumerGroup%")]EventData[] messages,
[Queue("deadletterqueue")] IAsyncCollector<DeadLetterMessage> deadLetterMessages,
ILogger logger)
{
telemetryClient.GetMetric("EventHubMessageBatchSize").TrackValue(messages.Length);

foreach (var message in messages)
{
DeviceState deviceState = null;

try
{
deviceState = telemetryProcessor.Deserialize(message.Body.Array, logger);

try
{
await stateChangeProcessor.UpdateState(deviceState, logger);
}
catch (Exception ex)
{
logger.LogError(ex, "Error updating status document", deviceState);
await deadLetterMessages.AddAsync(new DeadLetterMessage { Exception = ex, EventData = message,
DeviceState = deviceState });
}
}
catch (Exception ex)
{
logger.LogError(ex, "Error deserializing message", message.SystemProperties.PartitionKey,
message.SystemProperties.SequenceNumber);
await deadLetterMessages.AddAsync(new DeadLetterMessage { Exception = ex, EventData = message });
}
}
}

When the function is invoked, the messages parameter contains an array of messages from the event hub.
Processing messages in batches will generally yield better performance than reading one message at a time.
However, you have to make sure the function is resilient and handles failures and exceptions gracefully. Otherwise,
if the function throws an unhandled exception in the middle of a batch, you might lose the remaining messages.
This consideration is discussed in more detail in the section Error handling.
But if you ignore the exception handling, the processing logic for each message is simple:
1. Call ITelemetryProcessor.Deserialize to deserialize the message that contains a device state change.
2. Call IStateChangeProcessor.UpdateState to process the state change.
Let's look at these two methods in more detail, starting with the Deserialize method.
Deserialize method
The TelemetryProcess.Deserialize method takes a byte array that contains the message payload. It deserializes
this payload and returns a DeviceState object, which represents the state of a drone. The state may represent a
partial update, containing just the delta from the last-known state. Therefore, the method needs to handle null
fields in the deserialized payload.

public class TelemetryProcessor : ITelemetryProcessor


{
private readonly ITelemetrySerializer<DroneState> serializer;

public TelemetryProcessor(ITelemetrySerializer<DroneState> serializer)


{
this.serializer = serializer;
}

public DeviceState Deserialize(byte[] payload, ILogger log)


{
DroneState restored = serializer.Deserialize(payload);

log.LogInformation("Deserialize message for device ID {DeviceId}", restored.DeviceId);

var deviceState = new DeviceState();
deviceState.DeviceId = restored.DeviceId;

if (restored.Battery != null)
{
deviceState.Battery = restored.Battery;
}
if (restored.FlightMode != null)
{
deviceState.FlightMode = (int)restored.FlightMode;
}
if (restored.Position != null)
{
deviceState.Latitude = restored.Position.Value.Latitude;
deviceState.Longitude = restored.Position.Value.Longitude;
deviceState.Altitude = restored.Position.Value.Altitude;
}
if (restored.Health != null)
{
deviceState.AccelerometerOK = restored.Health.Value.AccelerometerOK;
deviceState.GyrometerOK = restored.Health.Value.GyrometerOK;
deviceState.MagnetometerOK = restored.Health.Value.MagnetometerOK;
}
return deviceState;
}
}

This method uses another helper interface, ITelemetrySerializer<T> , to deserialize the raw message. The results
are then transformed into a POCO model that is easier to work with. This design helps to isolate the processing
logic from the serialization implementation details. The ITelemetrySerializer<T> interface is defined in a shared
library, which is also used by the device simulator to generate simulated device events and send them to Event
Hubs.
using System;

namespace Serverless.Serialization
{
public interface ITelemetrySerializer<T>
{
T Deserialize(byte[] message);

ArraySegment<byte> Serialize(T message);


}
}

UpdateState method
The StateChangeProcessor.UpdateState method applies the state changes. The last-known state for each drone is
stored as a JSON document in Cosmos DB. Because the drones send partial updates, the application can't simply
overwrite the document when it gets an update. Instead, it needs to fetch the previous state, merge the fields, and
then perform an upsert operation.
public class StateChangeProcessor : IStateChangeProcessor
{
private IDocumentClient client;
private readonly string cosmosDBDatabase;
private readonly string cosmosDBCollection;

public StateChangeProcessor(IDocumentClient client, IOptions<StateChangeProcessorOptions> options)


{
this.client = client;
this.cosmosDBDatabase = options.Value.COSMOSDB_DATABASE_NAME;
this.cosmosDBCollection = options.Value.COSMOSDB_DATABASE_COL;
}

public async Task<ResourceResponse<Document>> UpdateState(DeviceState source, ILogger log)


{
log.LogInformation("Processing change message for device ID {DeviceId}", source.DeviceId);

DeviceState target = null;

try
{
var response = await client.ReadDocumentAsync(UriFactory.CreateDocumentUri(cosmosDBDatabase,
cosmosDBCollection, source.DeviceId),
new RequestOptions { PartitionKey = new
PartitionKey(source.DeviceId) });

target = (DeviceState)(dynamic)response.Resource;

// Merge properties
target.Battery = source.Battery ?? target.Battery;
target.FlightMode = source.FlightMode ?? target.FlightMode;
target.Latitude = source.Latitude ?? target.Latitude;
target.Longitude = source.Longitude ?? target.Longitude;
target.Altitude = source.Altitude ?? target.Altitude;
target.AccelerometerOK = source.AccelerometerOK ?? target.AccelerometerOK;
target.GyrometerOK = source.GyrometerOK ?? target.GyrometerOK;
target.MagnetometerOK = source.MagnetometerOK ?? target.MagnetometerOK;
}
catch (DocumentClientException ex)
{
if (ex.StatusCode == System.Net.HttpStatusCode.NotFound)
{
target = source;
}
}

var collectionLink = UriFactory.CreateDocumentCollectionUri(cosmosDBDatabase, cosmosDBCollection);


return await client.UpsertDocumentAsync(collectionLink, target);
}
}

This code uses the IDocumentClient interface to fetch a document from Cosmos DB. If the document exists, the
new state values are merged into the existing document. Otherwise, a new document is created. Both cases are
handled by the UpsertDocumentAsync method.
This code is optimized for the case where the document already exists and can be merged. On the first telemetry
message from a given drone, the ReadDocumentAsync method will throw an exception, because there is no
document for that drone. After the first message, the document will be available.
Notice that this class uses dependency injection to inject the IDocumentClient for Cosmos DB and an IOptions<T>
with configuration settings. We'll see how to set up the dependency injection later.
NOTE
Azure Functions supports an output binding for Cosmos DB. This binding lets the function app write documents in Cosmos
DB without any code. However, the output binding won't work for this particular scenario, because of the custom upsert
logic that's needed.
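For comparison, in a scenario without custom merge logic, a Cosmos DB output binding could write the document with no client code at all. The following sketch is illustrative only and is not part of the reference solution; it reuses the walkthrough's DeviceState type and Cosmos DB app setting names, and the function and queue names are hypothetical.

using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

namespace DroneTelemetryFunctionApp
{
    public static class SaveDeviceStateFunction
    {
        // The Cosmos DB output binding persists the document after the function returns,
        // so there is no opportunity to read and merge an existing document first.
        [FunctionName("SaveDeviceStateFunction")]
        public static void Run(
            [QueueTrigger("device-state")] DeviceState deviceState,
            [CosmosDB(
                databaseName: "%COSMOSDB_DATABASE_NAME%",
                collectionName: "%COSMOSDB_DATABASE_COL%",
                ConnectionStringSetting = "COSMOSDB_CONNECTION_STRING")] out dynamic document,
            ILogger log)
        {
            log.LogInformation("Saving state for device ID {DeviceId}", deviceState.DeviceId);
            document = deviceState;
        }
    }
}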

Error handling
As mentioned earlier, the RawTelemetryFunction function app processes a batch of messages in a loop. That means
the function needs to handle any exceptions gracefully and continue processing the rest of the batch. Otherwise,
messages might get dropped.
If an exception is encountered when processing a message, the function puts the message onto a dead-letter
queue:

catch (Exception ex)
{
logger.LogError(ex, "Error deserializing message", message.SystemProperties.PartitionKey,
message.SystemProperties.SequenceNumber);
await deadLetterMessages.AddAsync(new DeadLetterMessage { Exception = ex, EventData = message });
}

The dead-letter queue is defined using an output binding to a storage queue:

[FunctionName("RawTelemetryFunction")]
[StorageAccount("DeadLetterStorage")] // App setting that holds the connection string
public async Task RunAsync(
[EventHubTrigger("%EventHubName%", Connection = "EventHubConnection", ConsumerGroup
="%EventHubConsumerGroup%")]EventData[] messages,
[Queue("deadletterqueue")] IAsyncCollector<DeadLetterMessage> deadLetterMessages, // output binding
ILogger logger)

Here the Queue attribute specifies the output binding, and the StorageAccount attribute specifies the name of an
app setting that holds the connection string for the storage account.
Deployment tip: In the Resource Manager template that creates the storage account, you can automatically
populate an app setting with the connection string. The trick is to use the listKeys function.
Here is the section of the template that creates the storage account for the queue:

{
"name": "[variables('droneTelemetryDeadLetterStorageQueueAccountName')]",
"type": "Microsoft.Storage/storageAccounts",
"location": "[resourceGroup().location]",
"apiVersion": "2017-10-01",
"sku": {
"name": "[parameters('storageAccountType')]"
},

Here is the section of the template that creates the function app.
{
"apiVersion": "2015-08-01",
"type": "Microsoft.Web/sites",
"name": "[variables('droneTelemetryFunctionAppName')]",
"location": "[resourceGroup().location]",
"tags": {
"displayName": "Drone Telemetry Function App"
},
"kind": "functionapp",
"dependsOn": [
"[resourceId('Microsoft.Web/serverfarms', variables('hostingPlanName'))]",
...
],
"properties": {
"serverFarmId": "[resourceId('Microsoft.Web/serverfarms', variables('hostingPlanName'))]",
"siteConfig": {
"appSettings": [
{
"name": "DeadLetterStorage",
"value": "[concat('DefaultEndpointsProtocol=https;AccountName=',
variables('droneTelemetryDeadLetterStorageQueueAccountName'), ';AccountKey=',
listKeys(variables('droneTelemetryDeadLetterStorageQueueAccountId'),'2015-05-01-preview').key1)]"
},
...

This defines an app setting named DeadLetterStorage whose value is populated using the listKeys function. It's
important to make the function app resource depend on the storage account resource (see the dependsOn
element). This guarantees that the storage account is created first and the connection string is available.

Setting up dependency injection


The following code sets up dependency injection for the RawTelemetryFunction function:

[assembly: FunctionsStartup(typeof(DroneTelemetryFunctionApp.Startup))]

namespace DroneTelemetryFunctionApp
{
public class Startup : FunctionsStartup
{
public override void Configure(IFunctionsHostBuilder builder)
{
builder.Services.AddOptions<StateChangeProcessorOptions>()
.Configure<IConfiguration>((configSection, configuration) =>
{
configuration.Bind(configSection);
});

builder.Services.AddTransient<ITelemetrySerializer<DroneState>, TelemetrySerializer<DroneState>>();
builder.Services.AddTransient<ITelemetryProcessor, TelemetryProcessor>();
builder.Services.AddTransient<IStateChangeProcessor, StateChangeProcessor>();

builder.Services.AddSingleton<IDocumentClient>(ctx => {
var config = ctx.GetService<IConfiguration>();
var cosmosDBEndpoint = config.GetValue<string>("CosmosDBEndpoint");
var cosmosDBKey = config.GetValue<string>("CosmosDBKey");
return new DocumentClient(new Uri(cosmosDBEndpoint), cosmosDBKey);
});
}
}
}
Azure Functions written for .NET can use the ASP.NET Core dependency injection framework. The basic idea is that
you declare a startup method for your assembly. The method takes an IFunctionsHostBuilder interface, which is
used to declare the dependencies for DI. You do this by calling the Add* methods on the Services object. When you
add a dependency, you specify its lifetime:
Transient objects are created each time they're requested.
Scoped objects are created once per function execution.
Singleton objects are reused across function executions, within the lifetime of the function host.
In this example, the TelemetryProcessor and StateChangeProcessor objects are declared as transient. This is
appropriate for lightweight, stateless services. The DocumentClient class, on the other hand, should be a singleton
for best performance. For more information, see Performance tips for Azure Cosmos DB and .NET.
If you refer back to the code for the RawTelemetryFunction, you'll see that there's another dependency that doesn't
appear in the DI setup code, namely the TelemetryClient class that is used to log application metrics. The Functions
runtime automatically registers this class in the DI container, so you don't need to register it explicitly.
For more information about DI in Azure Functions, see the following articles:
Use dependency injection in .NET Azure Functions
Dependency injection in ASP.NET Core
Passing configuration settings in DI
Sometimes an object must be initialized with some configuration values. Generally, these settings should come
from app settings or (in the case of secrets) from Azure Key Vault.
There are two examples in this application. First, the DocumentClient class takes a Cosmos DB service endpoint
and key. For this object, the application registers a lambda that will be invoked by the DI container. This lambda
uses the IConfiguration interface to read the configuration values:

builder.Services.AddSingleton<IDocumentClient>(ctx => {
var config = ctx.GetService<IConfiguration>();
var cosmosDBEndpoint = config.GetValue<string>("CosmosDBEndpoint");
var cosmosDBKey = config.GetValue<string>("CosmosDBKey");
return new DocumentClient(new Uri(cosmosDBEndpoint), cosmosDBKey);
});

The second example is the StateChangeProcessor class. For this object, we use an approach called the options
pattern. Here's how it works:
1. Define a class T that contains your configuration settings. In this case, the Cosmos DB database name and
collection name.

public class StateChangeProcessorOptions


{
public string COSMOSDB_DATABASE_NAME { get; set; }
public string COSMOSDB_DATABASE_COL { get; set; }
}

2. Add the class T as an options class for DI.

builder.Services.AddOptions<StateChangeProcessorOptions>()
.Configure<IConfiguration>((configSection, configuration) =>
{
configuration.Bind(configSection);
});
3. In the constructor of the class that is being configured, include an IOptions<T> parameter.

public StateChangeProcessor(IDocumentClient client, IOptions<StateChangeProcessorOptions> options)

The DI system will automatically populate the options class with configuration values and pass this to the
constructor.
There are several advantages of this approach:
Decouple the class from the source of the configuration values.
Easily set up different configuration sources, such as environment variables or JSON configuration files.
Simplify unit testing.
Use a strongly typed options class, which is less error prone than just passing in scalar values.

GetStatus function
The other Functions app in this solution implements a simple REST API to get the last-known status of a drone.
This function is defined in a class named GetStatusFunction . Here is the complete code for the function:
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;
using System.Security.Claims;
using System.Threading.Tasks;

namespace DroneStatusFunctionApp
{
public static class GetStatusFunction
{
public const string GetDeviceStatusRoleName = "GetStatus";

[FunctionName("GetStatusFunction")]
public static IActionResult Run(
[HttpTrigger(AuthorizationLevel.Function, "get", Route = null)]HttpRequest req,
[CosmosDB(
databaseName: "%COSMOSDB_DATABASE_NAME%",
collectionName: "%COSMOSDB_DATABASE_COL%",
ConnectionStringSetting = "COSMOSDB_CONNECTION_STRING",
Id = "{Query.deviceId}",
PartitionKey = "{Query.deviceId}")] dynamic deviceStatus,
ClaimsPrincipal principal,
ILogger log)
{
log.LogInformation("Processing GetStatus request.");

if (!principal.IsAuthorizedByRoles(new[] { GetDeviceStatusRoleName }, log))
{
return new UnauthorizedResult();
}

string deviceId = req.Query["deviceId"];


if (deviceId == null)
{
return new BadRequestObjectResult("Missing DeviceId");
}

if (deviceStatus == null)
{
return new NotFoundResult();
}
else
{
return new OkObjectResult(deviceStatus);
}
}
}
}

This function uses an HTTP trigger to process an HTTP GET request. The function uses a Cosmos DB input binding
to fetch the requested document. One consideration is that this binding will run before the authorization logic is
performed inside the function. If an unauthorized user requests a document, the function binding will still fetch the
document. Then the authorization code will return a 401, so the user won't see the document. Whether this
behavior is acceptable may depend on your requirements. For example, this approach might make it harder to
audit data access for sensitive data.

Authentication and authorization


The web app uses Azure AD to authenticate users. Because the app is a single-page application (SPA) running in
the browser, the implicit grant flow is appropriate:
1. The web app redirects the user to the identity provider (in this case, Azure AD).
2. The user enters their credentials.
3. The identity provider redirects back to the web app with an access token.
4. The web app sends a request to the web API and includes the access token in the Authorization header.

A Function application can be configured to authenticate users with zero code. For more information, see
Authentication and authorization in Azure App Service.
Authorization, on the other hand, generally requires some business logic. Azure AD supports claims-based
authentication. In this model, a user's identity is represented as a set of claims that come from the identity provider.
A claim can be any piece of information about the user, such as their name or email address.
The access token contains a subset of user claims. Among these are any application roles that the user is assigned
to.
The principal parameter of the function is a ClaimsPrincipal object that contains the claims from the access
token. Each claim is a key/value pair of claim type and claim value. The application uses these to authorize the
request.
The following extension method tests whether a ClaimsPrincipal object contains a set of roles. It returns false if
any of the specified roles is missing. If this method returns false, the function returns HTTP 401 (Unauthorized).

namespace DroneStatusFunctionApp
{
public static class ClaimsPrincipalAuthorizationExtensions
{
public static bool IsAuthorizedByRoles(
this ClaimsPrincipal principal,
string[] roles,
ILogger log)
{
var principalRoles = new HashSet<string>(principal.Claims.Where(kvp => kvp.Type ==
"roles").Select(kvp => kvp.Value));
var missingRoles = roles.Where(r => !principalRoles.Contains(r)).ToArray();
if (missingRoles.Length > 0)
{
log.LogWarning("The principal does not have the required {roles}", string.Join(", ",
missingRoles));
return false;
}

return true;
}
}
}
For more information about authentication and authorization in this application, see the Security considerations
section of the reference architecture.

Next steps
Once you get a feel for how this reference solution works, learn best practices and recommendations for similar
solutions.
For a serverless event ingestion solution, see Serverless event processing using Azure Functions.
For a serverless web app, see Serverless web application on Azure.
Azure Functions is just one Azure compute option. For help with choosing a compute technology, see Choose an
Azure compute service for your application.

Related resources
For in-depth discussion on developing serverless solutions on premises as well as in the cloud, read Serverless
apps: Architecture, patterns, and Azure implementation.
Read more about the Event-driven architecture style.
CI/CD for serverless application frontend on Azure

Serverless computing abstracts the servers, infrastructure, and operating systems, allowing developers to focus on
application development. A robust CI/CD (Continuous Integration/Continuous Delivery) process for such applications
lets companies ship fully tested and integrated software versions within minutes of development, and provides the
backbone of a modern DevOps environment.
What does CI/CD actually stand for?
Continuous Integration allows development teams to integrate code changes in a shared repository almost
instantaneously. This ability, coupled with automated build and testing before the changes are integrated,
ensures that only fully functional application code is available for deployment.
Continuous Delivery allows changes in the source code, configuration, content, and other artifacts to be
delivered to production, and ready to be deployed to end-users, as quickly and safely as possible. The process
keeps the code in a deployable state at all times. A special case of this is Continuous Deployment, which
includes actual deployment to end users.
This article discusses a CI/CD pipeline for the web frontend of a serverless reference implementation. This pipeline
is developed using Azure services. The web frontend demonstrates a modern web application, with client-side
JavaScript, reusable server-side APIs, and pre-built Markup, alternatively called Jamstack. You can find the code in
this GitHub repository. The readme describes the steps to download, build, and deploy the application.
The following diagram describes the CI/CD pipeline used in this sample frontend:

This article does not discuss the backend deployment.

Prerequisites
To work with this sample application, make sure you have the following:
A GitHub account.
An Azure account. If you don't have one, you can try out a free Azure account.
An Azure DevOps organization. If you don't have one, you can try out a basic plan, which includes DevOps
services such as Azure Pipelines.

Use an online version control system


Version control systems track and control changes in your source code. Keeping your source code in an online
version control system allows multiple development teams to collaborate, and is easier to maintain than a
traditional on-premises version control system. These online systems integrate easily with leading CI/CD services.
You get the ability to create and maintain the source code in multiple directories, along with build and
configuration files, in what is called a repository.
The project files for this sample application are kept in GitHub. If you don't have a GitHub account, read this
documentation to get started with GitHub repositories.

Automate your build and deploy


Using a CI/CD service such as Azure Pipelines can help you to automate the build and deploy processes. You can
create multiple stages in the pipeline, each stage running based on the result of the previous one. The stages can
run in either a Windows or Linux container. The script must make sure the tools and environments are set properly
in the container. Azure Pipelines can run a variety of build tools, and can work with quite a few online version
control systems.
Integrate build tools
Modern build tools can simplify your build process, and provide functionality such as pre-configuration,
minification of the JavaScript files, and static site generation. Static site generators can build markup files before
they are deployed to the hosting servers, resulting in a fast user experience. You can select from a variety of these
tools, based on the type of your application's programming language and platform, as well as additional
functionality needed. This article provides a list of popular build tools for a modern application.
The sample is a React application, built using Gatsby.js, which is a static site generator and front-end development
framework. These tools can be run locally during development and testing phases, and then integrated with Azure
Pipelines for the final deployment.
The sample uses the Gatsby file gatsby-ssr.js for rendering, and gatsby-config.js for site configuration. Gatsby
converts all JavaScript files under the pages subdirectory of the src folder to HTML files. Additional components
go in the components subdirectory. The sample also uses the gatsby-plugin-typescript plugin that allows using
TypeScript for type safety, instead of JavaScript.
For more information about setting up a Gatsby project, see the official Gatsby documentation.
Automate builds
Automating the build process reduces the human errors that can be introduced in manual processes. The file
azure-pipelines.yml includes the script for a two-stage automation. The Readme for this project describes the steps
required to set up the automation pipeline using Azure Pipelines. The following subsections show how the pipeline
stages are configured.
Build stage
Since the Azure Pipeline is integrated with the GitHub repository, any changes in the tracked directory of the master
branch trigger the first stage of the pipeline, the build stage:

trigger:
batch: true
branches:
include:
- master
paths:
include:
- src/ClientApp

The following snippet illustrates the start of the build stage, which starts an Ubuntu container to run this stage.
stages:
- stage: Build
jobs:
- job: WebsiteBuild
displayName: Build Fabrikam Drone Status app
pool:
vmImage: 'Ubuntu-16.04'
continueOnError: false
steps:

This is followed by tasks and scripts required to successfully build the project. These include the following:
Installing Node.js and setting up environment variables,
Installing and running Gatsby.js that builds the static website:

- script: |
cd src/ClientApp
npm install
npx gatsby build
displayName: 'gatsby build'

Installing and running a compression tool named brotli, to compress the built files before deployment:

- script: |
cd src/ClientApp/public
sudo apt-get install brotli --install-suggests --no-install-recommends -q --assume-yes
for f in $(find . -type f \( -iname '*.html' -o -iname '*.map' -o -iname '*.js' -o -iname
'*.json' \)); do brotli $f -Z -j -f -v && mv ${f}.br $f; done
displayName: 'enable compression at origin level'

Computing the version of the current build for cache management,


Publishing the built files for use by the deploy stage:

- task: PublishPipelineArtifact@1
inputs:
targetPath: 'src/ClientApp/public'
artifactName: 'drop'

A successful completion of the build stage tears down the Ubuntu environment, and triggers the deploy stage in
the pipeline.
Deploy stage
The deploy stage runs in a new Ubuntu container:

- stage: Deploy
jobs:
- deployment: WebsiteDeploy
displayName: Deploy Fabrikam Drone Status app
pool:
vmImage: 'Ubuntu-16.04'
environment: 'fabrikamdronestatus-prod'
strategy:
runOnce:
deploy:
steps:
This stage includes various deployment tasks and scripts to:
Download the build artifacts to the container (which happens automatically as a consequence of using
PublishPipelineArtifact in the build stage),
Record the build release version, and update it in the GitHub repository,
Upload the website files to Blob Storage, in a new folder corresponding to the new version, and
Change the CDN to point to this new folder.
The last two steps together replicate a cache purge, since older folders are no longer accessible by the CDN edge
servers. The following snippet shows how this is achieved:

- script: |
    az login --service-principal -u $(azureArmClientId) -p $(azureArmClientSecret) --tenant $(azureArmTenantId)
    # upload content to container versioned folder
    az storage blob upload-batch -s "$(Pipeline.Workspace)/drop" --destination "\$web\$(releaseSemVer)" --account-name $(azureStorageAccountName) --content-encoding br --pattern "*.html" --content-type "text/html"
    az storage blob upload-batch -s "$(Pipeline.Workspace)/drop" --destination "\$web\$(releaseSemVer)" --account-name $(azureStorageAccountName) --content-encoding br --pattern "*.js" --content-type "application/javascript"
    az storage blob upload-batch -s "$(Pipeline.Workspace)/drop" --destination "\$web\$(releaseSemVer)" --account-name $(azureStorageAccountName) --content-encoding br --pattern "*.js.map" --content-type "application/octet-stream"
    az storage blob upload-batch -s "$(Pipeline.Workspace)/drop" --destination "\$web\$(releaseSemVer)" --account-name $(azureStorageAccountName) --content-encoding br --pattern "*.json" --content-type "application/json"
    az storage blob upload-batch -s "$(Pipeline.Workspace)/drop" --destination "\$web\$(releaseSemVer)" --account-name $(azureStorageAccountName) --pattern "*.txt" --content-type "text/plain"
    # target new version
    az cdn endpoint update --resource-group $(azureResourceGroup) --profile-name $(azureCdnName) --name $(azureCdnName) --origin-path '/$(releaseSemVer)'
    AZURE_CDN_ENDPOINT_HOSTNAME=$(az cdn endpoint show --resource-group $(azureResourceGroup) --name $(azureCdnName) --profile-name $(azureCdnName) --query hostName -o tsv)
    echo "Azure CDN endpoint host ${AZURE_CDN_ENDPOINT_HOSTNAME}"
    echo '##vso[task.setvariable variable=azureCndEndpointHost]'$AZURE_CDN_ENDPOINT_HOSTNAME
  displayName: 'upload to Azure Storage static website hosting and purge Azure CDN endpoint'

Atomic deploys
Atomic deployment ensures that the users of your website or application always get the content corresponding to
the same version.
In the sample CI/CD pipeline, the website contents are deployed to Blob storage, which acts as the origin server for Azure CDN. If the files were updated in place in the same root folder of the blob container, the website would be served inconsistently during the upload. Uploading to a new versioned folder, as shown in the preceding section, solves this problem. Users get either all or nothing of the new build, because the CDN points to the new folder as the origin only after all files are successfully uploaded.
The advantages of this approach are as follows:
Because new content is not available to users until the CDN points to the new origin folder, the deployment is atomic.
You can easily roll back to an older version of the website if necessary, as shown in the sketch after this list.
Because the origin can host multiple versions of the website side by side, you can fine-tune the deployment by using techniques such as letting certain users preview a version before wider availability.
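A rollback can reuse the same CDN command shown in the deploy stage, pointing the origin path back at an earlier versioned folder that is still present in Blob storage. A hedged sketch; the resource names and version are placeholders (in the sample pipeline they come from variables such as $(azureCdnName) and $(releaseSemVer)):

# Roll back by repointing the CDN origin path at a previously deployed version.
az cdn endpoint update \
  --resource-group my-rg \
  --profile-name my-cdn-profile \
  --name my-cdn-endpoint \
  --origin-path '/1.4.0'   # an earlier versioned folder still present in the origin
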

Host and distribute using the cloud


A content delivery network (CDN) is a set of distributed servers that speeds up content delivery to users over a vast geographical area. Every user gets the content from the server nearest to them. The CDN accesses this content from an origin server, and caches it at edge servers in strategic locations. The sample CI/CD pipeline in this article uses Azure CDN, pointing to website content hosted on Azure Blob Storage as the origin server. The Blob storage is
configured for static website hosting. For a quick guide on how to use Azure CDN with Azure Blob Storage, read
Integrate an Azure storage account with Azure CDN.
The following are some strategies that can improve the CDN performance.
Compress the content
Compressing the files before serving improves file transfer speed and increases page-load performance for the
end users.
There are two ways to do this:
1. On the fly in the CDN edge servers. This reduces bandwidth consumption by a significant percentage, making the website considerably faster than without compression. It is relatively easy to enable this type of compression on Azure CDN; a sketch follows at the end of this section. Because compression is controlled by the CDN, you cannot choose or configure the compression tool. Use this type of compression if website performance is not a critical concern.
2. Pre-compressing before reaching the CDN, either on your origin server or before the content reaches the origin. Pre-compressing further improves the runtime performance of your website, because it's done before the content is fetched by the CDN. The sample CI/CD pipeline compresses the built files with Brotli, as shown in the build stage above. The advantages of using pre-compression are as follows:
a. You can use any compression tool that you're comfortable with, and further fine-tune the compression. CDNs may have limitations on the tools they use.
b. Some CDNs limit the file sizes that can be compressed on the fly, causing a performance hit for larger files. Pre-compression sets no limit on file size.
c. Pre-compressing in the deployment pipeline results in less storage being required at the origin.
d. Pre-compression is faster and more efficient than CDN compression.
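If you choose on-the-fly compression (option 1), it can be enabled from the Azure CLI as well as the portal. This is a hedged sketch rather than part of the sample pipeline; the resource names are placeholders, and you should verify the flags against your CLI version:

# Enable compression at the CDN edge for selected MIME types.
az cdn endpoint update \
  --resource-group my-rg \
  --profile-name my-cdn-profile \
  --name my-cdn-endpoint \
  --enable-compression true \
  --content-types-to-compress "text/html" "application/javascript" "application/json"
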
For more information, see Improve performance by compressing files in Azure CDN.
Dynamic site acceleration
Using a CDN is optimal for static content, which can be safely cached at the edge servers. However, dynamic websites require the server to generate content in response to user interaction. This type of content cannot be cached on the
edge, requiring a more involved end-to-end solution that can speed up the content delivery. Dynamic site
acceleration by Azure CDN is one such solution that measurably improves the performance of dynamic web pages.
This sample enables dynamic site acceleration as shown in this section of the readme.

Manage website cache


CDNs use caching to improve their performance. Configuring and managing this cache becomes an integral part of
your deployment pipeline. The Azure CDN documentation shows several ways to do this. You can set caching rules
on your CDN, as well as configure the time-to-live for the content in the origin server. Static web content can be
cached for a long duration, since it may not change too much over time. This reduces the overhead of accessing the
single origin server for every user request. For more information, see How caching works.
Cache purge
The CDN caches the website files at the edge servers until their time-to-live expires. These are updated for a new
client request, only after their time-to-live has expired. Purging your CDN cache is required to guarantee that every
user gets the latest live website files, especially if the deployment happens at the same CDN origin folder. Azure
CDN allows you to purge the cache from the Azure portal.
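The same purge can be run from the Azure CLI, for example as part of a pipeline step. A hedged sketch; resource names are placeholders:

# Purge all cached content from the CDN endpoint.
az cdn endpoint purge \
  --resource-group my-rg \
  --profile-name my-cdn-profile \
  --name my-cdn-endpoint \
  --content-paths '/*'
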
A better approach is to invalidate the cache by using versioning during deployment. An explicit cache purge or expiration usually causes the CDN to retrieve newer versions of the web content. With versioning, however, the CDN always points to the latest deployment, which improves caching in the following manner:
1. The CDN validates the index.html against the origin for every new website instance.
2. Except for index.html and 404.html, all other website files are fingerprinted and cached for a year. This is based on the assumption that resources such as images and videos do not need frequent changes. If these files are updated and rebuilt, their names get a new fingerprint GUID. This results in an update to the index.html with an updated reference to the changed resource file. The CDN then retrieves the updated index.html, and since it does not find the referenced resource file in its cache, it also retrieves the changed resource files.
This ensures that the CDN always gets the newly updated files, and removes the need to purge the cache for a new build.

Next steps
Now that you understand the basics, follow this readme to set up and execute the CI/CD pipeline.
Learn best practices for using content delivery networks (CDNs)
Serverless Functions app operations
11/2/2020 • 3 minutes to read

This article describes Azure operations considerations for serverless Functions applications. To support Functions
apps, operations personnel need to:
Understand and implement hosting configurations.
Future-proof scalability by automating infrastructure provisioning.
Maintain business continuity by meeting availability and disaster recovery requirements.

Planning
To plan operations, understand your workloads and their requirements, then design and configure the best options
for the requirements.
Choose a hosting option
The Azure Functions Runtime provides flexibility in hosting. Use the hosting plan comparison table to determine
the best choice for your requirements.
Azure Functions hosting plans
Each Azure Functions project deploys and runs in its own Functions app, which is the unit of scale and cost.
The three hosting plans available for Azure Functions are the Consumption plan, Premium plan, and
Dedicated (App Service) plan. The hosting plan determines scaling behavior, available resources, and
support for advanced features like virtual network connectivity.
Azure Kubernetes Service (AKS)
Kubernetes-based Functions provides the Functions Runtime in a Docker container with event-driven
scaling through Kubernetes-based Event Driven Autoscaling (KEDA).
For more information about hosting plans, see:
Azure Functions scale and hosting
Consumption plan
Premium plan
Dedicated (App Service) plan
Azure Functions on Kubernetes with KEDA
Azure subscription and service limits, quotas, and constraints
Understand scaling
The serverless Consumption and Premium hosting plans scale automatically, adding and removing Azure
Functions host instances based on the number of incoming events. Scaling can vary on several dimensions, and
behave differently based on plan, trigger, and code language.
For more information about scaling, see:
Understand scaling behaviors
Scalability best practices
Understand and address cold starts
If the number of host instances scales down to zero, the next request has the added latency of restarting the
Function app, called a cold start. Cold start is a large discussion point for serverless architectures, and a point of
ambiguity for Azure Functions.
The Premium hosting plan prevents cold starts by keeping some instances warm. Reducing dependencies and
using asynchronous operations in the Functions app also minimizes the impact of cold starts. However, availability
requirements may require running the app in a Dedicated hosting plan with Always On enabled. The Dedicated plan uses dedicated virtual machines (VMs), so it is not serverless.
For more information about cold start, see Understanding serverless cold start.
Identify storage considerations
Every Azure Functions app relies on Azure Storage for operations such as managing triggers and logging function
executions. When creating a Functions app, you must create or link to a general-purpose Azure Storage account
that supports Blob, Queue, and Table storage. For more information, see Storage considerations for Azure
Functions.
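The following is a hedged Azure CLI sketch of creating the general-purpose storage account a Functions app needs and then a Consumption-plan function app linked to it. All names, the region, and the runtime are placeholders, not values from the article:

az group create --name my-func-rg --location westus2
az storage account create --name myfuncstorage123 --resource-group my-func-rg \
  --location westus2 --sku Standard_LRS --kind StorageV2
az functionapp create --name my-func-app --resource-group my-func-rg \
  --storage-account myfuncstorage123 --consumption-plan-location westus2 \
  --runtime node --functions-version 3
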
Identify network design considerations
Networking options let the Functions app restrict access, or access resources without using internet-routable
addresses. The hosting plans offer different levels of network isolation. Choose the option that best meets your
network isolation requirements. For more information, see Azure Functions networking options.

Production
To prepare the application for production, make sure you can easily redeploy the hosting plan, and apply scale-out
rules.
Automate hosting plan provisioning
With infrastructure as code, you can automate infrastructure provisioning. Automatic provisioning provides more
resiliency during disasters, and more agility to quickly redeploy the infrastructure as needed.
For more information on automated provisioning, see:
Automate resource deployment for your function app in Azure Functions
Terraform - Manages a Function App
Configure scale out options
Autoscale provides the right amount of running resources to handle application load. Autoscale adds resources to
handle increases in load, and saves money by removing resources that are idle.
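For example, on a Premium (Elastic Premium) plan you can bound scale-out with always-ready instances and a maximum burst. A hedged Azure CLI sketch, with placeholder names and values:

# --min-instances keeps warm, always-ready instances; --max-burst caps scale-out.
az functionapp plan create \
  --name my-premium-plan \
  --resource-group my-func-rg \
  --location westus2 \
  --sku EP1 \
  --min-instances 1 \
  --max-burst 10
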
For more information about autoscale options, see:
Premium Plan settings
App Service Plan settings

Optimization
When the application is in production, make sure that:
The hosting plan can scale to meet application demands.
There's a plan for business continuity, availability, and disaster recovery.
You can monitor hosting and application health and receive alerts.
Implement availability requirements
Azure Functions run in a specific region. To get higher availability, you can deploy the same Functions app to
multiple regions. In multiple regions, Functions can run in the active-active or active-passive availability pattern.
For more information about Azure Functions availability and disaster recovery, see:
Azure Functions geo-disaster recovery
Disaster recovery and geo-distribution in Azure Durable Functions
Monitoring, logging, and alerting
Application Insights and logs in Azure Monitor automatically collect log, performance, and error data and detect
performance anomalies. Azure Monitor includes powerful analytics tools to help diagnose issues and understand
function use. Application Insights helps you continuously improve performance and usability.
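One way to wire this up is from the Azure CLI. A hedged sketch, assuming the application-insights CLI extension and placeholder resource names:

az extension add --name application-insights
az monitor app-insights component create \
  --app my-func-insights --location westus2 --resource-group my-func-rg
APPINSIGHTS_KEY=$(az monitor app-insights component show \
  --app my-func-insights --resource-group my-func-rg \
  --query instrumentationKey -o tsv)
# Point the function app at the Application Insights resource.
az functionapp config appsettings set \
  --name my-func-app --resource-group my-func-rg \
  --settings "APPINSIGHTS_INSTRUMENTATIONKEY=$APPINSIGHTS_KEY"
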
For more information about monitoring and analyzing Azure Functions performance, see:
Monitor Azure Functions
Monitor Azure Functions with Azure Monitor logs
Application Insights for Azure Functions supported features

Next steps
Serverless application development and deployment
Azure Functions app security
Serverless Functions security
11/2/2020 • 5 minutes to read

This article describes Azure services and activities security personnel can implement for serverless Functions.
These guidelines and resources help develop secure code and deploy secure applications to the cloud.

Planning
The primary goals of a secure serverless Azure Functions application environment are to protect running
applications, quickly identify and address security issues, and prevent future similar issues.
The OWASP Serverless Top 10 describes the most common serverless application security vulnerabilities, and
provides basic techniques to identify and protect against them.
In many ways, planning for secure development, deployment, and operation of serverless functions is much the
same as for any web-based or cloud hosted application. Azure App Service provides the hosting infrastructure for
your function apps. The Securing Azure Functions article provides security strategies for running your function code, and explains how App Service can help you secure your functions.
For more information about Azure security, best practices, and shared responsibilities, see:
Security in Azure App Service
Built-in security controls
Secure development best practices on Azure.
Security best practices for Azure solutions (PDF report)
Shared responsibilities for cloud computing (PDF report)

Deployment
To prepare serverless Functions applications for production, security personnel should:
Conduct regular code reviews to identify code and library vulnerabilities.
Define resource permissions that Functions needs to execute.
Configure network security rules for inbound and outbound communication.
Identify and classify sensitive data access.
The Azure Security Baseline for Azure Functions article contains more recommendations that will help you
improve the security posture of your deployment.
Keep code secure
Find security vulnerabilities and errors in code and manage security vulnerabilities in projects and dependencies.
For more information, see:
GitHub - Finding security vulnerabilities and errors in your code
GitHub - Managing security vulnerabilities in your project
GitHub - Managing vulnerabilities in your project's dependencies
Perform input validation
Different event sources like Blob storage, Cosmos DB NoSQL databases, event hubs, queues, or Graph events can
trigger serverless Functions. Injections aren't strictly limited to inputs coming directly from the API calls. Functions
may consume other input from the possible event sources.
In general, don't trust input or make any assumptions about its validity. Always use safe APIs that sanitize or
validate the input. If possible, use APIs that bind or parameterize variables, like using prepared statements for SQL
queries.
For more information, see:
Azure Functions Input Validation with FluentValidation
Security Frame: Input Validation Mitigations
HTTP Trigger Function Request Validation
How to validate request for Azure Functions
Secure HTTP endpoints for development, testing, and production
Azure Functions lets you use keys to make it harder to access your HTTP function endpoints. To fully secure your
function endpoints in production, consider implementing one of the following Function app-level security options:
Turn on App Service authentication and authorization for your Functions app. See Authorization keys.
Use Azure API Management (APIM) to authenticate requests. See Import an Azure Function App as an API in
Azure API Management.
Deploy your Functions app to an Azure App Service Environment (ASE).
Use an App Service Plan that restricts access, and implement Azure Front Door + WAF to handle your incoming
requests. See Create a Front Door for a highly available global web application.
For more information, see Secure an HTTP endpoint in production.
Set up role-based access control (RBAC)
Azure role-based access control (RBAC) has several built-in Azure roles that you can assign to users, groups,
service principals, and managed identities to control access to Azure resources. If the built-in roles don't meet your
organization's needs, you can create your own Azure custom roles.
Review each Functions app before deployment to identify excessive permissions. Carefully examine functions to
apply "least privilege" permissions, giving each function only what it needs to successfully execute.
Use RBAC to assign permissions to users, groups, and applications at a certain scope. The scope of a role
assignment can be a subscription, a resource group, or a single resource. Avoid using wildcards whenever
possible.
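For example, the following is a hedged sketch of a narrowly scoped role assignment; the principal ID, role, and scope are placeholders for illustration:

# Grant a built-in data role at the scope of a single storage account.
az role assignment create \
  --assignee <principal-object-id> \
  --role "Storage Blob Data Reader" \
  --scope "/subscriptions/<sub-id>/resourceGroups/my-rg/providers/Microsoft.Storage/storageAccounts/mystorageacct"
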
For more information about RBAC, see:
What is role-based access control (RBAC) for Azure resources?
Azure built-in roles
Azure role-based access control (RBAC)
Azure custom roles
Use managed identities and key vaults
A common challenge when building cloud applications is how to manage credentials for authenticating to cloud
services in your code. Credentials should never appear in application code, developer workstations, or source
control. Instead, use a key vault to store and retrieve keys and credentials. Azure Key Vault provides a way to
securely store credentials, secrets, and other keys. The code authenticates to Key Vault to retrieve the credentials.
For more information, see Use Key Vault references for App Service and Azure Functions.
Managed identities let Functions apps access resources like key vaults and storage accounts without requiring
specific access keys or connection strings. A full audit trail in the logs displays which identities execute requests to
resources. Use RBAC and managed identities to granularly control exactly what resources Azure Functions
applications can access.
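A hedged Azure CLI sketch of the pattern described above, with placeholder names and secret URI: enable a system-assigned identity on the function app, allow it to read secrets from a key vault, and reference a secret from app settings instead of storing it in code or configuration.

# Enable the identity and capture its principal ID.
PRINCIPAL_ID=$(az functionapp identity assign \
  --name my-func-app --resource-group my-func-rg --query principalId -o tsv)
# Let the identity read secrets from the vault.
az keyvault set-policy --name my-key-vault \
  --object-id "$PRINCIPAL_ID" --secret-permissions get list
# Reference the secret from an app setting (Key Vault reference syntax).
az functionapp config appsettings set \
  --name my-func-app --resource-group my-func-rg \
  --settings "DbConnection=@Microsoft.KeyVault(SecretUri=https://my-key-vault.vault.azure.net/secrets/DbConnection/)"
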
For more information, see:
What are managed identities for Azure resources?
How to use managed identities for App Service and Azure Functions
Use shared access signature (SAS) tokens to limit access to resources
A shared access signature (SAS) provides secure delegated access to resources in your storage account, without
compromising the security of your data. With a SAS, you have granular control over how a client can access your
data. You can control what resources the client may access, what permissions they have on those resources, and
how long the SAS is valid, among other parameters.
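For example, the following is a hedged sketch of issuing a short-lived, read-only SAS for a single container; the account, container, and expiry values are placeholders, and the command assumes the account key or connection string is available in the environment:

az storage container generate-sas \
  --account-name mystorageacct \
  --name static-assets \
  --permissions r \
  --expiry 2021-01-01T00:00Z \
  --https-only \
  --output tsv
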
For more information, see Grant limited access to Azure Storage resources using shared access signatures (SAS).
Secure Blob storage
Identify and classify sensitive data, and minimize sensitive data storage to only what is necessary. For sensitive
data storage, add multi-factor authentication and data encryption in transit and at rest. Grant limited access to
Azure Storage resources using SAS tokens.
For more information, see Security recommendations for Blob storage.

Optimization
Once an application is in production, security personnel can help optimize workflow and prepare for scaling.
Use Azure Security Center and apply security recommendations
Azure Security Center is a security scanning solution for your application that identifies potential security
vulnerabilities and creates recommendations. The recommendations guide you to configure needed controls to
harden and protect your resources.
For more information, see:
Protect your applications with Azure Security Center
Security Center app recommendations
Enforce application governance policies
Apply centralized, consistent enforcements and safeguards to your application at scale. For more information, see
Azure Policy built-in policy definitions.

Next steps
Serverless application development and deployment
Azure Functions app operations
Big data analytics with Azure Data Explorer
11/2/2020 • 2 minutes to read

This architecture pattern demonstrates big data analytics over large volumes of high-velocity data from various
sources. This pattern illustrates how Azure Data Explorer and Azure Synapse Analytics complement each other for
near real-time analytics and modern data warehousing use cases.
This pattern is already being used by Microsoft customers. For example, the Singapore-based ride-hailing
company, Grab, implemented real-time analytics over a huge amount of data collected from their taxi and food
delivery services as well as merchant partner apps. The team from Grab presented their solution at MS Ignite in
this video (20:30 onwards). Using this pattern, Grab processed more than a trillion events per day.

Data flow
1. Raw structured, semi-structured, and unstructured (free text) data such as any type of logs, business events,
and user activities can be ingested into Azure Data Explorer from various sources.
2. Ingest data into Azure Data Explorer with low-latency and high throughput using its connectors for Azure Data
Factory, Azure Event Hub, Azure IoT Hub, Kafka, and so on. Alternatively, ingest data through Azure Storage
(Blob or ADLS Gen2), which uses Azure Event Grid and triggers the ingestion pipeline to Azure Data Explorer.
You can also continuously export data to Azure Storage in compressed, partitioned parquet format and
seamlessly query that data as detailed in the Continuous data export overview.
3. Export pre-aggregated data from Azure Data Explorer to Azure Storage, and then ingest the data into Synapse
Analytics to build data models and reports.
4. Use Azure Data Explorer’s native capabilities to process, aggregate, and analyze data. To get insights at lightning speed, build near real-time analytics dashboards using Azure Data Explorer dashboards, Power BI,
Grafana, or other tools. Use Azure Synapse Analytics to build a modern data warehouse and combine it with
the Azure Data Explorer data to generate BI reports on curated and aggregated data models.
5. Azure Data Explorer provides native advanced analytics capabilities for time series analysis, pattern recognition,
anomaly detection and forecasting, and machine learning. Azure Data Explorer is also well integrated with ML
services such as Databricks and Azure Machine Learning. This integration allows you to build models using
other tools and services and export ML models to Azure Data Explorer for scoring data.

Components
Azure Event Hub: Fully managed, real-time data ingestion service that’s simple, trusted, and scalable.
Azure IoT Hub: Managed service to enable bi-directional communication between IoT devices and Azure.
Kafka on HDInsight: Easy, cost-effective, enterprise-grade service for open source analytics with Apache Kafka.
Azure Data Explorer: Fast, fully managed and highly scalable data analytics service for real-time analysis on
large volumes of data streaming from applications, websites, IoT devices, and more.
Azure Data Explorer Dashboards: Natively export Kusto queries that were explored in the Web UI to optimized
dashboards.
Azure Synapse Analytics: Analytics service that brings together enterprise data warehousing and Big Data
analytics.

Next steps
For more information, see Azure Data Explorer documentation.
Design a CI/CD pipeline using Azure DevOps
11/2/2020 • 5 minutes to read

This scenario provides architecture and design guidance for building a continuous integration (CI) and continuous
deployment (CD) pipeline. In this example, the CI/CD pipeline deploys a two-tier .NET web application to the Azure
App Service.
Migrating to modern CI/CD processes provides many benefits for application builds, deployments, testing, and
monitoring. By using Azure DevOps along with other services such as App Service, organizations can focus on the
development of their apps rather than the management of the supporting infrastructure.

Relevant use cases


Consider Azure DevOps and CI/CD processes for:
Accelerating application development and development lifecycles.
Building quality and consistency into an automated build and release process.
Increasing application stability and uptime.

Architecture

The data flows through the scenario as follows:


1. A developer changes application source code.
2. Application code including the web.config file is committed to the source code repository in Azure Repos.
3. Continuous integration triggers application build and unit tests using Azure Test Plans.
4. Continuous deployment within Azure Pipelines triggers an automated deployment of application artifacts with
environment-specific configuration values.
5. The artifacts are deployed to Azure App Service.
6. Azure Application Insights collects and analyzes health, performance, and usage data.
7. Developers monitor and manage health, performance, and usage information.
8. Backlog information is used to prioritize new features and bug fixes using Azure Boards.
Components
Azure DevOps is a service for managing your development lifecycle end-to-end—from planning and project
management, to code management, and continuing to build and release.
Azure Web Apps is a PaaS service for hosting web applications, REST APIs, and mobile back ends. While this
article focuses on .NET, there are several additional development platform options supported.
Application Insights is a first-party, extensible Application Performance Management (APM) service for web
developers on multiple platforms.
Alternatives
While this article focuses on Azure DevOps, Azure DevOps Server (previously known as Team Foundation Server)
could be used as an on-premises substitute. Alternatively, you could also use a set of technologies for an open-
source development pipeline using Jenkins.
From an infrastructure-as-code perspective, Resource Manager templates were used as part of the Azure DevOps
project, but you could consider other management technologies such as Terraform or Chef. If you prefer an
infrastructure-as-a-service (IaaS)-based deployment and require configuration management, you could consider
either Azure Automation State Configuration, Ansible, or Chef.
You could consider these alternatives to hosting in Azure Web Apps:
Azure Virtual Machines handles workloads that require a high degree of control, or depend on OS
components and services that are not possible with Web Apps (for example, the Windows GAC, or COM).
Service Fabric is a good option if the workload architecture is focused around distributed components that
benefit from being deployed and run across a cluster with a high degree of control. Service Fabric can also
be used to host containers.
Azure Functions provides an effective serverless approach if the workload architecture is centered around
fine grained distributed components, requiring minimal dependencies, where individual components are
only required to run on demand (not continuously) and orchestration of components is not required.
This decision tree for Azure compute services may help when choosing the right path to take for a migration.

Management and Security Considerations


Consider leveraging one of the tokenization tasks available in the VSTS marketplace.
Azure Key Vault tasks can download secrets from an Azure Key Vault into your release. You can then use
those secrets as variables in your release definition, which avoids storing them in source control.
Use release variables in your release definitions to drive configuration changes of your environments.
Release variables can be scoped to an entire release or a given environment. When using variables for secret
information, ensure that you select the padlock icon.
Deployment gates should be used in your release pipeline. This lets you leverage monitoring data in
association with external systems (for example, incident management or additional bespoke systems) to
determine whether a release should be promoted.
Where manual intervention in a release pipeline is required, use the approvals functionality.
Consider using Application Insights and additional monitoring tools as early as possible in your release
pipeline. Many organizations only begin monitoring in their production environment. By monitoring your
other environments, you can identify bugs earlier in the development process and avoid issues in your
production environment.

Deploy the scenario


Prerequisites
You must have an existing Azure account. If you don't have an Azure subscription, create a free account
before you begin.
You must sign up for an Azure DevOps organization. For more information, see Quickstart: Create your
organization.
Walk-through
Azure DevOps Projects deploys an App Service plan, an App Service instance, and an Application Insights resource for you, and also configures an Azure Pipelines pipeline.
Once you've configured a pipeline with Azure DevOps Projects and the build completes, review the associated code changes, work items, and test results. You will notice that no test results are displayed, because the code does
not contain any tests to run.
The pipeline creates a release definition and a continuous deployment trigger, deploying our application into the
Dev environment. As part of a continuous deployment process, you may see releases that span multiple
environments. A release can span both infrastructure (using techniques such as infrastructure-as-code), and can
also deploy the application packages required along with any post-configuration tasks.

Pricing
Azure DevOps costs depend on the number of users in your organization that require access, along with other
factors like the number of concurrent build/releases required and number of test users. For more information, see
Azure DevOps pricing.
This pricing calculator provides an estimate for running Azure DevOps with 20 users.
Azure DevOps is billed on a per-user, per-month basis. There may be additional charges depending on the number of concurrent pipelines needed, as well as for extra test users or user Basic licenses.

Related resources
Review the following resources to learn more about CI/CD and Azure DevOps:
What is DevOps?
DevOps at Microsoft - How we work with Azure DevOps
Step-by-step Tutorials: DevOps with Azure DevOps
DevOps Checklist
Create a CI/CD pipeline for .NET with Azure DevOps Projects
Event-based cloud automation on Azure
11/2/2020 • 14 minutes to read

Automating workflows and repetitive tasks on the cloud by using serverless technologies can dramatically improve the productivity of an organization's DevOps team. A serverless model is best suited for automation scenarios that fit an event-driven approach. This reference architecture illustrates two such cloud automation scenarios:
1. Cost center tagging: This implementation tracks the cost centers of each Azure resource. The Azure Policy
service tags all new resources in a group with a default cost center ID. The Event Grid monitors resource
creation events, and then calls an Azure function. The function interacts with Azure Active Directory, and
validates the cost center ID for the new resource. If different, it updates the tag and sends out an email to
the resource owner. The REST queries for Azure Active Directory are mocked out for simplicity. Azure AD
can also be integrated using the Azure AD PowerShell module or the ADAL for Python library.
2. Throttling response: This implementation monitors a Cosmos DB database for throttling. Azure Monitor alerts are triggered when data access requests to Cosmos DB exceed the capacity in Request Units (or RUs). An Azure Monitor action group is configured to call the automation function in response to these alerts. The function scales the RUs to a higher value, increasing the capacity and in turn stopping the alerts; a sketch of the equivalent scaling call follows this list. Note that the Cosmos DB autopilot mode (preview) is an alternative to this implementation.
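The following is a hedged Azure CLI sketch of the kind of scaling operation the throttling-response function performs; the sample itself implements this in PowerShell and Python, and the account, database, and container names here are placeholders:

# Raise provisioned RU/s on a Cosmos DB SQL container to relieve throttling.
az cosmosdb sql container throughput update \
  --resource-group my-rg \
  --account-name my-cosmos-account \
  --database-name my-database \
  --name my-container \
  --throughput 1000
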

The reference implementations for this architecture are available on GitHub.


The functions in these implementations are written in PowerShell and Python, two of the most common scripting
languages used in automation. They are deployed by using the Azure Functions Core Tools and the Azure CLI. Alternatively, you can try a preview version of the PowerShell cmdlets to deploy and manage Azure Functions.

Patterns in event-based automation


Event-based automation scenarios are best implemented using Azure Functions. They follow these common
patterns:
Respond to events on resources. These are responses to events such as an Azure resource or resource
group getting created, deleted, changed, and so on. This pattern uses Event Grid to trigger the function for
such events. The cost center tagging implementation is an example of this pattern. Other common
scenarios include:
granting the DevOps teams access to newly created resource groups,
sending notification to the DevOps when a resource is deleted, and
responding to maintenance events for resources such as Azure Virtual Machines (VMs).
Scheduled tasks. These are typically maintenance tasks executed using timer-triggered functions.
Examples of this pattern are:
stopping a VM at night, and starting in the morning,
reading Blob Storage content at regular intervals, and converting to a Cosmos DB document,
periodically scanning for resources no longer in use, and removing them, and
automated backups.
Process Azure alerts. This pattern leverages the ease of integrating Azure Monitor alerts and action
groups with Azure Functions. The function typically takes remedial actions in response to metrics, log
analytics, and alerts originating in the applications as well as the infrastructure. The throttling response
implementation is an example of this pattern. Other common scenarios are:
truncating the table when SQL Database reaches maximum size,
restarting a service in a VM when it is erroneously stopped, and
sending notifications if a function is failing.
Orchestrate with external systems. This pattern enables integration with external systems, using Logic
Apps to orchestrate the workflow. Logic Apps connectors can easily integrate with several third-party
services as well as Microsoft services such as Microsoft 365. Azure Functions can be used for the actual
automation. The cost center tagging implementation demonstrates this pattern. Other common scenarios
include:
monitoring IT processes such as change requests or approvals, and
sending email notification when automation task is completed.
Expose as a web hook or API. Automation tasks using Azure Functions can be integrated into third-party applications or even command-line tools, by exposing the function as a web hook/API using an HTTP trigger. Multiple authentication methods are available in both PowerShell and Python to secure external access to the function. The automation happens in response to app-specific external events, for example, integration with Power Apps or GitHub. Common scenarios include:
triggering automation for a failing service, and
onboarding users to the organization's resources.
Create ChatOps interface. This pattern enables customers to create a chat-based operational interface,
and run development and operations functions and commands in-line with human collaboration. This can
integrate with the Azure Bot Framework and use Microsoft Teams or Slack commands for deployment,
monitoring, common questions, and so on. A ChatOps interface creates a real-time system for managing
production incidents, with each step documented automatically on the chat. Read How ChatOps can help
you DevOps better for more information.
Hybrid automation. This pattern uses the Azure App Service Hybrid Connections to install a software
component on your local machine. This component allows secure access to resources on that machine. The
ability to manage hybrid environments is currently available on Windows-based systems using PowerShell
functions. Common scenarios include:
managing your on-premises machines, and
managing other systems behind the firewall (for example, an on-premises SQL Server) through a jump
server.

Architecture
The architecture consists of the following blocks:
Azure Functions. Azure Functions provide the event-driven, serverless compute capabilities in this architecture. A function performs automation tasks when triggered by events or alerts. In the reference implementations, a function is invoked with an HTTP request. Code complexity should be minimized by developing functions that are stateless and idempotent.
Multiple executions of an idempotent function create the same results. To maintain idempotency, the function
scaling in the throttling scenario is kept simplistic. In real world automation, make sure to scale up or down
appropriately.
Read the Optimize the performance and reliability of Azure Functions for best practices when writing your
functions.
Logic Apps. Logic Apps can be used to perform simpler tasks, easily implemented using the built-in connectors.
These tasks can range from email notifications, to integrating with external management applications. To learn
how to use Logic Apps with third-party applications, read basic enterprise integration in Azure.
Logic Apps provides a no code or low code visual designer, and may be used alone in some automation scenarios.
Read this comparison between Azure Functions and Logic Apps to see which service can fit your scenario.
Event Grid. Event Grid has built-in support for events from other Azure services, as well as custom events (also
called custom topics). Operational events such as resource creation can be easily propagated to the automation
function, using the Event Grid's built-in mechanism.
Event Grid simplifies the event-based automation with a publish-subscribe model, allowing reliable automation for
events delivered over HTTP.
Azure Monitor. Azure Monitor alerts can monitor for critical conditions, and take corrective action using Azure
Monitor action groups. These action groups are easily integrated with Azure Functions. This is useful to watch for
and fix any error conditions in your infrastructure, such as database throttling.
Automation action. This broad block represents other services that your function can interact with to provide the automation functionality. For example, Azure Active Directory for tag validation as in the first scenario, or a
database to provision as in the second scenario.

Resiliency considerations
Azure Functions
Handle HTTP timeouts
To avoid HTTP timeouts for a longer automation task, queue this event in a Service Bus, and handle the actual
automation in another function. The throttling response automation scenario illustrates this pattern, even though
the actual Cosmos DB RU provisioning is fast.

Durable functions, which maintain state between invocations, provide an alternative to the above approach. These
are currently supported only in JavaScript and C#.
Log failures
As a best practice, the function should log any failures in carrying out automation tasks. This allows for proper
troubleshooting of the error conditions. The reference implementations use the Application Insights as the
telemetry system.
Concurrency
Verify the concurrency requirement for your automation function. Concurrency is limited by setting the variable
maxConcurrentRequests in the file host.json. This setting limits the number of concurrent function instances
running in your function app. Since every instance consumes CPU and memory, this value needs to be adjusted
for CPU-intensive operations. Lower the maxConcurrentRequests if your function calls appear to be too slow or
aren't able to complete. See the section Configure host behaviors to better handle concurrency for more details.
Idempotency
Make sure your automation function is idempotent. Both Azure Monitor and Event Grid may emit alerts or events that indicate progression, such as your subscribed event being resolved, fired, or in progress, or your resource being provisioned or created successfully, or may even send false alerts due to a misconfiguration. Make sure your function acts only on the relevant alerts and events, and ignores all others, so that false or misconfigured events do not cause unwanted results. For more information, read this blog post on idempotency patterns.
Event Grid
If your workflow uses Event Grid, check if your scenario could generate a high volume of events, enough to clog
the grid. See Event Grid message delivery and retry to understand how it handles events when delivery isn't
acknowledged, and modify your logic accordingly. The cost center workflow does not implement additional checks
for this, since it only watches for resource creation events in a resource group. Monitoring resources created in an entire subscription can generate a larger number of events, requiring more resilient handling.
Azure Monitor
If a sufficiently large number of alerts are generated, and the automation updates Azure resources, throttling limits
of the Azure Resource Manager might be reached. This can negatively affect the rest of the infrastructure in that
subscription. Avoid this situation by limiting the frequency of alerts getting generated by the Azure Monitor. You
may also limit the alerts getting generated for a particular error. Refer to the documentation on Azure Monitor
alerts for more information.

Security considerations
Control access to the function
Restrict access to an HTTP-triggered function by setting the authorization level. With anonymous authentication,
the function is easily accessible with a URL such as http://<APP_NAME>.azurewebsites.net/api/<FUNCTION_NAME> .
Function-level authentication can obfuscate the HTTP endpoint by requiring function keys in the URL. This level is set in the file function.json:

{
  "bindings": [
    {
      "authLevel": "function",
      "type": "httpTrigger",
      "direction": "in",
      "name": "Request",
      "methods": [
        "get",
        "post"
      ]
    },
    {
      "type": "http",
      "direction": "out",
      "name": "Response"
    }
  ]
}

For a production environment, additional strategies might be required to secure the function. In the reference
implementations, the functions are executed within the Azure platform by other Azure services, and will not be
exposed to the internet. Function authorization is sufficient for functions accessed as web hooks.
Consider adding security layers on top of function authentication, such as:
authenticating with client certificates, or
making sure the caller is part of or has access to the directory that hosts the function, by using Easy Auth
integration.
Note that function-level authentication is the only option available to Azure Monitor action groups.
If the function needs to be called from a third-party application or service, it is recommended to provide access to
it with an API Management layer. This layer should enforce authentication. API Management now has a
consumption tier integrated with Azure Functions, which allows you to pay only if the API gets executed. For more
information, read Create and expose your functions with OpenAPI.
If the calling service supports service endpoints, the following costlier options could be considered:
Use a dedicated App Service plan, where you can lock down the functions in a virtual network to limit access to
it. This is not possible in a consumption-based serverless model.
Use the Azure Functions Premium plan, which includes a dedicated virtual network to be used by your function
apps.
To compare pricing and features between these options, read Azure Functions scale and hosting.
Control what the function can access
Managed identities for Azure resources, an Azure Active Directory feature, simplifies how the function
authenticates and accesses other Azure resources and services. The code does not need to manage the
authentication credentials, since it is managed by Azure AD.
There are two types of managed identities:
System-assigned managed identities: These are created as part of the Azure resource, and cannot be
shared among multiple resources. These get deleted when the resource is deleted. Use these for scenarios,
which involve single Azure resource or which need independent identities. Both the reference
implementations use system-assigned identities since they update only a single resource. Managed
identities are only required to update another resource. For example, a function can read the resource tags
without a managed identity. See these instructions to add a system-assigned identity to your function.
User-assigned managed identities: These are created as stand-alone Azure resources. These can be
shared across multiple resources, and need to be explicitly deleted. Read these instructions on how to add
user-assigned identity to your function. Use these for scenarios that:
Require access to multiple resources that can share a single identity, or
Need pre-authorization to secure resources during provisioning, or
Have resources that are recycled frequently, while permissions need to be consistent.
Once the identity is assigned to the Azure function, assign it a role using role-based access control (RBAC) to
access the resources. For example, to update a resource, the Contributor role will need to be assigned to the
function identity.
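A hedged Azure CLI sketch of that assignment, similar to the earlier Key Vault example; the names and scope are placeholders, and the scope should be narrowed to the resources the function actually updates:

# Enable the system-assigned identity and capture its principal ID.
PRINCIPAL_ID=$(az functionapp identity assign \
  --name my-automation-func --resource-group my-automation-rg \
  --query principalId -o tsv)
# Grant Contributor so the function can update resources in the target resource group.
az role assignment create \
  --assignee "$PRINCIPAL_ID" \
  --role Contributor \
  --scope "/subscriptions/<sub-id>/resourceGroups/my-app-rg"
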

Cost considerations
Use the Azure pricing calculator to estimate costs. Here are some considerations for lowering cost.
Azure Logic Apps
Logic apps have a pay-as-you-go pricing model. Triggers, actions, and connector executions are metered each time
a logic app runs. All successful and unsuccessful actions, including triggers, are considered as executions.
Logic apps also have a fixed pricing model. If you need to run logic apps that communicate with secured resources
in an Azure virtual network, you can create them in an Integration Service Environment (ISE).
For details, see Pricing model for Azure Logic Apps.
In this architecture, logic apps are used in the cost center tagging scenario to orchestrate the workflow.
Built-in connectors are used to connect to Azure Functions and send an email notification when an automation task is completed. The functions are exposed as a web hook/API using an HTTP trigger. Logic apps are triggered only when an HTTPS request occurs. This is cost-effective compared to a design where functions continuously poll and check for certain criteria. Every polling request is metered as an action.
For more information, see Logic Apps pricing.
Azure Functions
Azure Functions are available with the following three pricing plans.
Consumption plan. This is the most cost-effective, serverless plan available, where you only pay for the
time your function runs. Under this plan, functions can run for up to 10 minutes at a time.
Premium plan. Consider using Azure Functions Premium plan for automation scenarios with additional
requirements, such as a dedicated virtual network, a longer execution duration, and so on. These functions
can run for up to an hour, and should be chosen for longer automation tasks such as running backups,
database indexing, or generating reports.
App Service plan. Hybrid automation scenarios that use the Azure App Service Hybrid Connections will
need to use the App Service plan. The functions created under this plan can run for unlimited duration,
similar to a web app.
In this architecture, Azure Functions is used for tasks such as updating tags in Azure Active Directory, or changing Cosmos DB configuration by scaling up the RUs to a higher value. The Consumption plan is appropriate for this use case because those tasks are interactive and not ongoing.
Azure Cosmos DB
Azure Cosmos DB bills for provisioned throughput and consumed storage by hour. Provisioned throughput is
expressed in Request Units per second (RU/s), which can be used for typical database operations, such as inserts,
reads. Storage is billed for each GB used for your stored data and index. See Cosmos DB pricing model for more
information.
In this architecture, when data access requests to Cosmos DB exceed the capacity in Request Units (or RUs), Azure
Monitor triggers alerts. In response to those alerts, an Azure Monitor action group is configured to call the
automation function. The function scales the RUs to a higher value. This helps to keep the cost down because you
only pay for the resources that your workloads need on a per-hour basis.
To get a quick cost estimate of your workload, use the Cosmos DB capacity calculator.
For more information, see the Cost section in Microsoft Azure Well-Architected Framework.

Deployment considerations
For critical automation workflows that manage the behavior of your application, zero-downtime deployment must be achieved using an efficient DevOps pipeline. For more information, read serverless backend deployment.
If the automation covers multiple applications, keep the resources required by the automation in a separate
resource group. A single resource group can be shared between automation and application resources, if the
automation covers a single application.
If the workflow involves a number of automation functions, group the functions catering to one scenario in a
single function app. Read Manage function app for more information.
As you deploy your application you will need to monitor it. Consider using Application Insights to enable the
developers to monitor performance and detect issues.
For more information, see the DevOps section in Microsoft Azure Well-Architected Framework.

Deploy the solution


To deploy the reference implementations for this architecture, see the deployment steps on GitHub for the desired
workflow.

Next steps
Learn more about the serverless implementations.
Microservices architecture on Azure Service Fabric
11/2/2020 • 27 minutes to read

This reference architecture shows a microservices architecture deployed to Azure Service Fabric. It shows a basic
cluster configuration that can be the starting point for most deployments.

A reference implementation of this architecture is available on GitHub.

NOTE
This article focuses on the Reliable Services programming model for Service Fabric. Using Service Fabric to deploy and
manage containers is beyond the scope of this article.

Architecture
The architecture consists of the following components. For other terms, see Service Fabric terminology overview.
Service Fabric cluster. A network-connected set of virtual machines (VMs) into which your microservices are deployed and managed.
Virtual machine scale sets. Virtual machine scale sets allow you to create and manage a group of identical, load balanced, and autoscaling VMs. It also provides the fault and upgrade domains.
Nodes. The nodes are the VMs that belong to the Service Fabric cluster.
Node types. A node type represents a virtual machine scale set that deploys a collection of nodes. A Service Fabric
cluster has at least one node type. In a cluster with multiple node types, one must be declared the Primary node
type. The primary node type in the cluster runs the Service Fabric system services. These services provide the
platform capabilities of Service Fabric. The primary node type also acts as the seed nodes for the cluster, which are
the nodes that maintain the availability of the underlying cluster. Configure additional node types to run your
services.
Services. A service performs a standalone function that can start and run independently of other services. Instances of services get deployed to nodes in the cluster. There are two varieties of service in Service Fabric:
Stateless service. A stateless service does not maintain state within the service. If state persistence is required, then state is written to and retrieved from an external store, such as Azure Cosmos DB.
Stateful service. The service state is kept within the service itself. Most stateful services implement this through Service Fabric's Reliable Collections.
Service Fabric Explorer. Service Fabric Explorer is an open-source tool for inspecting and managing Service Fabric clusters.
Azure Pipelines. Pipelines is part of Azure DevOps Services and runs automated builds, tests, and deployments.
You can also use third-party CI/CD solutions such as Jenkins.
Azure Monitor. Azure Monitor collects and stores metrics and logs, including platform metrics for the Azure
services in the solution and application telemetry. Use this data to monitor the application, set up alerts and
dashboards, and perform root cause analysis of failures. Azure Monitor integrates with Service Fabric to collect
metrics from controllers, nodes, and containers, as well as container and node logs.
Azure Key Vault. Use Key Vault to store any application secrets used by the microservices, such as connection
strings.
Azure API Management. In this architecture, API Management acts as an API gateway that accepts requests from
clients and routes them to your services.

Design considerations
This reference architecture is focused on microservices architectures. A microservice is a small, independently
versioned unit of code. It is discoverable through service discovery mechanisms and can communicate with other
services over APIs. Each service is self-contained and should implement a single business capability. For more
information about how to decompose your application domain into microservices, see Using domain analysis to
model microservices.
Service Fabric provides an infrastructure to build, deploy, and upgrade microservices efficiently. It also provides
options for auto scaling, managing state, monitoring health, and restarting services in case of failure.
Service Fabric follows an application model where an application is a collection of microservices. The application is
described in an application manifest file that defines the different types of service contained in that application, and
pointers to the independent service packages. The application package also usually contains parameters that serve
as overrides for certain settings used by the services. Each service package has a manifest file that describes the
physical files and folders that are necessary to run that service, including binaries, configuration files, and read-
only data for that service. Services and applications are independently versioned and upgradable.
Optionally, the application manifest can describe services that are automatically provisioned when an instance of
the application is created. These are called default services. In this case, the application manifest also describes how
these services should be created, including the service's name, instance count, security/isolation policy, and
placement constraints.
NOTE
Avoid using default services if you want to control the life time of your services. Default services are created when the
application is created, and run as long as the application is running.

For more information about understanding Service Fabric, see So you want to learn about Service Fabric?
Choose an application-to -service packaging model
A tenet of microservices is that each service can be independently deployed. In Service Fabric, if you group all of
your services into a single application package, and one service fails to upgrade, the entire application upgrade
gets rolled back, which prevents other services from being upgraded.
For that reason, in a microservices architecture, we recommend using multiple application packages. Put one or
more closely related service types into a single application type. If your team is responsible for a set of services that
run for the same duration and need to be updated at the same time, have the same lifecycle, or share resources
such as dependencies or configuration, then place those services types in the same application type.
Service Fabric programming models
When you add a microservice to a Service Fabric application, decide whether it has state or data that needs to be
made highly available and reliable. If so, can it store data externally or is the data contained as part of the service?
Choose a stateless service if you don't need to store data or want to store data in external storage. If you want to
maintain state or data as part of the service (for example, you need that data to reside in memory close to the
code), or cannot tolerate a dependency on an external store, consider choosing a stateful service.
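For illustration, the following minimal sketch shows both options using the Reliable Services programming model. The service names are hypothetical, and a real service would also register listeners and handle errors; this is not code from the reference implementation.

using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Data.Collections;
using Microsoft.ServiceFabric.Services.Runtime;

// Hypothetical stateless service: any data it needs lives in an external store.
internal sealed class CatalogService : StatelessService
{
    public CatalogService(StatelessServiceContext context) : base(context) { }
}

// Hypothetical stateful service: state is kept in a replicated reliable collection.
internal sealed class OrderService : StatefulService
{
    public OrderService(StatefulServiceContext context) : base(context) { }

    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        cancellationToken.ThrowIfCancellationRequested();

        var orders = await StateManager.GetOrAddAsync<IReliableDictionary<string, string>>("orders");

        using (var tx = StateManager.CreateTransaction())
        {
            // State changes are replicated to secondary replicas before the commit completes.
            await orders.AddOrUpdateAsync(tx, "order-1", "created", (key, existing) => "created");
            await tx.CommitAsync();
        }
    }
}
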
If you have existing code that you want to run on Service Fabric, you can run it as a guest executable, which is an
arbitrary executable that runs as a service. Alternatively, you can package the executable in a container that has all
the dependencies needed for deployment. Service Fabric models both containers and guest executables as stateless
services. For guidance about choosing a model, see Service Fabric programming model overview.
With guest executables, you are responsible for maintaining the environment in which it runs. For example, suppose
that a guest executable requires Python. If the executable is not self-contained, you need to make sure that the
required version of Python is pre-installed in the environment. Service Fabric does not manage the environment.
Azure offers multiple mechanisms to set up the environment, including custom virtual machine images and
extensions.
To access a guest executable through a reverse proxy, make sure you have added the UriScheme attribute to the
Endpoint element in the guest executable's service manifest.

<Endpoints>
  <Endpoint Name="MyGuestExeTypeEndpoint" Port="8090" Protocol="http" UriScheme="http" PathSuffix="api" Type="Input"/>
</Endpoints>

If the service has additional routes, specify the routes in the PathSuffix value. The value should not be prefixed or
suffixed with '/'. Another way is to add the route in the service name.

<Endpoints>
  <Endpoint Name="MyGuestExeTypeEndpoint" Port="8090" Protocol="http" PathSuffix="api" Type="Input"/>
</Endpoints>

For more information, see:


Package an application
Package and deploy an existing executable to Service Fabric
API gateway
An API gateway (ingress) sits between external clients and the microservices. It acts as a reverse proxy, routing
requests from clients to microservices. It may also perform various cross-cutting tasks such as authentication, SSL
termination, and rate limiting.
Azure API Management is recommended for most scenarios, but Traefik is a popular open-source alternative. Both
technology options are integrated with Service Fabric.
API Management exposes a public IP address and routes traffic to your services. It runs in a dedicated subnet in
the same virtual network as the Service Fabric cluster. It can access services in a node type that is exposed
through a load balancer with a private IP address. This option is only available in the Premium and Developer
tiers of API Management. For production workloads, use the Premium tier. Pricing information is described in
API Management pricing. For more information, see Service Fabric with Azure API Management overview.
Traefik supports features such as routing, tracing, logs, and metrics. Traefik runs as a stateless service in the
Service Fabric cluster. Service versioning can be supported through routing. For information on how to set up
Traefik for service ingress and as the reverse proxy within the cluster, see Azure Service Fabric Provider. For
more information about using Traefik with Service Fabric, see Intelligent routing on Service Fabric with Traefik
(blog post).
Traefik, unlike Azure API Management, does not have functionality to resolve the partition of a stateful service (with
more than one partition) to which a request is routed. For more information, see Add a matcher for partitioning
services.
Other API management options include Azure Application Gateway and Azure Front Door. These services can be
used in conjunction with API Management to perform tasks such as routing, SSL termination, and firewalling.
Interservice communication
To facilitate service-to-service communication, consider using HTTP as the communication protocol. As a baseline
for most scenarios, we recommend using the reverse proxy service for service discovery.
Communication protocol. In a microservices architecture, services need to communicate with each other with
minimum coupling at runtime. To enable language-agnostic communication, HTTP is an industry-standard with
a wide range of tools and HTTP servers that are available in different languages, all supported by Service Fabric.
Therefore, using HTTP instead of Service Fabric's built-in service remoting is recommended for most workloads.
Service discovery. To communicate with other services within a cluster, a client service needs to resolve the
target service's current location. In Service Fabric, services can move between nodes, causing the service
endpoints to change dynamically. To avoid connections to stale endpoints, Service Fabric's Naming Service can
be used to retrieve updated endpoint information. However, Service Fabric also provides a built-in reverse proxy
service that abstracts the naming service. This option is easier to use and results in simpler code.
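As a minimal sketch, a service can call another service through the reverse proxy by addressing it with its application and service names. This assumes the default reverse proxy port (19081); the application, service, and route names below are hypothetical.

using System.Net.Http;
using System.Threading.Tasks;

public class DeliveryClient
{
    private static readonly HttpClient _http = new HttpClient();

    // The Service Fabric reverse proxy listens on port 19081 by default and resolves
    // the current endpoint for the named service, so the caller doesn't track endpoints.
    public Task<string> GetDeliveryStatusAsync(string id) =>
        _http.GetStringAsync($"http://localhost:19081/MyApp/DeliveryService/api/deliveries/{id}");
}
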
Other options for interservice communication include:
Traefik for advanced routing.
DNS for compatibility scenarios where a service expects to use DNS.
ServicePartitionClient<TCommunicationClient> class. The class caches service endpoints and can enable better
performance, as calls go directly between services without intermediaries or custom protocols.

Scalability considerations
Service Fabric supports scaling these cluster entities:
Scaling the number of nodes for each node type.
Scaling services.
This section focuses on autoscaling. You can also scale manually where appropriate, for example, when manual
intervention is required to set the number of instances.
Initial cluster configuration for scalability
When you create a Service Fabric cluster, provision the node types based on your security and scalability needs.
Each node type is mapped to a virtual machine scale set and can be scaled independently.
Create a node type for each group of services that have different scalability or resource requirements. Start by
provisioning a node type (which becomes the Primary node type) for the Service Fabric system services. Then
create separate node types to run your public or front-end services, and other node types as necessary for your
backend and private or isolated services. Specify placement constraints so that the services are only deployed to
the intended node types.
Specify the durability tier for each node type. The durability tier represents the ability for Service Fabric to
influence virtual machine scale set updates and maintenance operations. For production workloads, choose the
Silver or higher durability tier. For information about each tier, see The durability characteristics of the cluster.
If you use the Bronze durability tier, certain operations require manual steps; in particular, node types with Bronze
durability require additional steps during scale-in. For more information on scaling operations, see this guide.
Scaling nodes
Service Fabric supports autoscaling for scale-in and scale-out. Each node type can be configured for autoscaling
independently.
Each node type can have a maximum of 100 nodes. Start with a smaller set of nodes and add more nodes
depending on your load. If you require more than 100 nodes in a node type, you will need to add more node types.
For details, see Service Fabric cluster capacity planning considerations. A virtual machine scale set does not scale
instantaneously, so consider that factor when you set up autoscale rules.
To support automatic scale-in, configure the node type to have the Silver or Gold durability tier. This makes sure
that scaling in is delayed until Service Fabric is finished relocating services and that the virtual machine scale sets
inform Service Fabric that the VMs are removed, not just down temporarily.
For more information about scaling at the node/cluster level, see Scaling Azure Service Fabric clusters.
Scaling services
Stateless and stateful services apply different approaches to scaling.
Autoscaling for stateless services
Use the average partition load trigger. This trigger determines when the service is scaled in or out, based on a
load threshold value specified in the scaling policy. You can also set how often the trigger is checked. See
Average partition load trigger with instance-based scaling. This allows you to scale up to the number of
available nodes.
Set InstanceCount to -1 in the service manifest, which tells Service Fabric to run an instance of the service on
every node. This approach enables the service to scale dynamically as the cluster scales. As the number of
nodes in the cluster changes, Service Fabric automatically creates and deletes service instances to match.
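If you create services programmatically rather than declaring them in the manifest, the same effect can be achieved by setting InstanceCount to -1 in the service description. The following is a minimal sketch with hypothetical application, service, and service type names, not the reference implementation's code.

using System;
using System.Fabric;
using System.Fabric.Description;
using System.Threading.Tasks;

public static class ServiceScaling
{
    // Creates a stateless service with InstanceCount = -1, i.e. one instance per node,
    // so the service scales with the cluster as nodes are added or removed.
    public static Task CreateOnEveryNodeAsync()
    {
        var client = new FabricClient();
        var description = new StatelessServiceDescription
        {
            ApplicationName = new Uri("fabric:/MyApp"),
            ServiceName = new Uri("fabric:/MyApp/FrontendService"),
            ServiceTypeName = "FrontendServiceType",
            PartitionSchemeDescription = new SingletonPartitionSchemeDescription(),
            InstanceCount = -1
        };
        return client.ServiceManager.CreateServiceAsync(description);
    }
}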

NOTE
In some cases, you might want to manually scale your service. For example, if you have a service that reads from Event Hubs,
you might want a dedicated instance to read from each event hub partition, to avoid concurrent access to the partition.

Scaling for stateful services


For a stateful service, scaling is controlled by the number of partitions, the size of each partition, and the number of
partitions/replicas running on a given machine.
If you are creating partitioned services, make sure each node gets adequate replicas for even distribution of the
workload without causing resource contention. If more nodes are added, Service Fabric distributes the
workloads onto the new machines by default. For example, if there are 5 nodes and 10 partitions, by default
Service Fabric will place two primary replicas on each node. If you scale out the nodes, you can achieve greater
performance, because the work is evenly distributed across more resources. For information about scenarios
that take advantage of this strategy, see Scaling in Service Fabric.
Adding or removing partitions is not well supported. Another option commonly used to scale is to dynamically
create or delete services or whole application instances. An example of that pattern is described in Scaling by
creating or removing new named services.
For more information, see:
Scale a Service Fabric cluster in or out using autoscale rules or manually
Scale a Service Fabric cluster programmatically
Scale a Service Fabric cluster out by adding a virtual machine scale set
Using metrics to balance load
Depending on how you design the partition, you might have nodes with replicas that get more traffic than others.
To avoid this situation, partition the service state so that it is distributed across all partitions. Use the range
partitioning scheme with a good hash algorithm. See Get started with partitioning.
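For example, a string identifier can be mapped onto the Int64 key space of a ranged partition scheme with a stable hash. This is an illustrative sketch, not code from the reference implementation.

using System;
using System.Security.Cryptography;
using System.Text;

public static class PartitionKeyHasher
{
    // Maps an arbitrary string key (for example, a device or customer ID) to an Int64
    // partition key for a ranged (UniformInt64) partition scheme. MD5 is used here only
    // to spread keys evenly across the key space, not for any security purpose.
    public static long ToPartitionKey(string key)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(key));
            return BitConverter.ToInt64(hash, 0);
        }
    }
}
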
Service Fabric uses metrics to know how to place and balance services within a cluster. You can specify a default
load for each metric associated with a service when that service is created. Service Fabric then takes that load into
account when placing the service, or whenever the service needs to move (for example, during upgrades) to try to
balance the nodes in the cluster.
The initially specified default load for a service will not change over the lifetime of the service. To capture changing
metrics for a given service, we recommend that you monitor your service and then report the load dynamically.
This allows Service Fabric to adjust the allocation based on the reported load at a given time. Use the
IServicePartition.ReportLoad method to report custom metrics. For more information, see Dynamic load.
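A minimal sketch of reporting load from inside a service might look like the following; the metric names and values are illustrative and must match the metrics configured for the service.

using System.Collections.Generic;
using System.Fabric;

public static class LoadReporter
{
    // Report the current load for custom metrics so that Service Fabric can rebalance
    // replicas based on actual usage rather than the static default load.
    public static void ReportCurrentLoad(IServicePartition partition, int activeConnections, int memoryInMb)
    {
        partition.ReportLoad(new List<LoadMetric>
        {
            new LoadMetric("ActiveConnections", activeConnections),
            new LoadMetric("MemoryInMb", memoryInMb)
        });
    }
}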

Availability considerations
Place your services in a node type other than the primary node type. The Service Fabric system services are always
deployed to the primary node type. If your services are deployed to the primary node type, they might compete
with system services for resources and interfere with the system services. If a node type is expected to host stateful
services, make sure there are at least five node instances and you select the Silver or Gold Durability tier.
Consider constraining the resources of your services. See Resource governance mechanism.
Do not mix resource governed and resource non-governed services on the same node type. The non-governed
services might consume too many resources, affecting the resource governed services. Specify placement
constraints to make sure that those types of services do not run on the same set of nodes. See Specify resource
governance. (This is an example of the Bulkhead pattern.)
Specify the CPU cores and memory to reserve for a service instance. For information about usage and
limitations of resource governance policies, see Resource governance.
Make sure every service's target instance or replica count is greater than 1 to avoid a single point of failure (SPOF).
The largest number that you can use as the service instance or replica count equals the number of nodes to which
the service is constrained.
Make sure every stateful service has at least two active secondary replicas. Five replicas are recommended for
production workloads.
For more information, see Availability of Service Fabric services.

Security considerations
Here are some key points for securing your application on Service Fabric:
Virtual network
Consider defining subnet boundaries for each virtual machine scale set to control the flow of communication. Each
node type has its own virtual machine scale set in a subnet within the Service Fabric cluster's virtual network.
Network Security Groups (NSGs) can be added to the subnets to allow or reject network traffic. For example, with
front-end and back-end node types, you can add an NSG to the back-end subnet to accept inbound traffic only from
the front-end subnet.
When calling external Azure Services from the cluster, use Virtual Network service endpoints if the Azure service
supports it. Using a service endpoint secures the service to only the cluster's Virtual Network. For example, if you
are using Cosmos DB to store data, configure the Cosmos DB account with a service endpoint to allow access only
from a specific subnet. See Access Azure Cosmos DB resources from virtual networks.
Endpoints and interservice communication
Do not create an unsecured Service Fabric cluster. If the cluster exposes management endpoints to the public
internet, anonymous users can connect to it. Unsecured clusters are not supported for production workloads. See:
Service Fabric cluster security scenarios.
To secure your interservice communications:
Consider enabling HTTPS endpoints in your ASP.NET Core or Java web services.
Establish a secure connection between the reverse proxy and services. For details, see Connect to a secure
service.
If you are using an API gateway, you can offload authentication to the gateway. Make sure that the individual
services cannot be reached directly (without the API gateway) unless additional security is in place to verify that
messages actually come from the gateway.
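One simple, illustrative approach is to require a header value that only the gateway injects, for example through an ASP.NET Core middleware like the sketch below. The header name and the way the key is provisioned are assumptions; mutual TLS or token validation are stronger options.

using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

public static class GatewayOnlyMiddleware
{
    // Rejects requests that don't carry the header value injected by the API gateway.
    public static IApplicationBuilder UseGatewayOnly(this IApplicationBuilder app, string expectedKey) =>
        app.Use(async (context, next) =>
        {
            if (context.Request.Headers["X-Gateway-Key"] != expectedKey)
            {
                context.Response.StatusCode = StatusCodes.Status403Forbidden;
                return;
            }
            await next();
        });
}
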
Do not expose the Service Fabric reverse proxy publicly. Doing so causes all services that expose HTTP endpoints to
be addressable from outside the cluster, introducing security vulnerabilities and potentially exposing additional
information outside the cluster unnecessarily. If you want to access a service publicly, use an API gateway. Some
options are mentioned in the API gateway section.
Remote desktop is useful for diagnostics and troubleshooting, but make sure not to leave it open; otherwise, it
creates a security hole.
Secrets and certificates
Store secrets such as connection strings to data stores in Azure Key Vault. The Key Vault must be in the same region
as the virtual machine scale set. You will need to:
Authenticate the service's access to the Key Vault.
Enable managed identity on the virtual machine scale set that hosts the service.
Store your secrets in the Key Vault.
Add secrets in a format that can be translated to a key-value pair. For example, CosmosDB--AuthKey. When
the configuration is built, "--" is converted into ":".
Access those secrets in your service.
Add the Key Vault URI in your appSettings.json. In your service, add the configuration provider that reads
from the Key Vault, builds the configuration, and accesses the secret from the built configuration.
Here's an example where the Workflow service stores a secret in the Key Vault in the format "CosmosDB--
Database".
namespace Fabrikam.Workflow.Service
{
    public class ServiceStartup
    {
        public static void ConfigureServices(StatelessServiceContext context, IServiceCollection services)
        {
            var preConfig = new ConfigurationBuilder()
                .AddJsonFile(context, "appsettings.json");

            var config = preConfig.Build();

            if (config["AzureKeyVault:KeyVaultUri"] is var keyVaultUri &&
                !string.IsNullOrWhiteSpace(keyVaultUri))
            {
                preConfig.AddAzureKeyVault(keyVaultUri);
                config = preConfig.Build();
            }
        }
    }
}

To access the secret, specify the secret name in the built config.

if (builtConfig["CosmosDB:Database"] is var database && !string.IsNullOrEmpty(database))
{
    // Use the secret.
}

Do not use client certificates to access Service Fabric Explorer. Instead, use Azure Active Directory (Azure AD). Also
see, Azure services that support Azure AD authentication.
Do not use self-signed certificates for production.
Data at rest protection
If you have attached data disks to the virtual machine scale sets of the Service Fabric cluster and your services save
data on those disks, you must encrypt the disks. For more information, see Encrypt OS and attached data disks in a
virtual machine scale set with Azure PowerShell (Preview).
For more information about securing Service Fabric, see:
Azure Service Fabric security overview
Azure Service Fabric security best practices
Azure Service Fabric security checklist

Resiliency considerations
To recover from failures and maintain a fully functioning state, the application must implement certain resiliency
patterns. Here are some common patterns:
Retry pattern: To handle errors that are expected to be transient, such as resources being temporarily
unavailable.
Circuit breaker: To address faults that might need longer to fix.
Bulkhead pattern: To isolate resources per service.
This reference implementation uses Polly, an open-source option, to implement all of those patterns.
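The following is a minimal sketch of how those patterns might be combined with Polly when one service calls another over HTTP. It is not the reference implementation's code; the retry count, backoff, and break duration are illustrative assumptions.

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;

public class ResilientCaller
{
    private static readonly HttpClient _http = new HttpClient();

    // Retry transient failures with exponential backoff (Retry pattern).
    private static readonly IAsyncPolicy _retry =
        Policy.Handle<HttpRequestException>()
              .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

    // Stop calling a failing dependency for 30 seconds after 5 consecutive failures (Circuit Breaker pattern).
    private static readonly IAsyncPolicy _breaker =
        Policy.Handle<HttpRequestException>()
              .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));

    // The retry policy wraps the circuit breaker, so each retry attempt respects the breaker state.
    public Task<string> GetAsync(string url) =>
        _retry.WrapAsync(_breaker).ExecuteAsync(() => _http.GetStringAsync(url));
}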

Monitoring considerations
Before you explore the monitoring options, we recommend you read this article about diagnosing common
scenarios with Service Fabric. You can think of monitoring data in these sets:
Application metrics and logs
Service Fabric health and event data
Infrastructure metrics and logs
Metrics and logs for dependent services
These are the two main options for analyzing that data:
Application Insights
Log Analytics
You can use Azure Monitor to set up dashboards for monitoring and to send alerts to operators. There are also
some third-party monitoring tools that are integrated with Service Fabric, such as Dynatrace. For details, see Azure
Service Fabric Monitoring Partners.
Application metrics and logs
Application telemetry provides data about your service that can help you monitor the health of your service and
identify issues. To add traces and events in your service:
Use Microsoft.Extensions.Logging if you are developing your service with ASP.NET Core. For other frameworks, use a
logging library of your choice, such as Serilog.
You can add your own instrumentation by using the TelemetryClient class in the SDK and view the data in
Application Insights; a minimal sketch follows this list. See Add custom instrumentation to your application.
Log ETW events by using EventSource. This option is available by default in a Visual Studio Service Fabric
solution.
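As a rough sketch, custom events and metrics can be emitted through TelemetryClient as shown below. The class, event, and metric names are hypothetical, not taken from the reference implementation.

using System.Collections.Generic;
using Microsoft.ApplicationInsights;

public class WorkflowTelemetry
{
    private readonly TelemetryClient _telemetry;

    public WorkflowTelemetry(TelemetryClient telemetry) => _telemetry = telemetry;

    // Emit a custom event and metric for a hypothetical business operation.
    public void TrackDeliveryScheduled(string deliveryId, double schedulingMilliseconds)
    {
        _telemetry.TrackEvent("DeliveryScheduled",
            new Dictionary<string, string> { ["DeliveryId"] = deliveryId });
        _telemetry.TrackMetric("DeliverySchedulingTimeMs", schedulingMilliseconds);
    }
}
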
Application Insights provides a lot of built-in telemetry: requests, traces, events, exceptions, metrics, dependencies.
If your service exposes HTTP endpoints, enable Application Insights by calling the UseApplicationInsights
extension method for Microsoft.AspNetCore.Hosting.IWebHostBuilder. For information about instrumenting your
service for Application Insights, see these articles:
Tutorial: Monitor and diagnose an ASP.NET Core application on Service Fabric using Application Insights.
Application Insights for ASP.NET Core
Application Insights .NET SDK
Application Insights SDK for Service Fabric
To view the traces and event logs, use Application Insights as one of the sinks for structured logging. Configure
Application Insights with your instrumentation key by calling the AddApplicationInsights extension method. In
this example, the instrumentation key is stored as a secret in the Key Vault.

.ConfigureLogging((hostingContext, logging) =>
{
    logging.AddApplicationInsights(hostingContext.Configuration["ApplicationInsights:InstrumentationKey"]);
})

If your service does not expose HTTP endpoints, you need to write a custom extension that sends traces to
Application Insights. For an example, see the Workflow service in the reference implementation.
ASP.NET Core services use the ILogger interface for application logging. To make these application logs available in
Azure Monitor, send the ILogger events to Application Insights. For more information, see ILogger in an ASP.NET
Core application. Application Insights can add correlation properties to ILogger events, useful for visualizing
distributed tracing.
For more information, see:
Application logging
Add logging to your Service Fabric application
Service Fabric health and event data
Service Fabric telemetry includes health metrics and events about the operation and performance of a Service
Fabric cluster and its entities: its nodes, applications, services, partitions, and replicas.
EventStore. A stateful system service that collects events related to the cluster and its entities. Service Fabric
uses EventStore to write Service Fabric events that provide information about your cluster; these events can be
used for status updates, troubleshooting, and monitoring. EventStore can also correlate events from different
entities at a given time to identify issues in the cluster. The service exposes those events through a REST API. For
information about how to query the EventStore APIs, see Query EventStore APIs for cluster events. You can view
the events from EventStore in Log Analytics by configuring your cluster with the WAD extension.
HealthStore. Provides a snapshot of the current health of the cluster. A stateful service that aggregates all health
data reported by entities in a hierarchy. The data is visualized in Service Fabric Explorer. The HealthStore also
monitors application upgrades. You can use health queries in PowerShell, a .NET application, or REST APIs. See,
Introduction to Service Fabric health monitoring.
Consider implementing internal custom watchdog services. Those services can periodically report custom
health data such as faulty states of running services. For more information, see Custom health reports. You can
read the health reports using Service Fabric Explorer.
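A minimal watchdog report might look like the following sketch. The source ID, property name, and service URI are hypothetical, and a production watchdog would typically run on a timer and also report Ok states when the dependency recovers.

using System;
using System.Fabric;
using System.Fabric.Health;

public class WatchdogReporter
{
    private readonly FabricClient _fabricClient = new FabricClient();

    // Reports a warning against a hypothetical service when an external dependency
    // check fails. The report appears in Service Fabric Explorer and the HealthStore.
    public void ReportDependencyUnhealthy(string details)
    {
        var healthInformation = new HealthInformation(
            sourceId: "CustomWatchdog",
            property: "ExternalDependency",
            healthState: HealthState.Warning)
        {
            Description = details,
            TimeToLive = TimeSpan.FromMinutes(5),
            RemoveWhenExpired = true
        };

        _fabricClient.HealthManager.ReportHealth(
            new ServiceHealthReport(new Uri("fabric:/MyApp/WorkflowService"), healthInformation));
    }
}
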
Infrastructure metrics and logs
Infrastructure metrics help you to understand resource allocation in the cluster. Here are the main options for
collecting this information:
Windows Azure Diagnostics (WAD). Collect logs and metrics at the node level on Windows. You can use WAD by
configuring the IaaSDiagnostics VM extension on any virtual machine scale set that is mapped to a node type to
collect diagnostic events, such as Windows event logs, performance counters, ETW/manifests system and
operational events, and custom logs.
Log Analytics agent. Configure the MicrosoftMonitoringAgent VM extension to send Windows event logs,
performance counters, and custom logs to Log Analytics.
There is some overlap in the type of metrics collected through the preceding mechanisms, such as performance
counters. Where there is overlap, we recommend using the Log Analytics agent. Because the Log Analytics agent
doesn't stage data in Azure storage, it has lower latency. Also, the performance counters collected by
IaaSDiagnostics cannot be fed into Log Analytics easily.

For information about using VM extensions, see Azure virtual machine extensions and features.
To view the data, configure Log Analytics to view the data collected through WAD. For information about how to
configure Log Analytics to read events from a storage account, see Set up Log Analytics for a cluster.
You can also view performance logs and telemetry data related to a Service Fabric cluster, workloads, network
traffic, pending updates, and more. See Performance Monitoring with Log Analytics.
Service Map solution in Log Analytics provides information about the topology of the cluster (that is, the processes
running in each node). Send the data in the storage account to Application Insights. There might be some delay in
getting data into Application Insights. If you want to see the data in real time, consider configuring an event hub by
using sinks and channels. For more information, see Event aggregation and collection using Windows Azure Diagnostics.
Dependent service metrics
Application Insights Application Map provides the topology of the application by using HTTP dependency calls
made between services, with the installed Application Insights SDK.
Service Map solution in Log Analytics provides information about inbound and outbound traffic from/to
external services. In addition, Service Map integrates with other solutions such as updates or security.
Custom watchdogs can be used to report error conditions on external services. For example, the service could
report an error health report if it cannot access an external service or data storage (Azure Cosmos DB).
Distributed tracing
In microservices architecture, several services often participate to complete a task. The telemetry from each of
those services is correlated by using context fields (operation ID, request ID, and so forth) in a distributed trace. By
using Application Map in Application Insights, you can build the view of distributed logical operation and visualize
the entire service graph of your application. You can also use transaction diagnostics in Application Insight to
correlate server-side telemetry. For more information, see Unified cross-component transaction diagnostics.
It's also important to correlate tasks that are dispatched asynchronously using a queue. For details about sending
correlation telemetry in a queue message, see
Queue instrumentation.
For more information, see:
Performing a query across multiple resources
Telemetry correlation in Application Insights
Alerts and Dashboards
Application Insights and Log Analytics support an extensive query language (Kusto query language) that lets you
retrieve and analyze log data. Use the queries to create data sets and visualize them in diagnostics dashboards.
Use Azure Monitor alerts to notify sysadmins when certain conditions occur in specific resources. The notification
could be an email, an Azure function, a webhook call, and so on. For more information, see Alerts in Azure Monitor.
Log search alert rules allow you to define and run a Kusto query against a Log Analytics workspace at regular
intervals. An alert is created if the query result matches a certain condition.

Cost considerations
Use the Azure pricing calculator to estimate costs. Other considerations are described in the Cost section in
Microsoft Azure Well-Architected Framework.
Here are some points to consider for some of the services used in this architecture.
Azure Service Fabric
You are charged for the compute instances, storage, networking resources, and IP addresses you choose when
creating a Service Fabric cluster. There are no separate charges for the Service Fabric service itself.
Virtual machine scale sets
In this architecture, the microservices are deployed into nodes that are virtual machine scale sets. You are charged
for the Azure VMs that are deployed as part of the cluster and underlying infrastructure resources, such as storage
and networking. There are no incremental charges for the virtual machine scale sets service.
Azure API Management (APIM)
APIM is used as a gateway to route the requests from clients to your services in the cluster.
There are different pricing options. The Consumption option is charged on a pay-per-use basis and includes a
gateway component. Based on your workload, choose an option described in API Management pricing.
Azure Application Insights
Application Insights is used to collect telemetry for all services and to view the traces and event logs in a
structured way. The pricing for Azure Application Insights is a Pay-As-You-Go model based on data volume
ingested and optionally for longer data retention. For more information, see Manage Usage and Cost For
Application Insights.
Azure Monitor
For Azure Monitor Log Analytics, you are charged for data ingestion and retention. For more information, see
Azure Monitor pricing.
Azure Key Vault
Azure Key Vault is used to store the Application Insight's instrumentation key as a secret. Azure Key Vault is offered
in two service tiers. If you don't need HSM-protected keys, choose the Standard tier. For information about the
features in each tier, see Key Vault pricing.
Azure DevOps Services
This reference architecture only uses Azure Pipelines. Azure offers Azure Pipelines as an individual service. You
are allowed one free Microsoft-hosted job with 1,800 minutes per month for CI/CD and one self-hosted job with
unlimited minutes per month; additional jobs incur charges. For more information, see Azure DevOps Services Pricing.

DevOps considerations
Azure Pipelines
The reference implementation is deployed using Azure Pipelines. For DevOps considerations in a microservices
architecture, see CI/CD for microservices.
You can also learn how to deploy a container application with CI/CD to a Service Fabric cluster in this tutorial.

Deploy the solution


To deploy the reference implementation for this architecture, follow the steps in the GitHub repo.

Next steps
Using domain analysis to model microservices
Designing a microservices architecture
Serverless Framework multicloud solutions
11/2/2020 • 7 minutes to read

This article describes how the Microsoft Commercial Software Engineering (CSE) team partnered with a global
retailer to deploy a highly-available serverless solution across both Azure and Amazon Web Services (AWS) cloud
platforms, using the Serverless Framework.
In serverless computing, the cloud provider dynamically allocates microservices resources to run code, and only
charges for the resources used. Serverless computing abstracts app code from infrastructure implementation, code
deployment, and operational aspects like planning and maintenance.
As with other services, each cloud provider has its own serverless implementation, and it's difficult for customers
to use a different provider without considerable operational impact and costs. Potential customers may view this
situation as weakening their bargaining position and agility. Vendor lock-in is one of the greatest obstacles to
enterprise cloud adoption.
The open-source Serverless Framework is a universal cloud interface for developing and deploying serverless
computing solutions across cloud providers. Open-sourcing and common APIs for serverless functions help
providers, customers, and partners build cross-cloud solutions for best-of-breed services. The Serverless
Framework reduces barriers to cloud adoption by addressing the problems of vendor lock-in and cross-cloud
provider redundancy. Customers can optimize their solutions based on cost, agility, and other considerations.
CSE and the Azure product team collectively rewrote the Serverless CLI to support new Azure Functions features
like Premium Functions, API Management, and KeyVault. The Serverless CLI now provides a standard interface for
GitOps deployment to both Azure and AWS. The team also developed the Serverless Multicloud Library, which
provides a normalized runtime API to deploy serverless apps to both AWS and Azure.
This design provides high availability with active-active failover between multiple cloud platforms, as opposed to
active-passive failover. If the service of one cloud provider becomes unhealthy or unavailable, this solution can
reroute requests to another cloud platform.
This project met the following technical goals:
Create a cross-industry solution.
Use the Multicloud Serverless Library to support a cloud-agnostic API that interfaces with microservices
wherever they are deployed.
Support a GitOps CI/CD process workflow for development, testing, and deployment on all supported cloud
platforms.
Use API-based access via an authenticated cloud gateway, and load balance between cloud platforms by using
the gateway as a router.
Other potential benefits of using the Serverless Framework include:
Prevention or reduction of vendor lock-in
40-60+% code reduction during development by using the Multicloud Serverless Library
Development of best-of-breed solutions that combine different cloud providers' services
Elimination of most platform and infrastructure complexity and maintenance requirements
Easier data sharing, performance and cost comparisons, and ability to take advantage of special offerings
Active-active high availability

Potential use cases


Write client-side applications for multiple platforms by using a cloud-agnostic API from the Serverless
Multicloud Library.
Deploy a collection of functional microservices in a serverless framework to multiple cloud platforms.
Use a cloud-agnostic app across cloud platforms without knowing or caring which platform is hosting it.

Architecture

The user app can come from any source capable of logging into the cloud. In this implementation, the user logs
into a gateway app that load balances requests 50-50 between the Azure and AWS clouds.
Any response also routes through the API Manager gateway, which then sends it to the requestor app.
The Serverless Framework
This solution uses the Serverless Framework, available from Serverless, Inc. The free version of the Serverless
Framework includes a CLI, additional plugins, and limited monitoring services. The Pro edition features operational
capabilities across clouds, such as enhanced monitoring and alerts. The framework supports Node.js and Python
languages, and both AWS and Azure cloud hosts.
To use Azure with the Serverless Framework, you need:
Node.js, to package microservices
Azure Functions, to provide functionality comparable to other cloud platforms
The Serverless Framework, to support multicloud deployment and monitoring
The Serverless Multicloud Library, to provide normalized runtime APIs for developers
The Azure Functions Serverless Plugin, to support multicloud deployment. This plugin wasn't initially up to
parity with the comparable AWS Lambda plug-in, and was extended for this project.
The following figure shows the processing pipeline. The middleware layers represent any intermediate functionality
needed before reaching the handler.
Cloud-agnostic APIs
The serverless implementation on each platform supports individual functions as microservices, one to each
functional VM node, and executes processing as needed. Each AWS Lambda function has a corresponding Azure
Functions function. The Serverless Multicloud Library builds analogous microservices from either cloud into a
cloud-agnostic normalized REST API that client apps can use to interface with either platform. Because the
abstracted API layer provides code to address the corresponding microservices for each platform, transactions
don't need translation. The cloud-agnostic interface lets user apps interact with the cloud without knowing or
caring which cloud platform they're accessing.
The following diagram illustrates this concept:

CI/CD with GitOps


A primary job of the Serverless Framework is to abstract away all the infrastructure concerns of deploying an app
to the cloud. Using a manifest-based approach, the Serverless Framework can deal with all deployment issues,
allowing deployment to be automated as needed to support GitOps.
Although this initial project used manual deployments, it's realistic to implement manifest-driven serverless builds,
tests, and deployments within or across clouds. This process can use a GitOps developer workflow: building from
Git, using quality gates for test and evaluation, and pushing serverless solutions onto both cloud providers.
Performing all deployments using the Serverless Framework from the beginning of the project is the most efficient
way to proceed.
API manager
The API Manager can be an existing or custom application. The Apigee™ API Manager in this implementation
acted only as a router to provide a 50-50 transaction load balance to the two cloud platforms, and was
underutilized for its capabilities.
The API Manager must be able to:
Be deployed inside or outside a cloud platform as needed
Route messages to and from both cloud platforms
Log traffic requests to coordinate asynchronous message traffic
Relay requests and responses using the common REST API from and to the user application
Monitor the health of both cloud serverless framework deployments to validate their ability to receive requests
Perform automated health and availability checks on each cloud platform, to support routing and high
availability

Alternatives
Other languages such as Python could implement the solution, as long as they're supported by the
serverless implementations of the cloud platforms, AWS Lambda and Azure Functions in this case. This
project used Node.js to package the microservices, because the customer was comfortable with Node.js, and
both AWS and Azure platforms support it.
The solution can use any cloud platform that supports the Serverless Framework, not just Azure and AWS.
Currently, the Serverless Framework reports compatibility with eight different cloud providers. The only
caveat is to ensure that the elements that support the multicloud architecture or its equivalent are available
on the target cloud platforms.
The API Manager in this initial implementation acted only as a router to provide a 50-50 transaction load
balance to the two cloud platforms. The API Manager could incorporate other business logic for specific
scenarios.

Considerations
This article doesn't describe security solutions, although the initial deployment included them. There are
many possible security solutions, some platform dependent, and this framework should accommodate any
reasonable solution. User authentication is the minimum security assumed.
Because it's difficult to articulate the differences between AWS and Azure serverless functional offerings,
early effort should focus on mapping the functions available on each cloud platform and identifying
necessary transformation requirements. You can develop a platform-agnostic API from this information.
Using an open-source solution may introduce risks, due to long-term maintenance and support challenges
with any open-source software.
In the free Serverless Framework, monitoring is limited primarily to the administrative dashboard.
Monitoring is usually available in the paid enterprise offering. Currently, the Azure Functions Serverless
Plugin doesn't include provisions for observability or monitoring, and would need modification to
implement these capabilities.
This solution could use metrics to compare performance and costs between cloud platforms, enabling
customers to seamlessly optimize usage across cloud platforms.
Deploy the solution
A traditional Blue-Green Deployment develops and deploys an app to two separate but identical environments,
blue and green, increasing availability and reducing risk. The blue environment is usually the production
environment that normally handles live traffic, and the green environment is a failover deployment as needed.
Typically, the CI/CD pipeline automatically deploys both blue and green environments within the same cloud
platform. This configuration is considered an active-passive configuration.
In the multicloud solution, blue-green deployment is implemented in both cloud platforms. In the serverless case,
two duplicate sets of microservices are deployed for each cloud platform, one as the production environment and
the other as the failover environment. This active-passive setup within each cloud platform reduces the risk that
this platform will be down, increasing its availability, and enabling multicloud active-active high availability.

A secondary benefit of blue-green deployment is the ability to use the failover deployment on each cloud platform
as a test environment for microservices updates, before releasing them to the production deployment.

Related resources
Sample code and README for this implementation on GitHub
Serverless Framework
Serverless application architectures using Event Grid
11/2/2020 • 2 minutes to read

Solution Idea

Event Grid connects data sources and event handlers. For example, use Event Grid to instantly trigger a serverless
function to run image analysis each time a new photo is added to a blob storage container.

Architecture

Download an SVG of this architecture.


Serverless event processing using Azure Functions
11/2/2020 • 7 minutes to read

This reference architecture shows a serverless, event-driven architecture that ingests a stream of data, processes
the data, and writes the results to a back-end database.

A reference implementation for this architecture is available on GitHub.

Architecture
Event Hubs ingests the data stream. Event Hubs is designed for high-throughput data streaming scenarios.

NOTE
For IoT scenarios, we recommend IoT Hub. IoT Hub has a built-in endpoint that’s compatible with the Azure Event Hubs API,
so you can use either service in this architecture with no major changes in the backend processing. For more information,
see Connecting IoT Devices to Azure: IoT Hub and Event Hubs.

Function App. Azure Functions is a serverless compute option. It uses an event-driven model, where a piece of
code (a “function”) is invoked by a trigger. In this architecture, when events arrive at Event Hubs, they trigger a
function that processes the events and writes the results to storage.
Function Apps are suitable for processing individual records from Event Hubs. For more complex stream
processing scenarios, consider Apache Spark using Azure Databricks, or Azure Stream Analytics.
Cosmos DB. Cosmos DB is a multi-model database service. For this scenario, the event-processing function
stores JSON records, using the Cosmos DB SQL API.
Queue storage. Queue storage is used for dead letter messages. If an error occurs while processing an event, the
function stores the event data in a dead letter queue for later processing. For more information, see Resiliency
Considerations.
Azure Monitor. Monitor collects performance metrics about the Azure services deployed in the solution. By
visualizing these in a dashboard, you can get visibility into the health of the solution.
Azure Pipelines. Pipelines is a continuous integration (CI) and continuous delivery (CD) service that builds, tests,
and deploys the application.
Scalability considerations
Event Hubs
The throughput capacity of Event Hubs is measured in throughput units. You can autoscale an event hub by
enabling auto-inflate, which automatically scales the throughput units based on traffic, up to a configured
maximum.
The Event Hub trigger in the function app scales according to the number of partitions in the event hub. Each
partition is assigned one function instance at a time. To maximize throughput, receive the events in a batch, instead
of one at a time.
Cosmos DB
Throughput capacity for Cosmos DB is measured in Request Units (RU). In order to scale a Cosmos DB container
past 10,000 RU, you must specify a partition key when you create the container, and include the partition key in
every document that you create.
Here are some characteristics of a good partition key:
The key value space is large.
There will be an even distribution of reads/writes per key value, avoiding hot keys.
The maximum data stored for any single key value will not exceed the maximum physical partition size (10 GB).
The partition key for a document won't change. You can't update the partition key on an existing document.
In the scenario for this reference architecture, the function stores exactly one document per device that is sending
data. The function continually updates the documents with latest device status, using an upsert operation. Device
ID is a good partition key for this scenario, because writes will be evenly distributed across the keys, and the size
of each partition will be strictly bounded, because there is a single document for each key value. For more
information about partition keys, see Partition and scale in Azure Cosmos DB.
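As a rough illustration of this pattern, the sketch below upserts one document per device with the device ID as the partition key, using the Microsoft.Azure.Cosmos .NET SDK. The database name, container name, and class shape are assumptions for illustration, not the reference implementation's exact code.

using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public class DeviceStateStore
{
    private readonly Container _container;

    public DeviceStateStore(CosmosClient client) =>
        _container = client.GetContainer("devicedb", "devicestate");

    // One document per device: DeviceId doubles as both the document id and the
    // partition key, so writes spread evenly and each logical partition stays small.
    public Task UpsertAsync(DeviceState state) =>
        _container.UpsertItemAsync(state, new PartitionKey(state.DeviceId));
}

public class DeviceState
{
    public string id => DeviceId;          // Cosmos DB document id
    public string DeviceId { get; set; }
    public string Status { get; set; }
}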

Resiliency considerations
When using the Event Hubs trigger with Functions, catch exceptions within your processing loop. If an unhandled
exception occurs, the Functions runtime does not retry the messages. If a message cannot be processed, put the
message into a dead letter queue. Use an out-of-band process to examine the messages and determine corrective
action.
The following code shows how the ingestion function catches exceptions and puts unprocessed messages onto a
dead letter queue.
[FunctionName("RawTelemetryFunction")]
[StorageAccount("DeadLetterStorage")]
public static async Task RunAsync(
    [EventHubTrigger("%EventHubName%", Connection = "EventHubConnection", ConsumerGroup = "%EventHubConsumerGroup%")] EventData[] messages,
    [Queue("deadletterqueue")] IAsyncCollector<DeadLetterMessage> deadLetterMessages,
    ILogger logger)
{
    foreach (var message in messages)
    {
        DeviceState deviceState = null;

        try
        {
            deviceState = telemetryProcessor.Deserialize(message.Body.Array, logger);
        }
        catch (Exception ex)
        {
            logger.LogError(ex, "Error deserializing message", message.SystemProperties.PartitionKey,
                message.SystemProperties.SequenceNumber);
            await deadLetterMessages.AddAsync(new DeadLetterMessage { Issue = ex.Message, EventData = message });
        }

        try
        {
            await stateChangeProcessor.UpdateState(deviceState, logger);
        }
        catch (Exception ex)
        {
            logger.LogError(ex, "Error updating status document", deviceState);
            await deadLetterMessages.AddAsync(new DeadLetterMessage { Issue = ex.Message, EventData = message,
                DeviceState = deviceState });
        }
    }
}

Notice that the function uses the Queue storage output binding to put items in the queue.
The code shown above also logs exceptions to Application Insights. You can use the partition key and sequence
number to correlate dead letter messages with the exceptions in the logs.
Messages in the dead letter queue should have enough information so that you can understand the context of
error. In this example, the DeadLetterMessage class contains the exception message, the original event data, and the
deserialized event message (if available).

public class DeadLetterMessage
{
    public string Issue { get; set; }
    public EventData EventData { get; set; }
    public DeviceState DeviceState { get; set; }
}

Use Azure Monitor to monitor the event hub. If you see there is input but no output, it means that messages are
not being processed. In that case, go into Log Analytics and look for exceptions or other errors.

Disaster recovery considerations


The deployment shown here resides in a single Azure region. For a more resilient approach to disaster-recovery,
take advantage of geo-distribution features in the various services:
Event Hubs. Create two Event Hubs namespaces, a primary (active) namespace and a secondary (passive)
namespace. Messages are automatically routed to the active namespace unless you fail over to the
secondary namespace. For more information, see Azure Event Hubs Geo-disaster recovery.
Function App. Deploy a second function app that is waiting to read from the secondary Event Hubs
namespace. This function writes to a secondary storage account for the dead letter queue.
Cosmos DB. Cosmos DB supports multiple write regions, which enables writes to any region that you add
to your Cosmos DB account. If you don’t enable multi-write, you can still fail over the primary write region.
The Cosmos DB client SDKs and the Azure Function bindings automatically handle the failover, so you don’t
need to update any application configuration settings.
Azure Storage. Use RA-GRS storage for the dead letter queue. This creates a read-only replica in another
region. If the primary region becomes unavailable, you can read the items currently in the queue. In
addition, provision another storage account in the secondary region that the function can write to after a
fail-over.

Cost considerations
Use the Azure Pricing calculator to estimate costs. Here are some other considerations.
Azure Functions
Azure Functions supports two hosting models.
Consumption plan. Compute power is automatically allocated when your code is running.
App Service plan. A set of VMs is allocated for your code. The App Service plan defines the number of VMs and the VM size.
In this architecture, each event that arrives on Event Hubs triggers a function that processes that event. From a
cost perspective, the recommendation is to use the Consumption plan, because you pay only for the compute
resources you use.
Azure Cosmos DB
Azure Cosmos DB bills for provisioned throughput and consumed storage by the hour. Provisioned throughput is
expressed in Request Units per second (RU/s), which can be used for typical database operations, such as inserts and
reads. The price is based on the capacity in RU/s that you reserve. You also have to reserve a minimum of 400
RU/s per container, and a read of a 1-KB document consumes 1 RU. If your app does not need to be this
intensive, consider using a single container, because each container has a fixed minimum cost.
In this reference architecture, the function stores exactly one document per device that is sending data. The
function continually updates the documents with latest device status, using an upsert operation, which is cost
effective in terms of consumed storage. For more information, see Cosmos DB pricing model.
Storage is billed for each GB used for your stored data and index.
Use the Cosmos DB capacity calculator to get a quick estimate of the workload cost.
For more information, see the Cost section in Microsoft Azure Well-Architected Framework.

DevOps considerations
Use Infrastructure as code (IaC) when possible. IaC manages the infrastructure, application, and storage resources
with a declarative approach like Azure Resource Manager. That will help in automating deployment using DevOps
as a continuous integration and continuous delivery (CI/CD) solution. Templates should be versioned and included
as part of the release pipeline.
When creating templates, group resources as a way to organize and isolate them per workload. A common way to
think about a workload is a single serverless application or a virtual network. The goal of workload isolation is to
associate the resources with a team, so that the DevOps team can independently manage all aspects of those
resources and perform CI/CD.
resources and perform CI/CD.
This architecture includes steps to configure the Drone Status Function App using Azure Pipelines with YAML and
Azure Functions Slots.
As you deploy your services you will need to monitor them. Consider using Application Insights to enable the
developers to monitor performance and detect issues.
For more information, see the DevOps section in Microsoft Azure Well-Architected Framework.

Deploy the solution


To deploy this reference architecture, view the GitHub readme.

Next steps
To learn more about the reference implementation, read Code walkthrough: Serverless application with Azure
Functions.
Unified logging for microservices apps
11/2/2020 • 9 minutes to read

Logging uses discrete event messages to track and report application data in a centralized way. Log events provide
an overview of application execution state, track code errors or application failures, and deliver informational
messages. Automation can read event logs and notify relevant parties if events meet a criterion or threshold.
Tracing focuses on the continuous flow of an application. Tracing follows program execution through various
methods and services from beginning to end, while understanding data state and transitions.
Monitoring applies application instrumentation to both tracing and logging data to provide metrics that teams can
use to make informed decisions. These metrics can aggregate log or trace data in a dashboard that gives a holistic
view of application health, from utilization to error count.
Whenever an application fails, teams need to know:
Why did the application fail?
When did the application exception occur?
Which method caused the exception?
What were the events recorded up to the point of application failure?
Did this exception lead to potential data corruption?
Logging, tracing, and monitoring can provide the answers to these questions, as well as monitor application usage
and performance.
For traditional on-premises or monolithic applications, logging, tracing, and monitoring happen within a single
process domain. The application consists of a single executable and its dependencies. The executable runs under
the single process space on a single virtual machine (VM), or across multiple VMs to increase performance.
Cloud-native development differs from on-premises methodology. Many cloud-native applications consist of
platform-as-a-service (PaaS) services built around the microservices architectural paradigm. Microservices
architectures involve discrete, loosely coupled services that work within their own process boundaries.
Microservices architectures:
Consist of discrete services that are easier to build and simpler to maintain.
Focus on business capabilities.
Work well with automated continuous integration (CI) and continuous delivery (CD) systems.
Are more fault-tolerant, because a single service failure doesn't bring down the application.
Allow services to scale independently of each other to provide better utilization and cost optimization.
Logging and tracing for cloud-native distributed applications can be complicated. A single application request can
interact with many microservices. Each microservice generates its own logging, and determining the application
execution process flow can be difficult. Because microservices can handle hundreds of requests concurrently,
wading through logs to manually determine event correlations can be a laborious task.
Several Azure services and strategies can help automate and manage effective logging, tracing, and monitoring for
microservices applications. With unified logging, development and operations teams can gain deep insights from
within as well as from outside application domains. Unified logging and monitoring help ensure that deployed
cloud-native applications remain reliable, scalable, redundant, resilient, and secure.

Architecture
The following architecture uses Azure services to build a unified logging and monitoring system.

1. The application emits events from both the API and the user interface to Event Hubs and Application Insights.
2. Application Insights queries short-term logging, tracing, and monitoring data.
3. Stream Analytics jobs can use the Event Hubs data to trigger Logic Apps workflows.
4. A Logic Apps job calls the representational state transfer (REST) endpoint of an Information Technology Service
Management (ITSM) system, and sends notifications to the development team.
5. Azure Sentinel automation uses Playbooks powered by Azure Logic Apps to generate security alerts.
6. Keeping event logs in long-term storage allows later analysis and diagnostics with Log Analytics.
For applications that use Azure VMs, the following infrastructure-as-a-service (IaaS) architecture includes Azure
Monitor to monitor the performance and health of the VMs that run the application.
Components
When building a cloud-native distributed microservices architecture, teams can use the following Azure services to
build an effective logging, tracing, and monitoring solution. Teams can view near real-time metrics and gain insight
into application health through Application Insights.
Event Hubs
Azure Event Hubs is a fully managed, real-time ingestion service that can stream large volumes of events per
second. As such, Event Hubs is the perfect candidate for a central log ingestion pipeline. Event Hubs can send event
messages to Azure Data Lake or Azure Blob Storage to use for later analysis.
Blob Storage
Azure Blob Storage is a scalable cloud storage layer for storing unstructured text or binary object data. Event Hubs
can store events as Avro files within Blob Storage, and Azure Log Analytics can then query the log data for insights.
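As a sketch of how this capture path can be wired up, the following Resource Manager resource enables Event Hubs Capture to a Blob Storage container in Avro format. The namespace, event hub, storage account, and container names are placeholders, and the API version and interval settings are assumptions rather than recommendations.

{
  "type": "Microsoft.EventHub/namespaces/eventhubs",
  "apiVersion": "2017-04-01",
  "name": "logging-ns/application-logs",
  "properties": {
    "partitionCount": 4,
    "messageRetentionInDays": 7,
    "captureDescription": {
      // Capture writes Avro files to the configured container on an interval or size threshold.
      "enabled": true,
      "encoding": "Avro",
      "intervalInSeconds": 300,
      "sizeLimitInBytes": 314572800,
      "destination": {
        "name": "EventHubArchive.AzureBlockBlob",
        "properties": {
          "storageAccountResourceId": "[resourceId('Microsoft.Storage/storageAccounts', 'logarchivestorage')]",
          "blobContainer": "eventlogs",
          "archiveNameFormat": "{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}"
        }
      }
    }
  }
}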
Azure Data Lake Storage
Azure Data Lake Storage is built on Azure Blob Storage. Data Lake Storage is a scalable cloud storage repository
that can store data in any format for long periods of time. Developers can then query the objects stored within the
Data Lake for investigation.
Azure Stream Analytics
Azure Stream Analytics is a real-time, serverless analytics engine designed for mission-critical workloads. Stream
Analytics can process event messages and flag those that meet critical indicators.
Logic Apps
Azure Logic Apps is a serverless cloud service that lets developers schedule and orchestrate common task
workflows with a visual designer. The power of Logic Apps lies in its connectors, which integrate with a wide range
of first-party and third-party services with little or no code.
Azure Monitor
Azure Monitor is a service that maximizes availability and performance by collecting, analyzing, and acting on
telemetry from cloud-native applications. With Azure Monitor, teams can create operational dashboards that detect
issues and alert teams of critical situations.
If the team uses an ITSM system, Logic Apps can call the REST endpoint of the system and create an issue with the
appropriate severity level. This process provides quick notification to all relevant teams and ensures that triaging is
immediate and useful. For more information, see Stream Analytics and Azure Logic Apps.
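For illustration only, the following workflow definition sketches a Logic Apps workflow that accepts an alert payload over HTTP and posts an incident to a hypothetical ITSM REST endpoint. The endpoint URL, payload fields, and severity value are placeholders, not a specific ITSM product's API.

{
  "$schema": "https://schema.management.azure.com/providers/Microsoft.Logic/schemas/2016-06-01/workflowdefinition.json#",
  "contentVersion": "1.0.0.0",
  "triggers": {
    "manual": {
      "type": "Request",
      "kind": "Http",
      "inputs": { "schema": {} }
    }
  },
  "actions": {
    "Create_ITSM_incident": {
      "type": "Http",
      "runAfter": {},
      "inputs": {
        "method": "POST",
        "uri": "https://itsm.example.com/api/incidents",
        "headers": { "Content-Type": "application/json" },
        "body": {
          "title": "@{triggerBody()?['alertName']}",
          "severity": "2",
          "description": "@{triggerBody()?['description']}"
        }
      }
    }
  },
  "outputs": {}
}

A connector for the team's specific ITSM product can replace the generic HTTP action when one is available.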
Application Insights
Application Insights, a feature of Azure Monitor, is an extensible Application Performance Management (APM)
service for developers and DevOps teams. Application Insights can monitor live services, detect anomalies in
performance and analytics tools, diagnose and trace problems, and query log data. You can use Application Insights
to do distributed tracing through the Application Insights SDK. You can also use Application Insights from within
Visual Studio.
Azure Sentinel
Azure Sentinel is a Security Information and Event Management (SIEM) and Security Orchestration, Automation,
and Response (SOAR) service. Sentinel provides a unified overview of the cloud estate through native integration
of Azure services. Sentinel can collect information from the cloud as well as from downstream dependent systems
in customers' data centers.
Azure Sentinel provides a dashboard view of the current security posture and gives administrators a global view
of potentially malicious events, such as failed logins, suspicious credentials, and the connections related to
these events. Site reliability engineering (SRE) teams can use Log Analytics to query the data. You can also
designate automation to trigger when Sentinel rules generate security alerts. Automation in Azure Sentinel uses
Playbooks powered by Azure Logic Apps. For more information, see Tutorial: Investigate incidents with Azure
Sentinel.

Issues and considerations


The following practices help microservices architectures perform unified logging in a cloud-native environment:
All application-generated requests should have a unique identifier, usually called a correlation ID, that is passed
through each microservice. Each microservice accepts the correlation ID as part of the request, and all logs
emitted by the microservice contain the correlation ID, as shown in the example after this list.
When the application processes the request, it returns the correlation ID as part of the response. The application
then uses the correlation ID when emitting its own logs.
All logs, except audit logs, should be emitted to the same hub and stored in a central repository. If audit logs are
required for security or compliance, it's best to store them in a separate data store. Make sure that recorded and
stored information meets regulatory guidelines and doesn't contain any personally identifiable information.
Log data should use JSON format.
Logging should be asynchronous. Asynchronous logging helps to reduce overhead by delegating the call to a
background task. The application doesn't need to await the results of the operation and can continue the logical
program flow.
Logging should use a logging framework if possible. Don't expend engineering effort creating a logging system
unless there's a clear business need. Serilog is a popular open-source logging framework that provides support
for the Azure ecosystem through community supported extensions.
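To illustrate the correlation ID practice from the start of this list, the following two log entries are a sketch of what a single request might produce as it flows through two services. The service names and fields are illustrative; the shared CorrelationId value is what lets teams stitch the entries into a single narrative.

[
  {
    "CorrelationId": "0f8fad5b-d9cb-469f-a165-70867728950e",
    "Service": "OrdersApi",
    "EventTimeUTC": "2020-04-27T13:18:59Z",
    "Severity": "Information",
    "Message": "Order accepted, calling PaymentsService"
  },
  {
    "CorrelationId": "0f8fad5b-d9cb-469f-a165-70867728950e",
    "Service": "PaymentsService",
    "EventTimeUTC": "2020-04-27T13:19:02Z",
    "Severity": "Error",
    "Message": "Payment authorization failed"
  }
]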
Logging levels
During application development and after deployment, make sure to use the common logging levels appropriately:
Trace is the most detailed and most verbose log level. The Trace level outputs detailed, possibly sensitive data
about application state during program execution. Tracing can affect application performance, so don't use it in
pre-production or production environments. Enable this level sparingly and locally, and only during
development.
Debug is for debugging code locally during software development. This level tends to have no long-term value.
Information level describes the general application flow when users are interacting with the system in
production.
Warning level in production indicates an event that could become problematic. Warnings tell automated
alerting tools to pick up the event and notify the relevant teams to investigate.
Error level in production records errors within the application, including logic errors.
Critical/Fatal level means the application is unable to fulfill the request and has caused an unrecoverable error.
Synthetic logging
Most application traffic is variable, and so are the generated logs. Applications that don't generate high traffic may
not generate enough logs to test and diagnose issues. Synthetic logging uses automation to emulate user behavior
and generate logs for the application's monitoring systems.
Using an automation tool like Selenium, developers can create a test suite of user interactions. Microservices and
API-first services can use Apache JMeter to test the functional behavior of a service and understand performance.
For testing and development, Postman can test APIs locally and integrate with the CI/CD pipeline to automate API
testing.
Synthetic transactions can simulate user behavior in your application or augment established traffic patterns. Tests
can run on a schedule or ad hoc.
Synthetic logging is a valuable tool for identifying problems and analyzing telemetry data. Synthetic logging helps
development and operations teams ascertain that critical application processes are behaving as expected, and can
also ensure that apps meet the non-functional requirements of availability, performance, response time, and
resiliency.
Structured logging
A key aspect of logging is the structure of the log itself. Log data is essentially unstructured, so it can be hard to
query for specific events, implement automated alerting, or correlate related events.
A structured format makes event data readable and able to be parsed by automated systems. JSON is the data
interchange format used by most web services today, and its familiar schema is well suited for structured logging.
When defining log structure, you can add context to every request with the following objects:
Correlation ID for the request. The ID chains related log events together to provide a narrative for events and
help establish where issues occur in distributed systems. The correlation ID should be a globally unique value.
Date and time in UTC
Service name
HTTP status codes
Browser type
Event severity
Pertinent information from the request type that can be used to help diagnose problems
The following code shows examples of the structured logging objects:
{
  "CorrelationId": "715eec8f-fefc-45e2-a352-95aa389ddb8f",
  "Environment": "Live",
  "StatusCode": 500,
  "Severity": "Error",
  "Application": "Contoso Web Shop",
  "Service": "PaymentsService",
  "EventTimeUTC": "2020-04-27T13:19Z",
  "BrowserType": "Chromium",
  "Data": {
    "Runtime": "Net Core",
    "Message": "System.NullReferenceException: Object reference not set to an instance of an object.",
    "Method": "PaymentProcessor"
  }
}

Incorporating the preceding changes into a distributed application allows team members to retrieve logs from the
complete lifecycle of the request by using the correlation ID. Structured logging makes it easier to search through
logs when issues occur, as well as allowing automated alerting.
DevOps Checklist

DevOps is the integration of development, quality assurance, and IT operations into a unified culture and set of
processes for delivering software. Use this checklist as a starting point to assess your DevOps culture and process.

Culture
Ensure business alignment across organizations and teams. Conflicts over resources, purpose, goals, and
priorities within an organization can be a risk to successful operations. Ensure that the business, development, and
operations teams are all aligned.
Ensure the entire team understands the software lifecycle. Your team needs to understand the overall
lifecycle of the application, and which part of the lifecycle the application is currently in. This helps all team
members know what they should be doing now, and what they should be planning and preparing for in the future.
Reduce cycle time. Aim to minimize the time it takes to move from ideas to usable developed software. Limit the
size and scope of individual releases to keep the test burden low. Automate the build, test, configuration, and
deployment processes whenever possible. Clear any obstacles to communication among developers, and between
developers and operations.
Review and improve processes. Your processes and procedures, both automated and manual, are never final.
Set up regular reviews of current workflows, procedures, and documentation, with a goal of continual
improvement.
Do proactive planning. Proactively plan for failure. Have processes in place to quickly identify issues when they
occur, escalate to the correct team members to fix, and confirm resolution.
Learn from failures. Failures are inevitable, but it's important to learn from failures to avoid repeating them. If
an operational failure occurs, triage the issue, document the cause and solution, and share any lessons that were
learned. Whenever possible, update your build processes to automatically detect that kind of failure in the future.
Optimize for speed and collect data. Every planned improvement is a hypothesis. Work in the smallest
increments possible. Treat new ideas as experiments. Instrument the experiments so that you can collect
production data to assess their effectiveness. Be prepared to fail fast if the hypothesis is wrong.
Allow time for learning. Both failures and successes provide good opportunities for learning. Before moving on
to new projects, allow enough time to gather the important lessons, and make sure those lessons are absorbed by
your team. Also give the team the time to build skills, experiment, and learn about new tools and techniques.
Document operations. Document all tools, processes, and automated tasks with the same level of quality as
your product code. Document the current design and architecture of any systems you support, along with
recovery processes and other maintenance procedures. Focus on the steps you actually perform, not theoretically
optimal processes. Regularly review and update the documentation. For code, make sure that meaningful
comments are included, especially in public APIs, and use tools to automatically generate code documentation
whenever possible.
Share knowledge. Documentation is only useful if people know that it exists and can find it. Ensure the
documentation is organized and easily discoverable. Be creative: Use brown bags (informal presentations), videos,
or newsletters to share knowledge.

Development
Provide developers with production-like environments. If development and test environments don't match
the production environment, it is hard to test and diagnose problems. Therefore, keep development and test
environments as close to the production environment as possible. Make sure that test data is consistent with the
data used in production, even if it's sample data and not real production data (for privacy or compliance reasons).
Plan to generate and anonymize sample test data.
Ensure that all authorized team members can provision infrastructure and deploy the application.
Setting up production-like resources and deploying the application should not involve complicated manual tasks
or detailed technical knowledge of the system. Anyone with the right permissions should be able to create or
deploy production-like resources without going to the operations team.

This recommendation doesn't imply that anyone can push live updates to the production deployment. It's
about reducing friction for the development and QA teams to create production-like environments.

Instrument the application for insight. To understand the health of your application, you need to know how
it's performing and whether it's experiencing any errors or problems. Always include instrumentation as a design
requirement, and build the instrumentation into the application from the start. Instrumentation must include event
logging for root cause analysis, but also telemetry and metrics to monitor the overall health and usage of the
application.
Track your technical debt. In many projects, release schedules can get prioritized over code quality to one
degree or another. Always keep track when this occurs. Document any shortcuts or other suboptimal
implementations, and schedule time in the future to revisit these issues.
Consider pushing updates directly to production. To reduce the overall release cycle time, consider pushing
properly tested code commits directly to production. Use feature toggles to control which features are enabled.
This allows you to move from development to release quickly, using the toggles to enable or disable features.
Toggles are also useful when performing tests such as canary releases, where a particular feature is deployed to a
subset of the production environment.
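A feature toggle can be as simple as a configuration value the application reads at runtime. The following snippet is a minimal sketch that assumes the .NET feature-management convention of a FeatureManagement configuration section; the flag names are placeholders.

{
  "FeatureManagement": {
    "NewCheckoutFlow": false,
    "BetaRecommendations": true
  }
}

Flipping a value from false to true exposes the feature without redeploying code, which is what makes toggles useful for canary-style rollouts.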

Testing
Automate testing. Manually testing software is tedious and susceptible to error. Automate common testing tasks
and integrate the tests into your build processes. Automated testing ensures consistent test coverage and
reproducibility. Integrated UI tests should also be performed by an automated tool. Azure offers development and
test resources that can help you configure and execute testing. For more information, see Development and test.
Test for failures. If a system can't connect to a service, how does it respond? Can it recover once the service is
available again? Make fault injection testing a standard part of review on test and staging environments. When
your test process and practices are mature, consider running these tests in production.
Test in production. The release process doesn't end with deployment to production. Have tests in place to ensure
that deployed code works as expected. For deployments that are infrequently updated, schedule production testing
as a regular part of maintenance.
Automate performance testing to identify performance issues early. The impact of a serious performance
issue can be as severe as a bug in the code. While automated functional tests can prevent application bugs, they
might not detect performance problems. Define acceptable performance goals for metrics like latency, load times,
and resource usage. Include automated performance tests in your release pipeline, to make sure the application
meets those goals.
Perform capacity testing. An application might work fine under test conditions, and then have problems in
production due to scale or resource limitations. Always define the maximum expected capacity and usage limits.
Test to make sure the application can handle those limits, but also test what happens when those limits are
exceeded. Capacity testing should be performed at regular intervals.
After the initial release, you should run performance and capacity tests whenever updates are made to production
code. Use historical data to fine-tune tests and to determine what types of tests need to be performed.
Perform automated security penetration testing. Ensuring your application is secure is as important as
testing any other functionality. Make automated penetration testing a standard part of the build and deployment
process. Schedule regular security tests and vulnerability scanning on deployed applications, monitoring for open
ports, endpoints, and attacks. Automated testing does not remove the need for in-depth security reviews at regular
intervals.
Perform automated business continuity testing. Develop tests for large-scale business continuity, including
backup recovery and failover. Set up automated processes to perform these tests regularly.

Release
Automate deployments. Automate deploying the application to test, staging, and production environments.
Automation enables faster and more reliable deployments, and ensures consistent deployments to any supported
environment. It removes the risk of human error caused by manual deployments. It also makes it easy to schedule
releases for convenient times, to minimize any effects of potential downtime. Have systems in place to detect any
problems during rollout, and have an automated way to roll forward fixes or roll back changes.
Use continuous integration. Continuous integration (CI) is the practice of merging all developer code into a
central codebase on a regular schedule, and then automatically performing standard build and test processes. CI
ensures that an entire team can work on a codebase at the same time without having conflicts. It also ensures that
code defects are found as early as possible. Preferably, the CI process should run every time that code is
committed or checked in. At the very least, it should run once per day.

Consider adopting a trunk-based development model. In this model, developers commit to a single branch (the
trunk). There is a requirement that commits never break the build. This model facilitates CI, because all feature
work is done in the trunk, and any merge conflicts are resolved when the commit happens.

Consider using continuous delivery. Continuous delivery (CD) is the practice of ensuring that code is always
ready to deploy, by automatically building, testing, and deploying code to production-like environments. Adding
continuous delivery to create a full CI/CD pipeline will help you detect code defects as soon as possible, and
ensures that properly tested updates can be released in a very short time.

Continuous deployment is an additional process that automatically takes any updates that have passed
through the CI/CD pipeline and deploys them into production. Continuous deployment requires robust
automatic testing and advanced process planning, and may not be appropriate for all teams.

Make small incremental changes. Large code changes have a greater potential to introduce bugs. Whenever
possible, keep changes small. This limits the potential effects of each change, and makes it easier to understand
and debug any issues.
Control exposure to changes. Make sure you're in control of when updates are visible to your end users.
Consider using feature toggles to control when features are enabled for end users.
Implement release management strategies to reduce deployment risk. Deploying an application update
to production always entails some risk. To minimize this risk, use strategies such as canary releases or blue-green
deployments to deploy updates to a subset of users. Confirm the update works as expected, and then roll the
update out to the rest of the system.
Document all changes. Minor updates and configuration changes can be a source of confusion and versioning
conflict. Always keep a clear record of any changes, no matter how small. Log everything that changes, including
patches applied, policy changes, and configuration changes. (Don't include sensitive data in these logs. For
example, log that a credential was updated, and who made the change, but don't record the updated credentials.)
The record of the changes should be visible to the entire team.
Consider making infrastructure immutable. Immutable infrastructure is the principle that you shouldn't
modify infrastructure after it's deployed to production. Otherwise, you can get into a state where ad hoc changes
have been applied, making it hard to know exactly what changed. Immutable infrastructure works by replacing
entire servers as part of any new deployment. This allows the code and the hosting environment to be tested and
deployed as a block. Once deployed, infrastructure components aren't modified until the next build and deploy
cycle.

Monitoring
Make systems observable. The operations team should always have clear visibility into the health and status of
a system or service. Set up external health endpoints to monitor status, and ensure that applications are coded to
instrument the operations metrics. Use a common and consistent schema that helps you correlate events across
systems. Azure Diagnostics and Application Insights are standard methods for tracking the health and status of
Azure resources. Microsoft Operations Management Suite also provides centralized monitoring and management
for cloud or hybrid solutions.
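For example, a health endpoint might return a small, consistently shaped JSON document that monitoring tools can poll; the field names below are illustrative rather than a prescribed schema.

{
  "status": "Healthy",
  "version": "1.4.2",
  "timestampUtc": "2020-11-02T09:30:00Z",
  "dependencies": [
    { "name": "sql-database", "status": "Healthy" },
    { "name": "payment-gateway", "status": "Degraded", "details": "Elevated latency" }
  ]
}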
Aggregate and correlate logs and metrics. A properly instrumented telemetry system will provide a large
amount of raw performance data and event logs. Make sure that telemetry and log data is processed and
correlated in a short period of time, so that operations staff always have an up-to-date picture of system health.
Organize and display data in ways that give a cohesive view of any issues, so that whenever possible it's clear
when events are related to one another.

Consult your corporate retention policy for requirements on how data is processed and how long it should be
stored.

Implement automated alerts and notifications. Set up monitoring tools like Azure Monitor to detect patterns
or conditions that indicate potential or current issues, and send alerts to the team members who can address the
issues. Tune the alerts to avoid false positives.
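As a sketch, the following Resource Manager resource defines a metric alert that fires when average CPU on a VM exceeds a threshold and notifies an action group. The alert name, scope, threshold, and action group are placeholders, and the API version is one commonly used version rather than a requirement.

{
  "type": "Microsoft.Insights/metricAlerts",
  "apiVersion": "2018-03-01",
  "name": "high-cpu-alert",
  "location": "global",
  "properties": {
    "severity": 2,
    "enabled": true,
    "scopes": [ "[resourceId('Microsoft.Compute/virtualMachines', 'web-vm-01')]" ],
    "evaluationFrequency": "PT5M",
    "windowSize": "PT15M",
    "criteria": {
      "odata.type": "Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria",
      "allOf": [
        {
          "criterionType": "StaticThresholdCriterion",
          "name": "HighCpu",
          "metricName": "Percentage CPU",
          "timeAggregation": "Average",
          "operator": "GreaterThan",
          "threshold": 85
        }
      ]
    },
    "actions": [
      { "actionGroupId": "[resourceId('Microsoft.Insights/actionGroups', 'ops-team')]" }
    ]
  }
}

Tuning the threshold, window size, and evaluation frequency is how you keep false positives down.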
Monitor assets and resources for expirations. Some resources and assets, such as certificates, expire after a
given amount of time. Make sure to track which assets expire, when they expire, and what services or features
depend on them. Use automated processes to monitor these assets. Notify the operations team before an asset
expires, and escalate if expiration threatens to disrupt the application.

Management
Automate operations tasks. Manually handling repetitive operations processes is error-prone. Automate these
tasks whenever possible to ensure consistent execution and quality. Code that implements the automation should
be versioned in source control. As with any other code, automation tools must be tested.
Take an infrastructure-as-code approach to provisioning. Minimize the amount of manual configuration
needed to provision resources. Instead, use scripts and Azure Resource Manager templates. Keep the scripts and
templates in source control, like any other code you maintain.
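As a minimal sketch, an entire resource can be described declaratively and versioned alongside application code. The storage account template below uses placeholder values; the SKU and API version are assumptions, not recommendations.

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "storageAccountName": { "type": "string" }
  },
  "resources": [
    {
      // A single declarative resource; deploying the same file to every environment keeps them consistent.
      "type": "Microsoft.Storage/storageAccounts",
      "apiVersion": "2019-06-01",
      "name": "[parameters('storageAccountName')]",
      "location": "[resourceGroup().location]",
      "sku": { "name": "Standard_LRS" },
      "kind": "StorageV2",
      "properties": {}
    }
  ],
  "outputs": {}
}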
Consider using containers. Containers provide a standard package-based interface for deploying applications.
Using containers, an application is deployed using self-contained packages that include any software,
dependencies, and files needed to run the application, which greatly simplifies the deployment process.
Containers also create an abstraction layer between the application and the underlying operating system, which
provides consistency across environments. This abstraction can also isolate a container from other processes or
applications running on a host.
Implement resiliency and self-healing. Resiliency is the ability of an application to recover from failures.
Strategies for resiliency include retrying transient failures, and failing over to a secondary instance or even another
region. For more information, see Designing reliable Azure applications. Instrument your applications so that
issues are reported immediately and you can manage outages or other system failures.
Have an operations manual. An operations manual or runbook documents the procedures and management
information needed for operations staff to maintain a system. Also document any operations scenarios and
mitigation plans that might come into play during a failure or other disruption to your service. Create this
documentation during the development process, and keep it up to date afterwards. This is a living document, and
should be reviewed, tested, and improved regularly.
Shared documentation is critical. Encourage team members to contribute and share knowledge. The entire team
should have access to documents. Make it easy for anyone on the team to help keep documents updated.
Document on-call procedures. Make sure on-call duties, schedules, and procedures are documented and
shared to all team members. Keep this information up-to-date at all times.
Document escalation procedures for third-party dependencies. If your application depends on external
third-party services that you don't directly control, you must have a plan to deal with outages. Create
documentation for your planned mitigation processes. Include support contacts and escalation paths.
Use configuration management. Configuration changes should be planned, visible to operations, and
recorded. This could take the form of a configuration management database, or a configuration-as-code approach.
Configuration should be audited regularly to ensure that what's expected is actually in place.
Get an Azure support plan and understand the process. Azure offers a number of support plans. Determine
the right plan for your needs, and make sure the entire team knows how to use it. Team members should
understand the details of the plan, how the support process works, and how to open a support ticket with Azure. If
you are anticipating a high-scale event, Azure support can assist you with increasing your service limits. For more
information, see the Azure Support FAQs.
Follow least-privilege principles when granting access to resources. Carefully manage access to
resources. Access should be denied by default, unless a user is explicitly given access to a resource. Only grant a
user access to what they need to complete their tasks. Track user permissions and perform regular security audits.
Use role-based access control. Assigning user accounts and access to resources should not be a manual
process. Use role-based access control (RBAC) to grant access based on Azure Active Directory identities and groups.
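Role assignments themselves can be declared as code rather than granted by hand. The following Resource Manager sketch assigns the built-in Reader role to a principal supplied as a parameter; the principal ID parameter and the API version are assumptions, and the GUID shown is the well-known Reader role definition ID.

{
  "type": "Microsoft.Authorization/roleAssignments",
  "apiVersion": "2020-04-01-preview",
  // The assignment name must be a GUID; deriving it from the scope and principal keeps deployments idempotent.
  "name": "[guid(resourceGroup().id, parameters('principalId'), 'reader')]",
  "properties": {
    "roleDefinitionId": "[subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 'acdd72a7-3385-48ef-bd42-f606fba81ae7')]",
    "principalId": "[parameters('principalId')]"
  }
}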
Use a bug tracking system to track issues. Without a good way to track issues, it's easy to miss items,
duplicate work, or introduce additional problems. Don't rely on informal person-to-person communication to track
the status of bugs. Use a bug tracking tool to record details about problems, assign resources to address them, and
provide an audit trail of progress and status.
Manage all resources in a change management system. All aspects of your DevOps process should be
included in a management and versioning system, so that changes can be easily tracked and audited. This includes
code, infrastructure, configuration, documentation, and scripts. Treat all these types of resources as code
throughout the test/build/review process.
Use checklists. Create operations checklists to ensure processes are followed. It's common to miss something in
a large manual, and following a checklist can force attention to details that might otherwise be overlooked.
Maintain the checklists, and continually look for ways to automate tasks and streamline processes.
For more about DevOps, see What is DevOps? on the Visual Studio site.
Extend Azure Resource Manager template
functionality

In 2016, the Microsoft patterns & practices team created a set of Azure Resource Manager template building blocks
with the goal of simplifying resource deployment. Each building block contains a set of prebuilt templates that
deploy sets of resources specified by separate parameter files.
The building block templates are designed to be combined together to create larger and more complex
deployments. For example, deploying a virtual machine in Azure requires a virtual network, storage accounts, and
other resources. The virtual network building block template deploys a virtual network and subnets. The virtual
machine building block template deploys storage accounts, network interfaces, and the actual VMs. You can then
create a script or template to call both building block templates with their corresponding parameter files to deploy
a complete architecture with one operation.
While developing the building block templates, p&p designed several concepts to extend Azure Resource Manager
template functionality. In this series, we will describe several of these concepts so you can use them in your own
templates.

NOTE
These articles assume you have an advanced understanding of Azure Resource Manager templates.
Update a resource in an Azure Resource Manager
template

There are some scenarios in which you need to update a resource during a deployment. You might encounter this
scenario when you cannot specify all the properties for a resource until other, dependent resources are created. For
example, if you create a backend pool for a load balancer, you might update the network interfaces (NICs) on your
virtual machines (VMs) to include them in the backend pool. And while Resource Manager supports updating
resources during deployment, you must design your template correctly to avoid errors and to ensure the
deployment is handled as an update.
First, you must reference the resource once in the template to create it and then reference the resource by the same
name to update it later. However, if two resources have the same name in a template, Resource Manager throws an
exception. To avoid this error, specify the updated resource in a second template that's either linked or included as a
subtemplate using the Microsoft.Resources/deployments resource type.
Second, you must either specify the name of the existing property to change or a new name for a property to add
in the nested template. You must also specify the original properties and their original values. If you fail to provide
the original properties and values, Resource Manager assumes you want to create a new resource and deletes the
original resource.

Example template
Let's look at an example template that demonstrates this. Our template deploys a virtual network named
firstVNet that has one subnet named firstSubnet . It then deploys a virtual network interface (NIC) named nic1
and associates it with our subnet. Then, a deployment resource named updateVNet includes a nested template that
updates our firstVNet resource by adding a second subnet named secondSubnet .

{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {},
"resources": [
{
"apiVersion": "2016-03-30",
"name": "firstVNet",
"location":"[resourceGroup().location]",
"type": "Microsoft.Network/virtualNetworks",
"properties": {
"addressSpace":{"addressPrefixes": [
"10.0.0.0/22"
]},
"subnets":[
{
"name":"firstSubnet",
"properties":{
"addressPrefix":"10.0.0.0/24"
}
}
]
}
},
{
"apiVersion": "2015-06-15",
"type":"Microsoft.Network/networkInterfaces",
"name":"nic1",
"location":"[resourceGroup().location]",
"dependsOn": [
"firstVNet"
],
"properties": {
"ipConfigurations":[
{
"name":"ipconfig1",
"properties": {
"privateIPAllocationMethod":"Dynamic",
"subnet": {
"id": "
[concat(resourceId('Microsoft.Network/virtualNetworks','firstVNet'),'/subnets/firstSubnet')]"
}
}
}
]
}
},
{
"apiVersion": "2015-01-01",
"type": "Microsoft.Resources/deployments",
"name": "updateVNet",
"dependsOn": [
"nic1"
],
"properties": {
"mode": "Incremental",
"parameters": {},
"template": {
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {},
"variables": {},
"resources": [
{
"apiVersion": "2016-03-30",
"name": "firstVNet",
"location":"[resourceGroup().location]",
"type": "Microsoft.Network/virtualNetworks",
"properties": {
"addressSpace": "[reference('firstVNet').addressSpace]",
"subnets":[
{
"name":"[reference('firstVNet').subnets[0].name]",
"properties":{
"addressPrefix":"
[reference('firstVNet').subnets[0].properties.addressPrefix]"
}
},
{
"name":"secondSubnet",
"properties":{
"addressPrefix":"10.0.1.0/24"
}
}
]
}
}
],
"outputs": {}
}
}
}
],
"outputs": {}
}
Let's take a look at the resource object for our firstVNet resource first. Notice that we specify the settings for our
firstVNet again in a nested template. This is because Resource Manager doesn't allow two resources with the same
name within the same template, and a nested template is considered a different template. By again specifying
our values for our firstSubnet resource, we are telling Resource Manager to update the existing resource instead
of deleting it and redeploying it. Finally, our new settings for secondSubnet are picked up during this update.

Try the template


An example template is available on GitHub. To deploy the template, run the following Azure CLI commands:

az group create --location <location> --name <resource-group-name>


az deployment group create -g <resource-group-name> \
--template-uri https://raw.githubusercontent.com/mspnp/template-examples/master/example1-update/deploy.json

Once deployment has finished, open the resource group you specified in the portal. You see a virtual network
named firstVNet and a NIC named nic1 . Click firstVNet , then click subnets . You see the firstSubnet that was
originally created, and you see the secondSubnet that was added in the updateVNet resource.

Then, go back to the resource group and click nic1 then click IP configurations . In the IP configurations
section, the subnet is set to firstSubnet (10.0.0.0/24) .

The original firstVNet has been updated instead of re-created. If firstVNet had been re-created, nic1 would not
be associated with firstVNet .

Next steps
Learn how to deploy a resource based on a condition, such as whether a parameter value is present. See
Conditionally deploy a resource in an Azure Resource Manager template.
Conditionally deploy a resource in an Azure Resource
Manager template

There are some scenarios in which you need to design your template to deploy a resource based on a condition,
such as whether or not a parameter value is present. For example, your template may deploy a virtual network and
include parameters to specify other virtual networks for peering. If you've not specified any parameter values for
peering, you don't want Resource Manager to deploy the peering resource.
To accomplish this, use the condition element in the resource to test the length of your parameter array. If the
length is zero, return false to prevent deployment, but for all values greater than zero return true to allow
deployment.

Example template
Let's look at an example template that demonstrates this. Our template uses the condition element to control
deployment of the Microsoft.Network/virtualNetworks/virtualNetworkPeerings resource. This resource creates a
peering between two Azure Virtual Networks in the same region.
Let's take a look at each section of the template.
The parameters element defines a single parameter named virtualNetworkPeerings :

{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"virtualNetworkPeerings": {
"type": "array",
"defaultValue": []
}
},

Our virtualNetworkPeerings parameter is an array and has the following schema:

"virtualNetworkPeerings": [
{
"name": "firstVNet/peering1",
"properties": {
"remoteVirtualNetwork": {
"id": "[resourceId('Microsoft.Network/virtualNetworks','secondVNet')]"
},
"allowForwardedTraffic": true,
"allowGatewayTransit": true,
"useRemoteGateways": false
}
}
]

The properties in our parameter specify the settings related to peering virtual networks. We'll provide the values
for these properties when we specify the Microsoft.Network/virtualNetworks/virtualNetworkPeerings resource in
the resources section:
"resources": [
{
"type": "Microsoft.Resources/deployments",
"apiVersion": "2017-05-10",
"name": "[concat('vnp-', copyIndex())]",
"condition": "[greater(length(parameters('virtualNetworkPeerings')), 0)]",
"dependsOn": [
"firstVNet", "secondVNet"
],
"copy": {
"name": "iterator",
"count": "[length(variables('peerings'))]",
"mode": "serial"
},
"properties": {
"mode": "Incremental",
"template": {
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
},
"variables": {
},
"resources": [
{
"type": "Microsoft.Network/virtualNetworks/virtualNetworkPeerings",
"apiVersion": "2016-06-01",
"location": "[resourceGroup().location]",
"name": "[variables('peerings')[copyIndex()].name]",
"properties": "[variables('peerings')[copyIndex()].properties]"
}
],
"outputs": {
}
}
}
}
]

There are a couple of things going on in this part of our template. First, the resource being deployed is an
inline template of type Microsoft.Resources/deployments that includes its own nested template, which actually
deploys the Microsoft.Network/virtualNetworks/virtualNetworkPeerings resource.

Our name for the inline template is made unique by concatenating the current iteration of the copyIndex() with
the prefix vnp- .
The condition element specifies that our resource should be processed when the greater() function evaluates to
true. Here, we're testing whether the length of the virtualNetworkPeerings parameter array is greater than zero. If it
is, the expression evaluates to true and the condition is satisfied. Otherwise, it evaluates to false.
Next, we specify our copy loop. It's a serial loop, which means the iterations run in sequence, with each resource
waiting until the previous resource has been deployed. The count property specifies the number of times the loop
iterates. Here, normally we'd set it to the length of the virtualNetworkPeerings array because it contains the
parameter objects specifying the resource we want to deploy. However, if we do that, validation will fail if the array
is empty because Resource Manager notices that we are attempting to access properties that do not exist. We can
work around this, however. Let's take a look at the variables we'll need:
"variables": {
"workaround": {
"true": "[parameters('virtualNetworkPeerings')]",
"false": [{
"name": "workaround",
"properties": {}
}]
},
"peerings": "[variables('workaround')[string(greater(length(parameters('virtualNetworkPeerings')), 0))]]"
},

Our workaround variable includes two properties, one named true and one named false . The true property
evaluates to the value of the virtualNetworkPeerings parameter array. The false property evaluates to an empty
object including the named properties that Resource Manager expects to see—note that false is actually an array,
just as our virtualNetworkPeerings parameter is, which will satisfy validation.
Our peerings variable uses our workaround variable by once again testing if the length of the
virtualNetworkPeerings parameter array is greater than zero. If it is, the string evaluates to true and the
workaround variable evaluates to the virtualNetworkPeerings parameter array. Otherwise, it evaluates to false
and the workaround variable evaluates to our empty object in the first element of the array.
Now that we've worked around the validation issue, we can simply specify the deployment of the
Microsoft.Network/virtualNetworks/virtualNetworkPeerings resource in the nested template, passing the name and
properties from our virtualNetworkPeerings parameter array. You can see this in the template element nested in
the properties element of our resource.

Try the template


An example template is available on GitHub. To deploy the template, run the following Azure CLI commands:

az group create --location <location> --name <resource-group-name>


az deployment group create -g <resource-group-name> \
--template-uri https://raw.githubusercontent.com/mspnp/template-examples/master/example2-conditional/deploy.json

Next steps
Use objects instead of scalar values as template parameters. See Use an object as a parameter in an Azure
Resource Manager template
Use an object as a parameter in an Azure Resource
Manager template

When you author Azure Resource Manager templates, you can either specify resource property values directly in
the template or define a parameter and provide values during deployment. It's fine to use a parameter for each
property value for small deployments, but there is a limit of 255 parameters per deployment. Once you get to
larger and more complex deployments you may run out of parameters.
One way to solve this problem is to use an object as a parameter instead of a value. To do this, define the
parameter in your template and specify a JSON object instead of a single value during deployment. Then, reference
the subproperties of the parameter by using the parameters() function and the dot operator in your template.
Let's take a look at an example that deploys a virtual network resource. First, let's specify a VNetSettings
parameter in our template and set the type to object :

...
"parameters": {
"VNetSettings":{"type":"object"}
},

Next, let's provide values for the VNetSettings object:

NOTE
To learn how to provide parameter values during deployment, see the parameters section of understand the structure and
syntax of Azure Resource Manager templates.

"parameters":{
"VNetSettings":{
"value":{
"name":"VNet1",
"addressPrefixes": [
{
"name": "firstPrefix",
"addressPrefix": "10.0.0.0/22"
}
],
"subnets":[
{
"name": "firstSubnet",
"addressPrefix": "10.0.0.0/24"
},
{
"name":"secondSubnet",
"addressPrefix":"10.0.1.0/24"
}
]
}
}
}

As you can see, our single parameter actually specifies three subproperties: name , addressPrefixes , and subnets .
Each of these subproperties either specifies a value or other subproperties. The result is that our single parameter
specifies all the values necessary to deploy our virtual network.
Now let's have a look at the rest of our template to see how the VNetSettings object is used:

...
"resources": [
{
"apiVersion": "2015-06-15",
"type": "Microsoft.Network/virtualNetworks",
"name": "[parameters('VNetSettings').name]",
"location":"[resourceGroup().location]",
"properties": {
"addressSpace":{
"addressPrefixes": [
"[parameters('VNetSettings').addressPrefixes[0].addressPrefix]"
]
},
"subnets":[
{
"name":"[parameters('VNetSettings').subnets[0].name]",
"properties": {
"addressPrefix": "[parameters('VNetSettings').subnets[0].addressPrefix]"
}
},
{
"name":"[parameters('VNetSettings').subnets[1].name]",
"properties": {
"addressPrefix": "[parameters('VNetSettings').subnets[1].addressPrefix]"
}
}
]
}
}
]

The values of our VNetSettings object are applied to the properties required by our virtual network resource
using the parameters() function with both the [] array indexer and the dot operator. This approach works if you
just want to statically apply the values of the parameter object to the resource. However, if you want to dynamically
assign an array of property values during deployment you can use a copy loop. To use a copy loop, you provide a
JSON array of resource property values and the copy loop dynamically applies the values to the resource's
properties.
There is one issue to be aware of if you use the dynamic approach. To demonstrate the issue, let's take a look at a
typical array of property values. In this example the values for our properties are stored in a variable. Notice we
have two arrays here—one named firstProperty and one named secondProperty .
"variables": {
"firstProperty": [
{
"name": "A",
"type": "typeA"
},
{
"name": "B",
"type": "typeB"
},
{
"name": "C",
"type": "typeC"
}
],
"secondProperty": [
"one","two", "three"
]
}

Now let's take a look at the way we access the properties in the variable using a copy loop.

{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
...
"copy": {
"name": "copyLoop1",
"count": "[length(variables('firstProperty'))]"
},
...
"properties": {
"name": { "value": "[variables('firstProperty')[copyIndex()].name]" },
"type": { "value": "[variables('firstProperty')[copyIndex()].type]" },
"number": { "value": "[variables('secondProperty')[copyIndex()]]" }
}
}

The copyIndex() function returns the current iteration of the copy loop, and we use that as an index into each of
the two arrays simultaneously.
This works fine when the two arrays are the same length. The issue arises if you've made a mistake and the two
arrays are different lengths—in this case your template will fail validation during deployment. You can avoid this
issue by including all your properties in a single object, because it is much easier to see when a value is missing.
For example, let's take a look at another parameter object in which each element of the propertyObject array is the
union of the firstProperty and secondProperty arrays from earlier.
"variables": {
"propertyObject": [
{
"name": "A",
"type": "typeA",
"number": "one"
},
{
"name": "B",
"type": "typeB",
"number": "two"
},
{
"name": "C",
"type": "typeC"
}
]
}

Notice the third element in the array? It's missing the number property, but it's much easier to notice that you've
missed it when you're authoring the parameter values this way.

Using a property object in a copy loop


This approach becomes even more useful when combined with the serial copy loop, particularly for deploying
child resources.
To demonstrate this, let's look at a template that deploys a network security group (NSG) with two security rules.
First, let's take a look at our parameters. When we look at our template we'll see that we've defined one parameter
named networkSecurityGroupsSettings that includes an array named securityRules . This array contains two JSON
objects that specify a number of settings for a security rule.
{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
"contentVersion": "1.0.0.0",
"parameters":{
"networkSecurityGroupsSettings": {
"value": {
"securityRules": [
{
"name": "RDPAllow",
"description": "allow RDP connections",
"direction": "Inbound",
"priority": 100,
"sourceAddressPrefix": "*",
"destinationAddressPrefix": "10.0.0.0/24",
"sourcePortRange": "*",
"destinationPortRange": "3389",
"access": "Allow",
"protocol": "Tcp"
},
{
"name": "HTTPAllow",
"description": "allow HTTP connections",
"direction": "Inbound",
"priority": 200,
"sourceAddressPrefix": "*",
"destinationAddressPrefix": "10.0.1.0/24",
"sourcePortRange": "*",
"destinationPortRange": "80",
"access": "Allow",
"protocol": "Tcp"
}
]
}
}
}
}

Now let's take a look at our template. Our first resource named NSG1 deploys the NSG. Our second resource
named loop-0 performs two functions: first, it dependsOn the NSG so its deployment doesn't begin until NSG1 is
completed, and it is the first iteration of the sequential loop. Our third resource is a nested template that deploys
our security rules using an object for its parameter values as in the last example.

{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"networkSecurityGroupsSettings": {"type":"object"}
},
"variables": {},
"resources": [
{
"apiVersion": "2015-06-15",
"type": "Microsoft.Network/networkSecurityGroups",
"name": "NSG1",
"location":"[resourceGroup().location]",
"properties": {
"securityRules":[]
}
},
{
"apiVersion": "2015-01-01",
"type": "Microsoft.Resources/deployments",
"name": "loop-0",
"dependsOn": [
"NSG1"
],
"properties": {
"mode":"Incremental",
"parameters":{},
"template": {
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {},
"variables": {},
"resources": [],
"outputs": {}
}
}
},
{
"apiVersion": "2015-01-01",
"type": "Microsoft.Resources/deployments",
"name": "[concat('loop-', copyIndex(1))]",
"dependsOn": [
"[concat('loop-', copyIndex())]"
],
"copy": {
"name": "iterator",
"count": "[length(parameters('networkSecurityGroupsSettings').securityRules)]"
},
"properties": {
"mode": "Incremental",
"template": {
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {},
"variables": {},
"resources": [
{
"name": "[concat('NSG1/' ,
parameters('networkSecurityGroupsSettings').securityRules[copyIndex()].name)]",
"type": "Microsoft.Network/networkSecurityGroups/securityRules",
"apiVersion": "2016-09-01",
"location":"[resourceGroup().location]",
"properties":{
"description": "
[parameters('networkSecurityGroupsSettings').securityRules[copyIndex()].description]",
"priority":"
[parameters('networkSecurityGroupsSettings').securityRules[copyIndex()].priority]",
"protocol":"
[parameters('networkSecurityGroupsSettings').securityRules[copyIndex()].protocol]",
"sourcePortRange": "
[parameters('networkSecurityGroupsSettings').securityRules[copyIndex()].sourcePortRange]",
"destinationPortRange": "
[parameters('networkSecurityGroupsSettings').securityRules[copyIndex()].destinationPortRange]",
"sourceAddressPrefix": "
[parameters('networkSecurityGroupsSettings').securityRules[copyIndex()].sourceAddressPrefix]",
"destinationAddressPrefix": "
[parameters('networkSecurityGroupsSettings').securityRules[copyIndex()].destinationAddressPrefix]",
"access":"
[parameters('networkSecurityGroupsSettings').securityRules[copyIndex()].access]",
"direction":"
[parameters('networkSecurityGroupsSettings').securityRules[copyIndex()].direction]"
}
}
],
"outputs": {}
}
}
}
],
"outputs": {}
}
Let's take a closer look at how we specify our property values in the securityRules child resource. All of our
properties are referenced using the parameters() function, and then we use the dot operator to reference our
securityRules array, indexed by the current value of the iteration. Finally, we use another dot operator to reference
the name of the object.

Try the template


An example template is available on GitHub. To deploy the template, clone the repo and run the following Azure
CLI commands:

git clone https://github.com/mspnp/template-examples.git


cd template-examples/example3-object-param
az group create --location <location> --name <resource-group-name>
az deployment group create -g <resource-group-name> \
--template-uri https://raw.githubusercontent.com/mspnp/template-examples/master/example3-object-param/deploy.json \
--parameters deploy.parameters.json

Next steps
Learn how to create a template that iterates through an object array and transforms it into a JSON schema. See
Implement a property transformer and collector in an Azure Resource Manager template
Implement a property transformer and collector in an
Azure Resource Manager template

In use an object as a parameter in an Azure Resource Manager template, you learned how to store resource
property values in an object and apply them to a resource during deployment. While this is a very useful way to
manage your parameters, it still requires you to map the object's properties to resource properties each time you
use it in your template.
To work around this, you can implement a property transform and collector template that iterates your object array
and transforms it into the JSON schema expected by the resource.

IMPORTANT
This approach requires that you have a deep understanding of Resource Manager templates and functions.

Let's take a look at how we can implement a property collector and transformer with an example that deploys a
network security group. The diagram below shows the relationship between our templates and our resources
within those templates:

Our calling template includes two resources:


A template link that invokes our collector template .
The network security group resource to deploy.
Our collector template includes two resources:
An anchor resource.
A template link that invokes the transform template in a copy loop.
Our transform template includes a single resource: an empty template with a variable that transforms our
source JSON to the JSON schema expected by our network security group resource in the main template .

Parameter object
We'll be using our securityRules parameter object from objects as parameters. Our transform template will
transform each object in the securityRules array into the JSON schema expected by the network security group
resource in our calling template .
{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"networkSecurityGroupsSettings": {
"value": {
"securityRules": [
{
"name": "RDPAllow",
"description": "allow RDP connections",
"direction": "Inbound",
"priority": 100,
"sourceAddressPrefix": "*",
"destinationAddressPrefix": "10.0.0.0/24",
"sourcePortRange": "*",
"destinationPortRange": "3389",
"access": "Allow",
"protocol": "Tcp"
},
{
"name": "HTTPAllow",
"description": "allow HTTP connections",
"direction": "Inbound",
"priority": 200,
"sourceAddressPrefix": "*",
"destinationAddressPrefix": "10.0.1.0/24",
"sourcePortRange": "*",
"destinationPortRange": "80",
"access": "Allow",
"protocol": "Tcp"
}
]
}
}
}
}

Let's look at our transform template first.

Transform template
Our transform template includes two parameters that are passed from the collector template :
source is an object that receives one of the property value objects from the property array. In our example,
each object from the "securityRules" array will be passed in one at a time.
state is an array that receives the concatenated results of all the previous transforms. This is the collection of
transformed JSON.
Our parameters look like this:

{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"source": { "type": "object" },
"state": {
"type": "array",
"defaultValue": [ ]
}
},

Our template also defines a variable named instance . It performs the actual transform of our source object into
the required JSON schema:

"variables": {
"instance": [
{
"name": "[parameters('source').name]",
"properties":{
"description": "[parameters('source').description]",
"protocol": "[parameters('source').protocol]",
"sourcePortRange": "[parameters('source').sourcePortRange]",
"destinationPortRange": "[parameters('source').destinationPortRange]",
"sourceAddressPrefix": "[parameters('source').sourceAddressPrefix]",
"destinationAddressPrefix": "[parameters('source').destinationAddressPrefix]",
"access": "[parameters('source').access]",
"priority": "[parameters('source').priority]",
"direction": "[parameters('source').direction]"
}
}
]

},

Finally, the output of our template concatenates the collected transforms of our state parameter with the current
transform performed by our instance variable:

"resources": [],
"outputs": {
"collection": {
"type": "array",
"value": "[concat(parameters('state'), variables('instance'))]"
}

Next, let's take a look at our collector template to see how it passes in our parameter values.

Collector template
Our collector template includes three parameters:
source is our complete parameter object array. It's passed in by the calling template . This has the same name
as the source parameter in our transform template but there is one key difference that you may have
already noticed: this is the complete array, but we only pass one element of this array to the transform
template at a time.
transformTemplateUri is the URI of our transform template . We're defining it as a parameter here for
template reusability.
state is an initially empty array that we pass to our transform template . It stores the collection of
transformed parameter objects when the copy loop is complete.
Our parameters look like this:

"parameters": {
"source": { "type": "array" },
"transformTemplateUri": { "type": "string" },
"state": {
"type": "array",
"defaultValue": [ ]
}

Next, we define a variable named count. Its value is the length of the source parameter object array:
"variables": {
"count": "[length(parameters('source'))]"
},

As you might suspect, we use it for the number of iterations in our copy loop.
Now let's take a look at our resources. We define two resources:
loop-0 is the zero-based resource for our copy loop.
loop- is concatenated with the result of the copyIndex(1) function to generate a unique iteration-based name for each resource, starting with 1.

Our resources look like this:

"resources": [
{
"type": "Microsoft.Resources/deployments",
"apiVersion": "2015-01-01",
"name": "loop-0",
"properties": {
"mode": "Incremental",
"parameters": { },
"template": {
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": { },
"variables": { },
"resources": [ ],
"outputs": {
"collection": {
"type": "array",
"value": "[parameters('state')]"
}
}
}
}
},
{
"type": "Microsoft.Resources/deployments",
"apiVersion": "2015-01-01",
"name": "[concat('loop-', copyindex(1))]",
"copy": {
"name": "iterator",
"count": "[variables('count')]",
"mode": "serial"
},
"dependsOn": [
"loop-0"
],
"properties": {
"mode": "Incremental",
"templateLink": { "uri": "[parameters('transformTemplateUri')]" },
"parameters": {
"source": { "value": "[parameters('source')[copyindex()]]" },
"state": { "value": "[reference(concat('loop-', copyindex())).outputs.collection.value]" }
}
}
}
],

Let's take a closer look at the parameters we're passing to our transform template in the nested template. Recall
from earlier that our source parameter passes the current object in the source parameter object array. The
state parameter is where the collection happens, because it takes the output of the previous iteration of our copy
loop—notice that the reference() function uses the copyIndex() function with no parameter to reference the
name of our previous linked template object—and passes it to the current iteration.

Finally, the output of our template returns the output of the last iteration of our transform template :

"outputs": {
"result": {
"type": "array",
"value": "[reference(concat('loop-', variables('count'))).outputs.collection.value]"
}
}

It may seem counterintuitive to return the output of the last iteration of our transform template to our calling template, because it appeared we were storing the collection in our state parameter. However, remember that it's the last iteration of our transform template that holds the complete array of transformed property objects, and that's what we want to return.
Finally, let's take a look at how to call the collector template from our calling template.

Calling template
Our calling template defines a single parameter named networkSecurityGroupsSettings:

...
"parameters": {
"networkSecurityGroupsSettings": {
"type": "object"
}

Next, our template defines a single variable named collectorTemplateUri :

"variables": {
"collectorTemplateUri": "[uri(deployment().properties.templateLink.uri, 'collector.template.json')]"
}

As you would expect, this is the URI for the collector template that will be used by our linked template resource:

{
"apiVersion": "2015-01-01",
"name": "collector",
"type": "Microsoft.Resources/deployments",
"properties": {
"mode": "Incremental",
"templateLink": {
"uri": "[variables('collectorTemplateUri')]",
"contentVersion": "1.0.0.0"
},
"parameters": {
"source" : {"value": "[parameters('networkSecurityGroupsSettings').securityRules]"},
"transformTemplateUri": { "value": "[uri(deployment().properties.templateLink.uri,
'transform.json')]"}
}
}
}

We pass two parameters to the collector template:

source is our property object array. In our example, it's the securityRules array from our networkSecurityGroupsSettings parameter.
transformTemplateUri is the URI of our transform template, which we calculate inline with the uri() function.
Finally, our Microsoft.Network/networkSecurityGroups resource directly assigns the output of the collector linked
template resource to its securityRules property:

{
"apiVersion": "2015-06-15",
"type": "Microsoft.Network/networkSecurityGroups",
"name": "networkSecurityGroup1",
"location": "[resourceGroup().location]",
"properties": {
"securityRules": "[reference('collector').outputs.result.value]"
}
}
],
"outputs": {
"instance":{
"type": "array",
"value": "[reference('collector').outputs.result.value]"
}

Try the template


An example template is available on GitHub. To deploy the template, clone the repo and run the following Azure CLI
commands:

git clone https://github.com/mspnp/template-examples.git
cd template-examples/example4-collector
az group create --location <location> --name <resource-group-name>
az deployment group create -g <resource-group-name> \
  --template-uri https://raw.githubusercontent.com/mspnp/template-examples/master/example4-collector/deploy.json \
  --parameters deploy.parameters.json
Azure DevOps CI/CD pipelines for chatbots

This article presents a DevOps approach to setting up a continuous integration and continuous deployment (CI/CD)
pipeline that deploys a chatbot app and its infrastructure as code.
DevOps is a common development strategy for building custom applications like bots. DevOps pipelines use
modern CI/CD processes to manage software builds, deployments, testing, and monitoring. Azure DevOps tools
can help you accelerate your software delivery and focus on your code, rather than the supporting infrastructure
and operations.
Infrastructure as code uses Azure Resource Manager (ARM) templates or open-source alternatives to define and
deploy an app's underlying infrastructure. You can co-locate the software and infrastructure code in the same
source control repository, and deploy both through your pipelines.

Architecture

In this example:
1. Developers check in the bot logic, test code, ARM infrastructure templates, and build pipeline code to a Git
repository.
2. Continuous integration triggers the Azure Pipelines build pipeline to build and test the bot solution.
3. Continuous deployment through an Azure Pipelines release pipeline triggers automated deployment of the app
and infrastructure to Azure App Service and Azure Bot Services.
4. Azure Bot Services runs the bot.
5. Azure Application Insights collects telemetry on app and bot performance.
Source control
Source control lets development teams track software code changes and inspect older code versions. With
infrastructure as code, you can define and deploy your underlying infrastructure using ARM templates or open-
source alternatives, and also maintain those templates in source control. You can store unit test and build pipeline
code in the same source control repository.
The example scenario in GitHub stores application, test, infrastructure, and build pipeline code in three main folders:
The application folder contains the bot logic and unit test code.
The arm-templates folder stores the infrastructure as code.
The pipelines folder contains the definition for the YAML build pipeline.
ARM templates
The example uses several linked ARM templates to deploy the solution. Capability templates, similar to units or
modules, implement specific pieces of functionality. An end-to-end solution or orchestrator template brings those
underlying capabilities together. There are a couple of benefits to this approach:
The underlying capability templates can be individually unit tested.
The underlying capability templates could be defined as standards for an organization, and reused in multiple
solutions.
In this example, the deployment.json orchestrator template deploys the following capability templates:
Azure App Service Plan
Azure App Service instance
Azure Bot Service
Azure Application Insights
If you want your chatbot to use Cognitive Services such as LUIS and QnA Maker, be sure to add them in the
template as well.
Bot development and testing
The example uses a simple C# chatbot designed to work in the Microsoft Teams channel. You can open the example
solution in Visual Studio.
To develop a new bot, create a C# Bot Framework project in Visual Studio. You can use the Echo Bot template, or
start an empty project. For a comprehensive tutorial on creating a Bot Framework project, see Create a bot with the
Bot Framework SDK for .NET. For information on configuring a bot to run on multiple channels, see Manage
channels in Bot Service. Adding a bot in Microsoft Teams requires a few additional steps and considerations. For
details, see Bots in Microsoft Teams.
You can test and debug your chatbot during development with the Bot Framework emulator, a desktop app that lets
you test and debug bots built with the BotBuilder SDK.
Unit testing tests specific parts of source code against a desired outcome. Unit testing chatbots uses the same
principles and best practices as testing other apps. For more information about unit testing, see Unit testing in .NET
and Unit testing best practices.
You can add a unit test project to a chatbot solution in Visual Studio, and choose between MSTest or XUnit for the
project. For a live view of the unit tests, configure Live Unit Testing.
Continuous integration and build
Azure Pipelines is the Azure DevOps service that implements build, test, and deployment pipelines for your apps.
You define the triggers and tasks for build pipelines in a YAML build definition file that you can store in source
control. You can also use the Classic visual interface to create a build pipeline by selecting Use the classic editor
on the pipeline creation page.
If you enable continuous integration in the triggers section of your build pipeline, new commits to the repository
automatically kick off the build. YAML build pipelines have a CI trigger on all branches by default. Once you check in
your code, the build pipeline runs tests to validate the code, and builds the appropriate parts of the software. The
number and types of tests and tasks you run depend on your wider build and release strategy.
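As a rough illustration, the following sketch shows the shape such a YAML build definition might take: it restores, builds, tests, and publishes the bot as a pipeline artifact. The SDK version, paths, and artifact name are assumptions for illustration rather than values taken from the example repository.

trigger:
  branches:
    include:
      - '*'   # CI on every branch, matching the default described above

pool:
  vmImage: 'ubuntu-latest'

steps:
  - task: UseDotNet@2                 # install the .NET SDK used to build the bot
    inputs:
      packageType: 'sdk'
      version: '3.1.x'                # assumed SDK version
  - script: dotnet build --configuration Release
    displayName: Build bot solution
  - script: dotnet test --configuration Release --no-build
    displayName: Run unit tests
  - script: dotnet publish --configuration Release --output $(Build.ArtifactStagingDirectory)
    displayName: Publish bot package
  - task: PublishPipelineArtifact@1   # hand the build output to the release pipeline
    inputs:
      targetPath: '$(Build.ArtifactStagingDirectory)'
      artifact: 'bot-logic'

The example repository's bot-logic.build.yml differs in its details; the point is the trigger, build, test, and publish shape.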
Continuous deployment and release
An Azure Pipelines release pipeline automates your release process and enables continuous delivery.
In the release pipeline, you define a number of tasks to deploy your app and underlying infrastructure. You can
define many variables for the overall pipeline, so you don't have to input them for every task.
The release pipeline depends on the build pipeline to get the output for the bot logic, and it links to the Git repo to
get the ARM templates to deploy the infrastructure. For deployment, the ARM templates need to be accessible from
a public endpoint such as a Git repo or Azure Blob Storage account.
App Service Plan and App Service instance
An Azure App Service Plan is the underlying server farm used to deploy an Azure App Service instance. The
appservicePlan.json and appservice.json ARM templates create the App Service Plan and App Service instance for
the chatbot app.
Bot Services
The Azure Bot Service provides an endpoint for the bot to communicate through the chosen channels. The Bot
Service also provides a mechanism to associate the bot to a pricing tier, based upon the expected requests and
throughput. The deployed botservicePlan.json ARM template creates the Bot Service for the chatbot app.
To deploy to Azure, the bot needs an Application ID and password, and the ability to register channels to run in. The
example botservice.json template includes channel registration, but you also need to register the app in the Azure
portal. To register an app in the Azure portal, see App registration.
Application Insights
Azure Application Insights can collect telemetry from the app about what the bot logic is doing and how it is
performing. You can also connect Application Insights directly to the Bot Service. The deployed
applicationInsights.json ARM template creates Application Insights for the chatbot app. The template also outputs
the Application Insights Instrumentation Key and Resource ID. You use these to configure the Bot Service endpoint
and collect telemetry directly from Azure Bot Service.

Issues and considerations


Although ARM templates don't need to be compiled, you can validate their quality. For example, you could
do linting on the ARM templates. See the ARM template toolkit for more ARM template analysis and test
tools from the Azure Resource Manager team. Consider which pipeline, build or release, is most appropriate
for the tests, based on your development lifecycle.
The linked ARM templates for deploying the infrastructure need to be accessible from a public endpoint, such as a Git repository or an Azure Blob Storage account. If you upload the templates to a storage account, you can keep the container private and grant the pipeline access with a shared access signature (SAS) token.
A common practice is to use Azure Key Vault instead of storing secrets in Azure DevOps. If the service principal connection to your Azure subscription has appropriate access policies on the key vault, the pipeline can download secrets from the vault and use them as variables, which avoids storing them in source control. Each pipeline variable takes the name of the corresponding secret. For example, a secret in the key vault called botMicrosoftAppPassword can be referenced as $(botMicrosoftAppPassword) in the release pipeline definition. A minimal sketch follows this list.
You can set up Bot Analytics to gain additional insight into the performance of your bot. To set up Bot
Analytics, you need an API Key from Application Insights, but you can't create this key by using an ARM
template. You can create the key manually from your Application Insights resource in the Azure portal.
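As a minimal sketch of the Key Vault approach described above, assuming a YAML pipeline, a service connection named azure-connection, and a vault named bot-keyvault (all illustrative names):

steps:
  - task: AzureKeyVault@2
    displayName: Read secrets from Azure Key Vault
    inputs:
      azureSubscription: 'azure-connection'     # service connection with an access policy on the vault
      KeyVaultName: 'bot-keyvault'
      SecretsFilter: 'botMicrosoftAppPassword'  # or '*' to download all secrets
      RunAsPreJob: false
  # later steps can reference the secret as $(botMicrosoftAppPassword)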

Deploy this scenario


You can build and deploy the example scenario from GitHub.
Prerequisites
This solution requires an existing Azure account, and an Azure DevOps organization and team project.
Create a free Azure account
Create an Azure DevOps Organization
Create a project in Azure DevOps
Bot development and registration
The example uses a simple C# chatbot designed to work in the Microsoft Teams channel. The chatbot replies to anyone writing VSTS with the phrase: You mentioned "VSTS" above, but I think you meant Azure DevOps.
You can open the example solution in Visual Studio.
To deploy to Azure, the bot needs an Application ID and password, and the ability to register channels to run in. The
example botservice.json ARM template includes channel registration, but you also need to register the app in the
Azure portal. To register an app in the Azure portal, see App registration. When you register the app, note the
Application ID and password, as you'll use them to set Azure Pipelines variables for deployment.
Bot testing
Test and debug your chatbot during development with the Bot Framework emulator, a desktop app that lets you
test and debug bots built with the BotBuilder SDK. Download the emulator from the GitHub releases page. Once
installed, start talking to your bot either by connecting to an endpoint or by opening a bot saved locally. The
emulator provides information in JSON format, which is easy to read and comprehend.
The example solution also includes a unit test project, and the example build pipeline runs the unit tests.
Build pipeline
To use the example YAML file to create a build pipeline:
1. In the left navigation of your Azure DevOps project, select Pipelines and then select New pipeline.
2. Under Where is your code, select GitHub, and then select your local clone of the example repository.
3. Under Configure your pipeline, select Existing Azure Pipelines YAML file and then select or type the /cicdbots/pipelines/bot-logic.build.yml file from the repo. Select Continue.
4. Review the code, and then select Run.
Release pipeline
The release pipeline deploys the bot and infrastructure code. It can also create an Azure Storage account and copy
the ARM templates to it.
To create the release pipeline:
1. In the left navigation of your Azure DevOps project, select Pipelines > Releases, and then select New pipeline.
2. Under Select a template, select Empty job. Change the stage name to something meaningful, like Dev.
3. Under Artifacts on the left, select Add, then under Add an artifact, select Build.
4. Select your build pipeline from the dropdown. Make a note of the Source Alias, as you will need it when creating the release tasks. Select Add.
5. Select Artifacts > Add again, then under Add an artifact, select GitHub. You may have to authorize the GitHub connection.
6. Under Source (repository), select the example repository that contains the ARM templates, and then select Add.
7. On the pipeline page, select the Variables tab. Select Add, and set the following name/value pairs:

appInsightsName: Name for the Application Insights resource that logs bot telemetry.
appServiceName: Name for the web app, which becomes part of the website name: <appServiceName>.azurewebsites.net.
appServicePlanName: Name for the App Service Plan that contains the web app.
botAppInsightsApiKey: The API Key for Application Insights, which you create manually and copy from your Application Insights resource in the Azure portal.
botMicrosoftAppId: The Application ID that you got from the Azure portal when you registered your app.
botMicrosoftAppPassword: The password that you set in the Azure portal when you registered your app. Set the padlock on this variable to hide the value.
botServiceDisplayName: Name for the bot as it appears in a channel.
botServiceName: Name for the Azure Bot Service in the Azure portal.
developerAppInsightsApplicationId: The Application ID output when you created the Application Insights resource.
location: Azure region in which to create the Azure resources.
resourceGroupName: Name for the resource group to create to contain the Azure resources.
storageAccountName: Storage account for the linked ARM templates.

8. On the pipeline page, select the Tasks tab. Select the + next to Agent job, and under Add tasks, select and configure the following tasks. After selecting each task, select Add and then fill out the form, using variables where appropriate.
Download pipeline artifacts to download the bot logic
Azure resource group deployment to deploy the bot infrastructure ARM templates
Azure App Service deploy to deploy the bot to the App Service
You can also add Azure resource group deployment and Azure file copy tasks to create an Azure Storage account and copy the ARM templates to the storage account. You can use the storageAccount.json ARM template in the example repo to create the storage account.

Next steps
Here are some additional items to further enhance the scenario:
Deploy additional services to enhance your bot, including LUIS.
Deploy a back-end store for your bot to interact with, such as Azure Cosmos DB.
Automate the generation of the Application Insights API Key, and consider storing the key in Azure Key Vault so that you can reference it at deployment time.
Related resources
Design a CI/CD pipeline using Azure DevOps
What is source control?
Understand the structure and syntax of Azure Resource Manager templates
ARM template reference guide for Microsoft.Storage resource types
CI/CD for Azure VMs

Solution Idea

Azure is a world-class cloud for hosting virtual machines running Windows or Linux. Whether you use ASP.NET,
Java, Node.js, or PHP to develop applications, you'll need a continuous integration and continuous deployment
(CI/CD) pipeline to push changes to these virtual machines automatically.
Azure DevOps provides the CI/CD pipeline, starting with a Git repository for managing your application source
code and infrastructure code (ARM templates), a Build system for producing packages and other build artifacts, and
a Release Management system for setting up a pipeline to deploy your changes through dev, test, and production
environments. The pipeline uses ARM templates to provision or update your infrastructure as necessary in each
environment, and then deploys the updated build. You can also use Azure DevTest Labs to automatically tear down
test resources that are not in use.

Architecture

Download an SVG of this architecture.

Data Flow
1. Change application source code
2. Commit Application Code and Azure Resource Manager (ARM) Template
3. Continuous integration triggers application build and unit tests
4. Continuous deployment trigger orchestrates deployment of application artifacts with environment-specific parameters, as sketched after this list
5. Deployment to QA environment
6. Deployment to staging environment
7. Deployment to production environment
8. Application Insights collects and analyzes health, performance, and usage data
9. Review health, performance, and usage information
10. Update backlog item
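The data flow above maps naturally onto a multi-stage pipeline in which each environment stage first applies the ARM template with its own parameter file and then deploys the build. The following sketch shows one possible shape of such a pipeline; the service connection, resource group, region, and template paths are assumptions.

trigger:
  - main

pool:
  vmImage: 'ubuntu-latest'

stages:
  - stage: Build
    jobs:
      - job: BuildAndTest
        steps:
          - script: echo "restore, build, and unit-test the application here"
            displayName: Build and test

  - stage: QA
    dependsOn: Build
    jobs:
      - job: DeployQA
        steps:
          - task: AzureResourceManagerTemplateDeployment@3
            displayName: Provision or update QA infrastructure
            inputs:
              deploymentScope: 'Resource Group'
              azureResourceManagerConnection: 'azure-connection'   # assumed service connection
              subscriptionId: '$(subscriptionId)'
              resourceGroupName: 'rg-app-qa'
              location: 'westus2'
              templateLocation: 'Linked artifact'
              csmFile: 'arm/azuredeploy.json'
              csmParametersFile: 'arm/azuredeploy.qa.parameters.json'
              deploymentMode: 'Incremental'
          # follow with a step that deploys the application build to the provisioned VMs

  # Staging and Production stages repeat this pattern with their own parameter
  # files, and with approvals configured on the corresponding environments.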

Components
Virtual Machines: Provision Windows and Linux virtual machines in seconds
Azure DevTest Labs: Quickly create environments using reusable templates and artifacts
Application Insights: Detect, triage, and diagnose issues in your web apps and services.
Azure DevOps: Build and deploy multi-platform apps to get the most from Azure services

Next steps
Use Azure DevOps to Deploy to a Windows Virtual Machine
Use Azure DevOps to manage a virtual machine in Azure DevTest Labs
Performance monitoring with Application Insights
Git on Azure DevOps
CI/CD for Azure Web Apps

Solution Idea

Azure Web Apps is a fast and simple way to create web apps using ASP.NET, Java, Node.js, or PHP. Deliver value
faster to your customers with a continuous integration and continuous deployment (CI/CD) pipeline that pushes
each of your changes automatically to Web Apps.

Architecture

Download an SVG of this architecture.

Data Flow
1. Change application source code.
2. Commit application code and Web Apps web.config file.
3. Continuous integration triggers application build and unit tests.
4. Continuous deployment trigger orchestrates deployment of application artifacts with environment-specific
parameters.
5. Deployment to Web Apps.
6. Azure Application Insights collects and analyzes health, performance, and usage data.
7. Review health, performance, and usage information.
8. Update backlog item.

Components
Application Insights: Detect, triage, and diagnose issues in your web apps and services.
Web Apps: Quickly create and deploy mission critical Web apps at scale.
Azure DevOps: Services for teams to share code, track work, and ship software.
Visual Studio: A creative launch pad for viewing and editing code, then debugging, building, and publishing apps
for Android, iOS, Windows, the web, and the cloud.

Next steps
Performance monitoring with Application Insights
Use Azure DevOps to deploy to an Azure Web App
Git on Azure DevOps
Deploy to Azure Web Apps using Visual Studio and Azure DevOps

CI/CD for Containers

Solution Idea

Containers make it easy for you to continuously build and deploy your applications. By orchestrating deployment
of those containers using Kubernetes in Azure Kubernetes Service (AKS), you can achieve replicable, manageable
clusters of containers.
By setting up a continuous build to produce your container images and orchestration, Azure DevOps increases the
speed and reliability of your deployment.

Architecture

Download an SVG of this architecture.

Data Flow
1. Change application source code
2. Commit Application Code
3. Continuous integration triggers application build, container image build and unit tests
4. Container image is pushed to Azure Container Registry, as shown in the pipeline sketch after this list
5. Continuous deployment trigger orchestrates deployment of application artifacts with environment-specific
parameters
6. Deployment to Azure Kubernetes Service (AKS)
7. Container is launched using Container Image from Azure Container Registry
8. Application Insights collects and analyzes health, performance, and usage data
9. Review health, performance, and usage information
10. Update backlog item
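A minimal Azure Pipelines sketch of steps 3 through 7 might build and push the image and then apply a Kubernetes manifest. The service connection names, image repository, registry hostname, and manifest path are assumptions.

trigger:
  - main

pool:
  vmImage: 'ubuntu-latest'

steps:
  - task: Docker@2
    displayName: Build and push image to Azure Container Registry
    inputs:
      command: buildAndPush
      containerRegistry: 'acr-connection'   # assumed Docker registry service connection
      repository: 'myapp'
      dockerfile: 'Dockerfile'
      tags: |
        $(Build.BuildId)

  - task: KubernetesManifest@0
    displayName: Deploy to Azure Kubernetes Service
    inputs:
      action: deploy
      kubernetesServiceConnection: 'aks-connection'   # assumed AKS service connection
      namespace: 'default'
      manifests: 'manifests/deployment.yml'
      containers: 'myregistry.azurecr.io/myapp:$(Build.BuildId)'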

Components
Container Registry: Store and manage container images across all types of Azure deployments
Azure Kubernetes Service (AKS): Simplify the deployment, management, and operations of Kubernetes
Application Insights: Detect, triage, and diagnose issues in your web apps and services
Azure DevOps: Build and deploy multi-platform apps to get the most from Azure services

Next steps
Pushing Docker images to Azure Container Registry
Authenticate Azure Kubernetes Service (AKS) cluster to Azure Container Registry
Performance monitoring with Application Insights
Git on Azure DevOps

Container CI/CD using Jenkins and Kubernetes on
Azure Kubernetes Service (AKS)

Solution Idea

Containers make it easy for you to continuously build and deploy applications. By orchestrating the deployment of
those containers using Azure Kubernetes Service (AKS), you can achieve replicable, manageable clusters of
containers.
By setting up a continuous build to produce your container images and orchestration, you can increase the speed
and reliability of your deployment.

Architecture

Download an SVG of this architecture.

Data Flow
1. Change application source code.
2. Commit code to GitHub.
3. Continuous Integration Trigger to Jenkins.
4. Jenkins triggers a build job using Azure Kubernetes Service (AKS) for a dynamic build agent.
5. Jenkins builds and pushes Docker container to Azure Container Registry.
6. Jenkins deploys your new containerized app to Kubernetes on Azure Kubernetes Service (AKS), backed by Azure Cosmos DB.
7. Grafana displays visualization of infrastructure and application metrics via Azure Monitor.
8. Monitor application and make improvements.

Components
Azure Kubernetes Service (AKS): Simplify the deployment, management, and operations of Kubernetes.
Container Registry: Store and manage container images across all types of Azure deployments.
Azure Cosmos DB: Globally distributed, multi-model database for any scale.
Azure Monitor: Highly granular and real-time monitoring data for any Azure resource.
Visual Studio Code: A lightweight code editor with extensions for building and deploying apps to Azure.

Next steps
Integrating Jenkins with Azure Container Service and Kubernetes
Pushing Docker images to Azure Container Registry
Connect existing Node.js to Cosmos DB using the MongoDB connector
Monitor your Azure services using Grafana
Get the Azure Extension Pack for Visual Studio Code

DevSecOps in Azure

Security is a prime concern for businesses that store any sort of custom or client data. Any solution that manages and exposes this data should be developed with security in mind. DevSecOps applies security best practices from the beginning of development, shifting the focus on security away from auditing at the end and toward development at the beginning, using a shift-left strategy.

Download an SVG of this architecture.

Data flow
1. Azure Active Directory (AD) can be configured as the identity provider for GitHub. Multi-factor authentication
can be enabled for extra security.
2. Developers commit to GitHub Enterprise, driven by work items and bugs tracked with Azure Boards.
3. GitHub Enterprise can integrate automatic security and dependency scanning through GitHub Advanced
Security and GitHub Open Source Security.
4. Pull Requests trigger CI builds and automated testing in Azure Pipelines.
5. The CI build in Azure Pipelines generates a Docker container image that is stored in Azure Container Registry and used at release time by Azure Kubernetes Service.
6. Upon uploading to the Azure Container Registry, Azure Security Center will scan the image for Azure-native
vulnerabilities and for security recommendations for the pushed image.
7. A release on Azure Pipelines integrates the Terraform tool, managing the cloud infrastructure as code and provisioning resources such as Azure Kubernetes Service, Application Gateway, and Azure Cosmos DB (see the sketch after this list).
8. Azure Pipelines enable Continuous Delivery (CD) to Azure Kubernetes Service, by accessing the Container
Registry through a secure service connection.
9. Azure Policy can be applied to Azure Pipelines to enforce post-deployment gates, and can be applied directly to the AKS cluster for policy enforcement.
10. Azure Key Vault is used to securely inject secrets and credentials into an application at runtime, abstracting
sensitive information away from developers.
11. End users authenticate with Azure AD B2C, which can require MFA for extra security, and are routed through an Application Gateway that load balances and protects core services.
12. Continuous monitoring with Azure Monitor extends to release pipelines to gate or rollback releases based on
monitoring data. Azure Monitor also ingests security logs and can alert on suspicious activity.
13. As an additional and final part of the DevSecOps flow, Azure Security Center performs active threat monitoring on Azure Kubernetes Service, at both the node level (VM threats) and for cluster internals.
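For step 7, the Terraform integration can be as simple as script steps in a pipeline stage. The following sketch assumes the Terraform configuration lives in an infrastructure folder and that service principal credentials are supplied through the azurerm provider's ARM_* environment variables (ideally sourced from Azure Key Vault); the folder and variable names are assumptions.

steps:
  - script: |
      terraform init
      terraform plan -out=tfplan
      terraform apply -auto-approve tfplan
    displayName: Provision infrastructure with Terraform
    workingDirectory: 'infrastructure'
    env:
      ARM_CLIENT_ID: $(armClientId)             # service principal credentials,
      ARM_CLIENT_SECRET: $(armClientSecret)     # ideally pulled from Azure Key Vault
      ARM_SUBSCRIPTION_ID: $(armSubscriptionId)
      ARM_TENANT_ID: $(armTenantId)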

Components
Azure Active Directory provides identity and access management services for your organization, allowing
control over access to the resources inside Azure, GitHub Enterprise, and Azure DevOps.
Source code is hosted on GitHub Enterprise, where developers can collaborate within your organization and the open-source communities. GitHub Enterprise offers advanced security features to identify vulnerabilities in the code you write and in open-source dependencies.
Use Azure Boards to plan work and track its progress, using Agile tools such as Kanban boards.
Azure Pipelines is a service that provides Continuous Integration and Continuous Delivery jobs, to build and
release your application automatically.
Host your Docker container images on Azure Container Registry. This service includes container image
scanning with the integration with Azure Security Center.
Azure Kubernetes Service offers a Kubernetes cluster that is fully managed by Azure, to ensure availability and
security of your infrastructure.
Terraform is a third-party product developed by HashiCorp that allows infrastructure automation on Azure, as
well as on other environments.
Azure Policy lets you create, assign, and manage policies. These policies enforce different rules and effects over
your resources, so those resources stay compliant with your corporate standards and service level agreements.
It integrates with Azure Kubernetes Service too.
You can use Azure Key Vault to store certificates, connection strings, tokens, and other secrets. This sensitive
information is read by your application at run-time, so it's abstracted away from your developers.
Azure Cosmos DB is a globally distributed, multi-model database service, that is fully managed and compatible
with multiple APIs, including MongoDB, Cassandra, SQL.
Azure Application Gateway is a Layer-7 load balancer with support for advanced routing rules and a Web
Application Firewall (WAF).
Using Azure Monitor lets you get insights on the availability and performance of your application and
infrastructure. It also gives you access to signals to monitor your solution's health and spot abnormal activity
early.
Using Azure AD B2C you can provide identity services to consumers (end-users) of your application, even if
they're not part of your organization.

Next steps
DevSecOps in GitHub

Related resources
Using Azure Security Center you can supply threat and vulnerability management for your deployed container-
based solution.
DevSecOps in GitHub

Starting with the first steps of development, DevSecOps adheres to security best practices. By using a shift-left
strategy, DevSecOps redirects the security focus. Instead of pointing toward auditing at the end, it shifts to
development in the beginning. Besides producing robust code, this fail fast approach helps to resolve problems
early on, when they're easy to fix.
With many security capabilities, GitHub offers tools that support every part of a DevSecOps workflow:
Browser-based IDEs with built-in security extensions.
Agents that continuously monitor security advisories and replace vulnerable and out-of-date dependencies.
Search capabilities that scan source code for vulnerabilities.
Action-based workflows that automate every step of development, testing, and deployment.
Spaces that provide a way to privately discuss and resolve security threats and then publish the information.
Combined with the monitoring and evaluation power of Azure, these features provide a superb service for
building secure cloud solutions.

Potential use cases


GitHub DevSecOps installations cover many security scenarios. Possibilities include the following cases:
Developers who want to take advantage of pre-configured environments that offer security capabilities.
Administrators who rely on having up-to-date, prioritized security reports at their fingertips, along with details
on affected code and suggested fixes.
Streamlined organizations that need systems to automatically acquire new, uncompromised security credentials when secrets are left exposed in code.
Development teams that could benefit from automatic upgrades when newer or more secure versions of
external packages become available.

Architecture
Architecture diagram highlighting security checks that run in a GitHub DevSecOps environment. After Azure AD
authenticates developers, Codespaces run security scans. GitHub Actions then test security and encrypt sensitive
data. In production, Azure Policy, Azure Security Center, and Azure Monitor evaluate deployed software for risks.
Download an .svg of this architecture.
1. When developers access GitHub resources, GitHub redirects them to Azure Active Directory (Azure AD) for
SAML authentication. In a single sign-on (SSO) procedure, the Microsoft Authenticator app then uses FIDO2
strong authentication. The passwordless FIDO2 security keys align with the latest Fast Identity Online (FIDO)
Alliance specifications.
2. Developers begin working on tasks in Codespaces. Organized into containers, these pre-built development
environments provide correctly configured IDEs that are equipped with required security scanning extensions.
3. When developers commit new code, GitHub Actions automatically scan the code to quickly find vulnerabilities
and coding errors.
4. Pull requests (PRs) trigger code builds and automated testing through GitHub Actions. GitHub encrypts secrets
and credentials at rest and obfuscates these entries in logs.
5. GitHub Actions deploy build artifacts to Azure App Service while making changes to other cloud resources, such as service endpoints (see the workflow sketch after this list).
6. Azure Policy evaluates Azure resources that are in deployment. Defined policies then potentially deny releases,
modify cloud resources, or create warning events in activity logs.
7. Azure Security Center identifies attacks targeting applications that are running in deployed projects.
8. Azure Monitor continuously tracks and evaluates app behavior. When threats materialize, this service sends
alerts to start the process of rolling code back to previous commits.
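A minimal sketch of the workflow behind steps 3 through 5: it builds and tests on pushes and pull requests, keeps the deployment credential in an encrypted repository secret, and deploys to App Service on pushes to the default branch. The app name, secret name, SDK version, and build commands are assumptions.

name: build-and-deploy
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-dotnet@v1
        with:
          dotnet-version: '3.1.x'        # assumed SDK version
      - name: Build and test
        run: |
          dotnet build --configuration Release
          dotnet test --configuration Release --no-build
      - name: Publish
        run: dotnet publish --configuration Release --output published
      - name: Deploy to Azure App Service
        if: github.event_name == 'push'  # deploy only after the change reaches main
        uses: azure/webapps-deploy@v2
        with:
          app-name: 'my-web-app'                                        # assumed app name
          publish-profile: ${{ secrets.AZURE_WEBAPP_PUBLISH_PROFILE }}  # encrypted repository secret
          package: './published'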

Components
Azure AD is a multi-tenant, cloud-based identity service that controls access to Azure and other cloud apps like
Microsoft 365 and GitHub.
GitHub provides a code-hosting platform that developers can use for collaborating on both open-source and
inner-source projects.
Codespaces is an online development environment. Hosted by GitHub and powered by Visual Studio Code, this
tool provides a complete development solution in the cloud.
GitHub Security works to eliminate threats in a number of ways. Agents and services identify vulnerabilities in
repositories and in dependent packages. They also upgrade dependencies to up-to-date, secure versions.
GitHub Actions are custom workflows that provide continuous integration (CI) and continuous deployment
(CD) capabilities directly in repositories. Computers called runners host these CI/CD jobs.
App Service provides a framework for building, deploying, and scaling web apps. This platform offers built-in
infrastructure maintenance, security patching, and scaling.
Azure Policy helps teams manage and prevent IT issues through policy definitions that can enforce rules for
cloud resources. For instance, if your project is about to deploy a virtual machine with an unrecognized SKU,
Azure Policy alerts you to the problem and stops the deployment.
Azure Security Center provides unified security management and advanced threat protection across hybrid
cloud workloads.
Azure Monitor collects and analyzes app telemetry, such as performance metrics and activity logs. When this
service identifies irregular conditions, it alerts apps and personnel.
Security
GitHub Security provides multiple features for addressing security risks:
Code scanning inspects code for known vulnerabilities and coding errors. As an example, if a developer
leaves a database connection string exposed in code, this feature discovers the secret. After verifying its
validity with the database, GitHub starts the process of obtaining an uncompromised string. These checks
use CodeQL, a code analysis platform that improves upon traditional analyzers by treating code as data.
Scans automatically run at scheduled times or after certain events occur, like commits or pushes.
GitHub Dependabot checks for outdated or vulnerable packages and applications. This automated agent
updates software, replacing out-of-date or insecure dependencies with newer, secure versions. For instance,
if your project uses an open-source library, Dependabot examines that library. Suppose the library doesn't
encrypt sensitive cleartext that it stores in a database. In this case, Dependabot creates a PR to upgrade the
library to a version that encrypts the data.
Vulnerability management identifies and updates known vulnerabilities in code and in software packages
that the code uses. It runs checks whenever the following events occur:
A repository's dependencies change (for instance, when a project switches from .NET to .NET Core).
A notification appears from WhiteSource. This third-party service tracks vulnerabilities by
continuously scanning open-source repositories.
A new vulnerability enters the GitHub Advisory Database. Entries in this database originate from the
following sources:
The National Vulnerability Database: A standardized repository of vulnerabilities that the U.S.
government maintains.
GitHub tracking: A combination of machine learning and human review that GitHub conducts to
detect vulnerabilities in public commits.
GitHub security advisories: Information about vulnerabilities that development teams make public
on GitHub.
PHP Security Advisories Database: References to known security vulnerabilities in PHP projects
and libraries.
When GitHub identifies a vulnerability, it takes the steps illustrated in the following diagram.
Architecture diagram illustrating a chain of events in a GitHub DevSecOps implementation. At the outset,
GitHub identifies a vulnerability and sends an email alert. Dependabot then creates a branch, updates the
vulnerability source, and creates a PR. The branch merges. In the final step, GitHub Actions deploy the new
app.

Download an .svg of this diagram.


1. GitHub sends an email alert to the organization owners and repository administrators.
2. GitHub Dependabot, a DevOps bot agent (its update behavior can be tuned with the configuration sketched after this list), automatically completes the following three tasks:
a. Creates a new branch in the repository.
b. Upgrades the necessary dependencies to the minimum possible secure version needed to eliminate the
vulnerability.
c. Creates a PR with the upgraded dependency.
3. When the PR is approved, the new branch merges with the base branch.
4. The merged branch triggers CI/CD tasks in GitHub Actions.
5. GitHub Actions deploy the new app version to a test or staging environment.
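The update behavior referenced in step 2 is tuned with a dependabot.yml file stored in the repository's .github folder. A minimal sketch, assuming a .NET project that uses NuGet (the ecosystem, directory, and schedule are assumptions about the project):

# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "nuget"   # assumed ecosystem; use "npm", "pip", and so on for other stacks
    directory: "/"               # location of the project's manifest files
    schedule:
      interval: "weekly"
    open-pull-requests-limit: 5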

Considerations
To keep GitHub DevSecOps solutions aligned with the tenets of the Azure Well-Architected Framework, consider
the following points when deciding how to implement this pattern.
Cost optimization
GitHub bills customers for GitHub Actions by the minute. In addition, the choice of operating system that hosts
Actions jobs affects the per-minute consumption rate and per-minute cost. Wherever possible, choose Linux to
host Actions. See About billing for GitHub actions.
Project managers on tight schedules may worry that adding security measures will delay development. In practice, the following guidelines save time overall:
Shift testing left, closer to the source. Teams make fewer mistakes as a result.
Address issues during programming, rather than months down the line in production. Then developers
don't need to refresh their knowledge of the code.
Operational excellence
Run automated tests and maintenance in environments that use vulnerability management capabilities,
since these capabilities change code and its dependencies on your behalf. Automated testing identifies any
issues that result from these changes.
Take advantage of Azure Policy features. Besides denying deployments and logging compliance issues,
these policies can also modify resources, making them compliant, even if they aren't deployed that way. For
example, if you try to deploy a storage account in Azure that uses HTTP, Azure Policy detects the situation.
Policies can then automatically change the deployment and force the storage account to use HTTPS.
Azure Resource Manager uses JSON templates to describe the resources involved in deployment. Teams
can also manage these template documents by using DevOps tools, like version control, code collaboration,
and CI/CD workflows.
One concern with DevSecOps is that code scans can generate noisy results filled with false positives,
leading to the following types of problems:
Developers waste time investigating nonexistent problems.
Addressing security issues interrupts workflow.
Having lost trust in security tools because of the inaccuracies, developers ignore results.
Overcome these obstacles by integrating security into the software lifecycle:
Employ tools like Codespaces that embed scanning checks in IDEs, meaning developers use them in
familiar environments.
Make security checks a regular part of code reviews instead of an afterthought.
Put developers in charge of high-precision scans, but leave noisier checks to security teams.
Performance efficiency
For long-running or complex Actions, host your own runners for CI/CD jobs. You can then choose computers with
powerful processing capabilities and ample memory. See About self-hosted runners.
Reliability
If you work with the on-premises deployment of GitHub.com that GitHub Enterprise Server provides, use a
highly available failover configuration for increased resiliency.
If you opt for self-hosted Actions runners, consider distributing them geographically.
Security
Using self-hosted Actions runners for public repositories isn't advised. A malicious user could join your repo
and create a PR that runs unsafe code on computers in your network. GitHub-hosted runners remove this risk.
Scan your code using the CodeQL analysis engine. CodeQL can discover potential vulnerabilities and coding
errors. It can run both on a schedule and when events occur, such as a commit or a PR creation. See About code
scanning.
Make sure to configure Dependabot security updates, which can remove known threats from projects.
You can augment the code-scanning capabilities of GitHub by adding GitHub third-party code-scanning tools
that produce Static Analysis Results Interchange Format (SARIF) files. GitHub then creates alerts when those
tools identify potential security issues.
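A minimal sketch of the code-scanning workflow referenced above, assuming a C# codebase and a weekly schedule (both assumptions):

name: codeql-analysis
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  schedule:
    - cron: '0 3 * * 1'   # also scan weekly, independent of commits

jobs:
  analyze:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: github/codeql-action/init@v1
        with:
          languages: 'csharp'                       # assumed language
      - uses: github/codeql-action/autobuild@v1     # build the code so CodeQL can analyze it
      - uses: github/codeql-action/analyze@v1       # upload results as code-scanning alerts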

Next steps
Training materials, tools, and other resources on GitHub DevSecOps
Tips for getting started with GitHub DevSecOps

Related resources
DevSecOps in Azure
Shift left
Fail fast
SARIF files
DevTest and DevOps for IaaS

Infrastructure as a service (IaaS) is a form of cloud computing that provides virtualized computing resources.
Development testing (DevTest) is a software development approach that integrates testing early in the
development phase. In this solution, configuring DevTest operations for an IaaS application reduces the cost and
overhead of development and test environments, while facilitating faster development through automated virtual
machine (VM) and VM image integration and deployment.
DevOps is a set of practices that combine software development and IT operations to shorten the development
cycle and provide high-quality continuous deployments (CD). Using Azure DevOps with DevTest environments
lends power and focus to the IaaS development process.

Architecture

1. Instead of manually configuring development environments, developers can use Windows Virtual Desktop
images, pre-configured with the libraries, tools, and runtimes they need for their projects. Adding a
developer to an Azure DevTest Subscription makes the appropriate Windows Virtual Desktop image
available to them from the DevTest environment.
2. Source code is available in GitHub repos, which integrate seamlessly with Azure DevOps. The Visual Studio
development environment combines GitHub source code editing with features like work item and pull
request tracking.
3. Azure Pipelines triggers automated continuous integration (CI) builds from GitHub repos and automatically
delivers them to the DevTest environments, reaching quality assurance (QA) testing quickly with low
developer overhead. Azure Pipelines uses Azure Key Vault to securely consume secrets like credentials and
connection strings required for release and deployment configurations.
4. Azure Boards connects back to the GitHub repos, letting developers track both code and tasks in one
location. Automated testing also generates bugs for any build or release failures, feeding the results back
into the development cycle.
5. Azure Boards work items come from automated testing, manual QA testing, and added features. Developers
create feature branches and associate them with work items to track development, creating more iterations
of the development loop.
6. Azure Pipelines deploys builds to the low-cost Azure DevTest Labs subscription environments, where
developers and testers can rapidly provision VMs. Azure Policy regulates and limits DevTest VM numbers
and costs, and audits VM usage to provide insight and tracking.
Developers can deploy VMs as quickly as needed within the lab, while staying within resource and cost
parameters set by managers and administrators. The flexibility of low-cost labs provides developers with all
the environments they need for rapid iterative progress.
Testers can operate within their own DevTest Labs environment, pulling ready-for-test images into the test
team's labs separately from the development team. The ability of developers and testers to work in parallel
DevTest Labs contributes to rapid iteration of ready-to-release image versions.
7. As the developed VM images reach a release state, Azure Pipelines triggers releases, which generalize the
targeted images for destinations like virtual machine scale sets, and promotes them out of DevTest and into
a Production environment.
User Acceptance Testing (UAT) validates a staged VM or virtual machine scale set before deployment to
Production.
Approvals are usually required for releases to higher-cost, client-facing Production destinations. Production
remains isolated and protected from inadvertent or unapproved deployments.
This scenario uses a separate DevTest and Production Azure Active Directory (Azure AD) per subscription to create
a distinct separation of concerns. To meet compliance requirements, the Production subscription's Azure AD might
need to include a smaller cross section of users than the DevTest Azure AD.
Azure Monitor works across subscriptions to monitor VMs in both the Production and DevTest environments.

Components
Azure DevTest Labs provides labs that have all the necessary tools and software to create environments.
Developers can efficiently self-manage resources without waiting for approvals. With DevTest Labs, teams
can control costs and regulate resources per lab, granting developers permission and flexibility to operate
their sandboxes within cost constraints.
Azure VM Image Builder service provides baseline VM images that developers can customize. The service
facilitates the creation and patching of images and can be called as an Azure Pipelines task.
Shared Image Gallery acts as a VM image repository for IaaS solutions. VM Image Builder can build directly into a Shared Image Gallery, which supports an Azure Pipelines CI/CD process for versioning the VM-based application (see the sketch after this list).
GitHub is a code hosting platform for version control and collaboration. A GitHub repository contains all
project files and their revision history. Developers can work together to contribute, discuss, and manage
code in the repository.
Azure Pipelines deploys the VM application images. Pipelines can also deploy the VM resources themselves,
through Azure Resource Manager (ARM) templates. This infrastructure-as-code can be source controlled
and configured for CI/CD, ensuring that the infrastructure remains up to date.
Azure Key Vault securely stores and tightly controls access to secrets like API keys, passwords, and
certificates. For more information about Key Vault in DevOps scenarios, see DevSecOps in Azure.
Azure Boards is a service for managing work for software projects. Azure Boards brings a rich set of
capabilities, including native support for Scrum and Kanban methodologies, customizable dashboards, and
integrated reporting.
Azure Active Directory (Azure AD) enterprise identity platform provides single sign-on and multifactor
authentication to govern user access to resources. In this scenario, a separate Azure AD per subscription
creates a distinct separation of concerns between Azure users.
Azure Policy governs resources to meet organizational standards and compliance. In a DevTest role, Azure
Policy can regulate and limit the number and costs of VMs in the subscription. Auditing can provide insights
and track the usage of the DevTest VMs.
Azure Monitor can work across subscriptions to monitor VMs in both Production and DevTest
environments. Azure Monitor can collect log data from VM operating systems as well as crash dump files,
and aggregate them for viewing in Azure Security Center.
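As one illustration of the Shared Image Gallery versioning flow mentioned above, a pipeline step can wrap the Azure CLI to publish a new image version after VM Image Builder produces a managed image. The service connection, resource group, gallery, image definition, and source image names are assumptions.

steps:
  - task: AzureCLI@2
    displayName: Publish a new image version to the Shared Image Gallery
    inputs:
      azureSubscription: 'azure-connection'   # assumed service connection
      scriptType: 'bash'
      scriptLocation: 'inlineScript'
      inlineScript: |
        # create version 1.0.<build id> from an existing managed image (names are assumptions)
        az sig image-version create \
          --resource-group rg-images \
          --gallery-name sigDevTest \
          --gallery-image-definition myAppImage \
          --gallery-image-version 1.0.$(Build.BuildId) \
          --managed-image /subscriptions/$(subscriptionId)/resourceGroups/rg-images/providers/Microsoft.Compute/images/myAppImage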

Alternatives
In situations where VM Image Builder and a Shared Image Gallery don't work, you can set up an image factory to
build VM images from the CI/CD pipeline and distribute them automatically to any Azure DevTest Labs registered
to those images. For more information, see Run an image factory from Azure DevOps.

Next steps
DevSecOps in Azure
DevTest and DevOps for PaaS solutions
DevTest and DevOps for microservices
Set up Azure DevOps
Create a lab in Azure DevTest Labs
Create your first Windows virtual machine in the Azure portal
DevTest and DevOps for PaaS

In platform as a service (PaaS) cloud computing, a service provider delivers a platform for clients to develop, run,
and manage applications without having to build and maintain infrastructure. Solution development based on
Azure Platform resources and services removes the overhead of managing virtual machines, virtual networks,
disks, and related configurations.
Development testing (DevTest) is a software development approach that integrates testing early in the
development phase. DevOps is a set of practices that combine software development and IT operations, to shorten
the development cycle and provide continuous delivery with high quality.
This solution architecture combines reduced overhead and a DevOps toolchain to support rapid DevTest iteration
cycles and a fast development environment with PaaS resources.

Architecture

1. Developers use GitHub Codespaces pre-configured development environments to avoid manually configuring development environments. Developers can use Codespaces in-browser or through Visual Studio Code.
2. Source code is available in GitHub repos, which integrate seamlessly with Azure DevOps.
3. Similar to Azure DevOps, GitHub Actions sets automated builds to trigger on GitHub commits or pull requests. In this solution, the build action publishes artifacts to a GitHub Packages feed for deployment.
4. Azure Boards connects back to GitHub Repos to generate work items from automated and manual testing,
and tracks all work in the development cycle.
5. Developers create feature or fix branches that associate work items with development sprints or cycles,
tracking all code with tasks and requirements. With this approach, developers maintain an active
development loop velocity that is critical to agile software development.
6. Azure Pipelines is the solution for Azure PaaS deployments. Azure Pipelines handles continuous deployment
(CD) and release tasks, which consume the package versions exposed by GitHub Actions builds. Azure
Pipelines uses Azure Key Vault to securely consume secrets like credentials and connection strings required
for release and deployment configurations.
7. Azure Pipelines deployment tasks specifically support Azure PaaS resources, including web application
deployments:
a. In conjunction with the continuous builds from GitHub Actions, a continuous release deploys the latest
application build to a Development environment.
b. The pipeline promotes the build selected for testing up to the next deployment ring, the Test
environment.
c. After the build passes all tests, it's promoted and staged in the next outer ring for User Acceptance
Testing (UAT).
All three DevTest environments exist under a discounted-cost Azure DevTest Subscription. Azure Policy can
add additional rules for the size and number of App Services or PaaS databases to limit costs.
8. When solution quality reaches a Production state, Azure Pipelines triggers a Production release. The pipeline limits Production deployments to the minimum number necessary.
For an App Service, a Pipelines release can deploy the targeted build version to the Pre-Production slot of the Production resource, and then swap the slots to move it into Production in place (see the sketch after this list). If something goes wrong during the swap of Pre-Production to Production, App Service can roll back the swap for extra deployment safety.
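A sketch of that deploy-then-swap step as pipeline tasks, assuming an app named my-web-app with a pre-production slot, a resource group named rg-prod, and a service connection named azure-connection (all illustrative names):

steps:
  - task: AzureWebApp@1
    displayName: Deploy the build to the Pre-Production slot
    inputs:
      azureSubscription: 'azure-connection'
      appName: 'my-web-app'
      deployToSlotOrASE: true
      resourceGroupName: 'rg-prod'
      slotName: 'pre-production'
      package: '$(Pipeline.Workspace)/drop/*.zip'   # assumed artifact location

  - task: AzureAppServiceManage@0
    displayName: Swap Pre-Production into Production
    inputs:
      azureSubscription: 'azure-connection'
      Action: 'Swap Slots'
      WebAppName: 'my-web-app'
      ResourceGroupName: 'rg-prod'
      SourceSlot: 'pre-production'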
In this solution, a single Azure Active Directory (Azure AD) manages identity for both the DevTest and Production
subscriptions. Role-based access control (RBAC) restricts access to protected resources, preventing unauthorized
or inadvertent modification of Production resources. Developers don't have the same access control levels in
Production as in their DevTest sandboxes.
Terraform provisions and modifies resources for each environment, tailoring size, instance count, and other
properties. The current solution models an environment stack of Azure App Service, Azure Cache for Redis, and
Azure SQL Database.
Azure Monitor works across subscriptions to monitor all environments and collect logs, crash dump reports, and
application data.

Components
GitHub is a code hosting platform for version control and collaboration, with other integrated features:
A GitHub source-control repository contains all project files and their revision history. Developers can
work together to contribute, discuss, and manage code in the repository.
GitHub Codespaces provide developers with pre-configured development environments that cover
library dependencies, user configuration, and extensions. Developers can provision codespaces in
GitHub, and use them either in-browser or through Visual Studio Code.
GitHub Actions provide a suite of build and release workflows, covering continuous integration (CI),
automated testing, and Azure Resource Manager (ARM) template or container deployments.
GitHub Packages is a package-hosting service that lets you host software packages and use them as dependencies in projects.
Azure Boards is a service for managing work for software projects. Azure Boards brings a rich set of
capabilities including native support for Scrum and Kanban methodologies, customizable dashboards, and
integrated reporting.
Azure Pipelines deploys the application builds to the Azure PaaS environments. Pipelines can also deploy the Azure resources themselves, through Azure Resource Manager (ARM) templates. This infrastructure-as-code can be source controlled and configured for CI/CD, ensuring that the infrastructure remains up to date.
Azure Key Vault securely stores and tightly controls access to secrets like API keys, passwords, and
certificates. For more information about Key Vault in DevOps scenarios, see DevSecOps in Azure and
DevSecOps in GitHub.
Azure AD enterprise identity platform provides single sign-on and multifactor authentication to govern user
access to resources. In the current scenario, Azure AD works with RBAC to restrict access per environment.
Terraform is a third-party tool that provisions and modifies resources per environment. The current solution
provisions an environment stack of Azure App Service, Azure Cache for Redis, and Azure SQL Database.
Terraform also supports cross-platform infrastructure-as-code configuration and deployment across Azure
and other cloud providers. Infrastructure-as-code lets developers validate infrastructure changes the same
way they validate code changes.
Azure App Service is a fully managed web hosting service for building web apps. App Service supports
DevOps capabilities like CD, package management, staging environments, custom domains, and TLS/SSL
certificates.
Azure Cache for Redis provides an in-memory data store based on the open-source software Redis, which
improves performance and scalability of applications that rely heavily on backend data stores.
Azure SQL Database is a fully managed PaaS database engine that handles most database management
functions like upgrading, patching, backups, and monitoring, without user involvement.
Azure Policy governs resources to meet organizational standards and compliance. In a DevTest role, Azure
Policy can regulate and limit the number and costs of VMs. Auditing can provide insights and track the
usage of the DevTest VMs.
Azure Monitor can monitor both Production and DevTest environments. Azure Monitor can collect log data
from VM operating systems and crash dump files, and aggregate them for viewing in Azure Security Center.
Application Insights is a feature of Azure Monitor that works with App Service to monitor application
performance and issues.

Considerations
Developing an Azure SQL Database design that adequately supports the web app is an important consideration.
Source control and versioning are as important for data schemas as for code and builds.
Tools like Entity Framework can provide source control for database schema changes via migrations. Migrations
codify changes to the database structure and provide methods to both advance forward and roll back changes if
issues occur.
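For example, a CI step could create and apply Entity Framework Core migrations with the dotnet ef CLI. The project path and migration names below are hypothetical; rolling back is done by updating the database to an earlier migration.

```python
# Hypothetical CI step that applies EF Core migrations; project path and names are placeholders.
import subprocess

PROJECT = "src/WebApp"

# Create a new migration from the current model (usually done by a developer, then committed).
subprocess.run(["dotnet", "ef", "migrations", "add", "AddOrderStatusColumn", "--project", PROJECT], check=True)

# Apply all pending migrations to the target database.
subprocess.run(["dotnet", "ef", "database", "update", "--project", PROJECT], check=True)

# Roll back by updating to a named earlier migration if an issue is found:
# subprocess.run(["dotnet", "ef", "database", "update", "PreviousMigrationName", "--project", PROJECT], check=True)
```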

Alternatives
While this architecture uses Azure Pipelines for deployments, GitHub Actions can also perform PaaS
deployments.
Some integrations available in Azure Pipelines, such as service connection or authentication directly into the
Azure backbone, don't currently exist in GitHub Actions. For these needs, consider using Azure Pipelines
instead of GitHub Actions for CI and build activities.
While Terraform is a powerful environment management tool, one of its main features is its cross-platform
capabilities. For applications that exist solely in Azure, consider using ARM templates instead of Terraform
for resource deployments. ARM templates have direct Microsoft support, have their own tasks in Azure
Pipelines, and may be able to deploy features or properties that aren't immediately available in Terraform.

Next steps
DevSecOps in Azure
DevTest and DevOps for IaaS solutions
DevTest and DevOps for microservices
Set up Azure DevOps
Create a lab in Azure DevTest Labs
Create a SQL database in minutes by using the Azure portal
How to create a web app with Azure Cache for Redis
DevTest and DevOps for microservices
11/2/2020 • 5 minutes to read • Edit Online

Microservice architectures design applications as collections of loosely coupled services. In a microservices architecture, services are fine-grained and protocols are lightweight. Microservices offer benefits such as clear separation of concerns and decoupling of dependencies.
Microservices introduce complexities in the development cycle compared to traditional monolithic applications.
Traditionally, development occurs in a local or virtual replica of the application stack, which configures and runs
compute and storage components locally in isolation. In a microservice model, developers need to test their
services against the existing architecture, catch integration issues early to save on build and deployment time, and
keep integrated builds clean over the lifecycle of the application.
Development testing (DevTest) is a software development approach that integrates testing early in the
development phase to speed development. DevOps is a set of practices that combine software development and IT
operations to shorten the development cycle and provide high-quality continuous delivery. Kubernetes is an open-
source container orchestration system for automating application deployments.
This solution architecture models a development and deployment environment that uses DevOps in DevTest for
rapid iterative development of an Azure Kubernetes Service (AKS) microservice application.

Architecture

1. Developers use Local Process with Kubernetes to run their local microservice versions within the context of
the development Kubernetes cluster. Connecting to the cluster while debugging the service allows quick
testing and development in the full application context.
2. Each microservice codebase uses a separate GitHub code repository for source control.
3. GitHub Actions builds the microservice container images and pushes them to Azure Container Registries.
GitHub Actions also updates the latest tag of repositories for continuous integration (CI), or tags
repositories for release.
4. GitHub Actions automated testing generates work items for Azure Boards, making all work items
manageable in one place.
5. Visual Studio Code extensions support Azure Boards and GitHub integration. Associating Azure Boards
work items with GitHub repos ties requirements to code, driving the development loop forward.
6. Commits merged into the integration branch trigger GitHub Actions builds and Docker pushes to the
DevTest container registries. Each microservice has its own repository in Container Registries, paralleling
the GitHub repositories. CI builds are usually tagged with latest, representing the most recent successful
microservice builds.
7. Azure Pipelines runs the kubectl apply command to trigger deployment of the updated Container Registry images to the DevTest Kubernetes clusters (see the sketch after this list). Azure can authenticate AKS to run unattended Container Registry pulls, simplifying the continuous deployment (CD) process.
Azure Pipelines uses Azure Key Vault to securely consume secrets like credentials and connection strings
required for release and deployment configurations.
8. When a version of the application is ready for quality assurance (QA) testing, Azure Pipelines triggers a QA
release. The pipeline tags all appropriate images with the next incremental version, updates the Kubernetes
manifest to reflect the image tags, and runs the apply command. In this example, while a developer may
be iterating on a service in isolation, only builds integrated via CI/CD are moved over to deployment.
9. After testing has approved a version of the service for deployment, GitHub Actions promotes a release from
the DevTest Container Registry to a Production Container Registry. GitHub Actions tags the images with the
appropriate version and pushes them into the Production Container Registry, following container registry
best practices.
10. Azure Pipelines creates a release to Production. The pipeline imposes approval gates and pre-stage and
post-stage conditions to protect the Production environment from inadvertent or incorrect deployment.
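The following sketch shows the kind of update that step 7 describes, using the Kubernetes Python client to point a deployment at a newly pushed image tag. The deployment, namespace, registry, and tag names are placeholders; an Azure Pipelines task running kubectl apply against an updated manifest achieves the same result.

```python
# Sketch: roll a DevTest deployment to a new image tag pushed by CI.
# Names below (deployment, namespace, registry, tag) are placeholders.
from kubernetes import client, config

config.load_kube_config()  # Use load_incluster_config() when running inside the cluster.

apps = client.AppsV1Api()
apps.patch_namespaced_deployment(
    name="orders-service",
    namespace="devtest",
    body={
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "orders-service",
                            "image": "contosodevtest.azurecr.io/orders-service:1.4.0",
                        }
                    ]
                }
            }
        }
    },
)
```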
The application uses Azure Cosmos DB for its globally distributed database tier.
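Because Azure Cosmos DB also offers a local emulator, the same data-access code can run against the emulator in a DevTest sandbox and against the real account in Azure. A minimal sketch with the azure-cosmos Python SDK follows; the endpoint, key, and the database, container, and item shapes are placeholders.

```python
# Minimal sketch: the same code targets the Cosmos DB emulator locally and an Azure account in the cloud.
# Endpoint, key, and the database/container/item shapes are placeholders.
import os
from azure.cosmos import CosmosClient, PartitionKey

endpoint = os.environ.get("COSMOS_ENDPOINT", "https://localhost:8081")  # emulator endpoint by default
key = os.environ["COSMOS_KEY"]

cosmos = CosmosClient(endpoint, credential=key)
database = cosmos.create_database_if_not_exists("orders")
container = database.create_container_if_not_exists(
    id="events", partition_key=PartitionKey(path="/orderId")
)
container.upsert_item({"id": "1", "orderId": "42", "status": "created"})
```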
All services and environments report metrics to Azure Monitor.
In this solution, a single Azure Active Directory (Azure AD) manages identity for both the DevTest and Production
subscriptions. Role-based access control (RBAC) restricts access to protected resources, preventing unauthorized
or inadvertent modification of Production resources. Developers don't have the same access control levels in
Production as in their DevTest sandboxes.

Components
Azure DevTest Labs provides labs that have all the necessary tools and software to create environments.
Developers can efficiently self-manage resources without waiting for approvals. With DevTest Labs, teams
can control costs and regulate resources per lab, granting developers permission and flexibility to operate
their sandboxes within cost constraints.
GitHub is a code hosting platform for version control and collaboration. A GitHub source-control repository
contains all project files and their revision history. Developers can work together to contribute, discuss, and
manage code in the repository.
GitHub Actions provides a suite of build and release workflows, covering CI, automated testing, and
container deployments.
Azure Boards is a service for managing work for software projects. Azure Boards brings a rich set of
capabilities including native support for Scrum and Kanban methodologies, customizable dashboards, and
integrated reporting.
Azure Pipelines is a fully featured CI/CD service that can automatically deploy updated Container Registry
images to Kubernetes clusters.
Azure Key Vault securely stores and tightly controls access to secrets like API keys, passwords, and
certificates. For more information about Key Vault in DevOps scenarios, see DevSecOps in Azure and
DevSecOps in GitHub.
Azure Container Registry supports building, storing, and managing container images and artifacts in
private registries for all types of container deployments.
Azure Kubernetes Service makes it simple to deploy managed Kubernetes clusters by offloading much of
the complexity, responsibility, and operational overhead to Azure.
Azure Active Directory (Azure AD) enterprise identity platform provides single sign-on and multifactor
authentication to govern user access. A single Azure AD can manage identity for all environments across
subscriptions. Role-based access control (RBAC) restricts access to protected resources, preventing
unauthorized or inadvertent modification of production resources.
Azure Cosmos DB is a fully managed, widely distributed database-as-a-service that supports high
availability, multi-region applications, and both SQL and NoSQL APIs. Azure Cosmos DB includes DevTest
features like a local Cosmos DB emulator that integrates with Azure DevOps, and low-cost tiers for
managing costs in DevTest sandboxes.
Azure Monitor can monitor both Production and DevTest environments. Azure Monitor collects log data
from VM operating systems and crash dump files, and aggregates them for viewing in Azure Security
Center.

Alternatives
Azure Repos is an alternative to GitHub for Git repository hosting. With Azure Repos, Azure Boards, and
Azure Pipelines, all Azure DevOps Services use the same portal and user interface, consolidating the
services developers need for DevOps activities.
Some integrations available in Azure Pipelines, such as service connection or authentication directly into the
Azure backbone, don't currently exist in GitHub Actions. For these needs, consider using Azure Pipelines
instead of GitHub Actions for CI and build activities.
In a widely distributed system, there are benefits to separating microservices into individual repositories.
Separation of ownership and permission is simpler, and projects in different languages are easier to
maintain than with a single repository. However, in solutions with fewer microservices all in the same
language or runtime, maintaining a single Git repository for the project may be easier.

Next steps
Design a microservice-oriented application
CI/CD pipeline for container-based workloads
DevSecOps in Azure
DevTest and DevOps for IaaS solutions
DevTest and DevOps for PaaS solutions
Set up Azure DevOps
Create a lab in Azure DevTest Labs
DevTest Image Factory
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea

The image factory provides a great way for organizations to create, maintain, and distribute custom images with Azure DevTest Labs. Whether you have globally distributed teams that need to work with a common set of custom images, need to centrally manage image configuration to meet regulatory compliance and security requirements, or have complex software setup and configuration requirements, the image factory provides an automated solution to manage it all.

Architecture

Download an SVG of this architecture.

Data Flow
1. With config as code, define the images to push and select which labs will receive the image.
2. The IT admin checks the configuration into the source code control of choice (such as Visual Studio Team Services or GitHub + Jenkins).
3. The orchestrator triggers "golden image" creation in the image factory, based on the configuration in source code control.
4. The image factory lab receives commands to create virtual machines (VMs) and custom images.
5. The specified images are copied from the image factory lab to the team labs.
6. Team lab users claim VMs or create VMs with the latest images.

Components
Azure Lab Services: Set up labs for classrooms, trials, development and testing, and other scenarios
Virtual Machines: Provision Windows and Linux virtual machines in seconds
Azure DevOps: Services for teams to share code, track work, and ship software
Next steps
Azure Lab Services documentation
Virtual Machines documentation
Azure DevOps documentation
Immutable Infrastructure CI/CD using Jenkins and
Terraform on Azure Virtual Architecture overview
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea

Azure is a world-class cloud for hosting virtual machines running Windows or Linux. Whether you use Java,
Node.js, Go, or PHP to develop your applications, you'll need a continuous integration and continuous deployment
(CI/CD) pipeline to push changes to these virtual machines automatically.

Architecture

Download an SVG of this architecture.

Data Flow
1. Change application source code.
2. Commit code to GitHub.
3. Continuous Integration Trigger to Jenkins.
4. Jenkins triggers a Packer image build to create a VM and stores it as a VM image using Azure Managed Disks.
5. Jenkins triggers Terraform to provision a new Virtual Machine Scale Set using the Azure Managed Disks VM
image.
6. Azure Log Analytics collects and analyzes logs.
7. Monitor application and make improvements.
Components
Managed Disks: Persistent, secured disk storage for Azure virtual machines.
Virtual Machine Scale Sets: Manage and scale up to thousands of Linux and Windows virtual machines.
Log Analytics: Collect, search, and visualize machine data from on-premises and cloud.
Visual Studio Code: Build and deploy multi-platform apps to get the most from Azure services.

Next steps
Use Terraform to create a Virtual Machine Scale Set from a Packer custom VM image
Use Terraform to plan and create a Virtual Machine Scale Set with Managed Disk
Use Log Analytics to collect data from Linux computers
Get the Azure Extensions Pack for Visual Studio Code

DevOps in a hybrid environment
11/2/2020 • 2 minutes to read • Edit Online

In many scenarios, the adoption of the Azure cloud as a business solution involves the migration of an on-premises
environment. It is usually not practical to deprecate the on-premises environment quickly, and in many cases the
on-premises environment will need to persist alongside the cloud environment for a significant amount of time.
The tools provided in Azure allow for the implementation of a DevOps strategy that capably manages both cloud
and on-premises environments in tandem.

Download an SVG of this architecture.

Data flow
1. GitHub Enterprise is used as the code repository for the application
2. Pull Requests trigger CI builds and automated testing in Azure Pipelines
3. Continuous monitoring with Azure Monitor extends to release pipelines to gate or roll back releases based on monitoring data.
4. A release on Azure Pipelines integrates the Terraform tool, managing both cloud and on-premises infrastructure
as code, provisioning resources such as Azure Web Apps, VMs, and databases in both locations
5. Azure Pipelines define both Continuous Delivery (CD) to a development environment in the cloud, and release
deployments to an on-premises production environment.
6. Azure Key Vault is used to securely inject secrets and credentials into a deployment, abstracting secrets away
from developers
7. Azure Monitor can be configured to log analytics from both the cloud and on-premises environments.
Application Insights as a part of Azure Monitor can be connected to both cloud and on-premises applications for
monitoring
8. Azure AD in the cloud can be used to provide identity services for the application, both running on Azure and
on-premises.
Components
Azure Boards: Use Azure Boards to plan work and track its progress, using Agile tools such as Kanban boards.
Source code is hosted on GitHub Enterprise, where developers can collaborate within your organization and the
open source communities. GitHub Enterprise offers advanced security features to identify vulnerabilities in the
code you write and in open source dependencies.
Azure Pipelines runs Continuous Integration and Continuous Delivery jobs for your application, as well as the
creation of your infrastructure with the integration with Terraform.
You can use Azure Key Vault to store certificates, connection strings, tokens, and other secrets. These are read by your application at run time, so they're abstracted away from your developers.
Terraform is a third-party product developed by HashiCorp that allows infrastructure automation on Azure, on-premises, and in other environments.
Using Azure Monitor lets you get insights on the availability and performance of your application and
infrastructure.
Azure AD provides identity and access management services for your application, both on-premises and in the cloud. Azure AD can synchronize with an on-premises Active Directory to seamlessly allow your users to authenticate everywhere.
Azure Web Apps is a managed platform for hosting web-based applications in the cloud.
Azure SQL Database is a managed database service for transactional applications.
Java CI/CD using Jenkins and Azure Web Apps
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea

Azure App Service is a fast and simple way to create web apps using Java, Node.js, PHP, or ASP.NET, with support for custom language runtimes using Docker. A continuous integration and continuous deployment (CI/CD) pipeline that pushes each of your changes automatically to Azure App Service allows you to deliver value faster to your customers.

Architecture

Download an SVG of this architecture.

Data Flow
1. Change application source code.
2. Commit code to GitHub.
3. Continuous integration triggers Jenkins.
4. Jenkins triggers a build job using Azure Container Instances for a dynamic build agent.
5. Jenkins builds and stores the artifact in Azure Storage.
6. Jenkins deploys the Java application to Azure Web Apps, backed by Azure Database for MySQL.
7. Application Insights provides metrics on application performance.
8. Monitor the application and make improvements.

Components
Azure Web Apps: Quickly create and deploy mission critical Web apps at scale
Container Instances: Easily run containers on Azure without managing servers
Azure Database for MySQL: Managed MySQL database service for app developers
Application Insights: Detect, triage, and diagnose issues in your web apps and services
Azure DevOps: Services for teams to share code, track work, and ship software

Next steps
Set up continuous integration and deployment to Azure Web Apps with Jenkins
Use the Azure Container Agents plug-in for Jenkins for dynamic build agents
Build a Java and MySQL web app in Azure
Performance monitoring with Application Insights
Get the Azure Extension Pack for Visual Studio Code

Run a Jenkins server on Azure
11/2/2020 • 12 minutes to read • Edit Online

This scenario shows how to deploy and operate a scalable, enterprise-grade Jenkins server on Azure secured with single sign-on (SSO). The architecture also uses Azure Monitor to monitor the state of the Jenkins server. To deploy this solution, see Deploy the solution later in this article.

Download a Visio file that contains this architecture diagram.


This architecture supports disaster recovery with Azure services but does not cover more advanced scale-out
scenarios involving multiple primaries or high availability (HA) with no downtime. For general insights about the
various Azure components, including a step-by-step tutorial about building out a CI/CD pipeline on Azure, see
Jenkins on Azure.
The focus of this document is on the core Azure operations needed to support Jenkins, including the use of Azure
Storage to maintain build artifacts, the security items needed for SSO, other services that can be integrated, and
scalability for the pipeline. The architecture is designed to work with an existing source control repository. For
example, a common scenario is to start Jenkins jobs based on GitHub commits.

Architecture
The architecture consists of the following components:
Resource group. A resource group is used to group Azure assets so they can be managed by lifetime,
owner, and other criteria. Use resource groups to deploy and monitor Azure assets as a group and track
billing costs by resource group. You can also delete resources as a set, which is very useful for test
deployments.
Jenkins server. A virtual machine is deployed to run Jenkins as an automation server and serves as
Jenkins Primary. To deploy this VM, use the solution template for Jenkins on Azure. Other Jenkins offerings
are available in the Azure Marketplace.
NOTE
Nginx is installed on the VM to act as a reverse proxy to Jenkins. You can configure Nginx to enable SSL for the
Jenkins server.

Virtual network. A virtual network connects Azure resources to each other and provides logical isolation.
In this architecture, the Jenkins server runs in a virtual network.
Subnets. The Jenkins server is isolated in a subnet to make it easier to manage and segregate network
traffic without affecting performance.
Network security groups. Use network security groups to restrict network traffic from the Internet to the
subnet of a virtual network.
Managed disks. A managed disk is a persistent virtual hard disk (VHD) used for application storage and
also to maintain the state of the Jenkins server and provide disaster recovery. Data disks are stored in Azure
Storage. For high performance, premium storage is recommended.
Azure Blob storage. The Windows Azure Storage plugin uses Azure Blob storage to store the build
artifacts that are created and shared with other Jenkins builds.
Azure Active Directory (Azure AD). Azure AD supports user authentication, allowing you to set up SSO.
Azure AD service principals define the policy and permissions for each role authorization in the workflow,
using role-based access control (RBAC). Each service principal is associated with a Jenkins job.
Azure Key Vault. To manage secrets and cryptographic keys used to provision Azure resources when
secrets are required, this architecture uses Azure Key Vault. For added help storing secrets associated with
the application in the pipeline, see also the Azure Credentials plugin for Jenkins.
Azure monitoring services. This service monitors the Azure virtual machine hosting Jenkins. This
deployment monitors the virtual machine status and CPU utilization and sends alerts.

Recommendations
The following recommendations apply for most scenarios. Follow these recommendations unless you have a
specific requirement that overrides them.
Azure AD
The Azure AD tenant for your Azure subscription is used to enable SSO for Jenkins users and set up service
principals that enable Jenkins jobs to access Azure resources.
SSO authentication and authorization are implemented by the Azure AD plugin installed on the Jenkins server. SSO
allows you to authenticate using your organization credentials from Azure AD when logging on to the Jenkins
server. When configuring the Azure AD plugin, you can specify the level of a user's authorized access to the Jenkins
server.
To provide Jenkins jobs with access to Azure resources, an Azure AD administrator creates service principals. These
grant applications — in this case, the Jenkins jobs — authenticated, authorized access to Azure resources.
RBAC further defines and controls access to Azure resources for users or service principals through their assigned
role. Both built-in and custom roles are supported. Roles also help secure the pipeline and ensure that a user's or
agent's responsibilities are assigned and authorized correctly. In addition, RBAC can be set up to limit access to
Azure assets. For example, a user can be limited to working with only the assets in a particular resource group.
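As an illustration of creating a narrowly scoped service principal for a Jenkins job, the sketch below shells out to the Azure CLI and limits the principal's Contributor role to a single resource group. The subscription ID, resource group, and display name are placeholders; an Azure AD administrator would normally run this once and store the resulting credentials in Jenkins through the Azure Credentials plugin or Key Vault.

```python
# Sketch: create a service principal whose Contributor access is limited to one resource group.
# Subscription ID, resource group, and display name are placeholders.
import json
import subprocess

subscription_id = "<subscription-id>"
scope = f"/subscriptions/{subscription_id}/resourceGroups/rg-jenkins-builds"

result = subprocess.run(
    ["az", "ad", "sp", "create-for-rbac",
     "--name", "jenkins-build-sp",
     "--role", "Contributor",
     "--scopes", scope],
    check=True, capture_output=True, text=True,
)

# The CLI returns appId, password, and tenant; store these in Jenkins or Key Vault, not in source control.
credentials = json.loads(result.stdout)
```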
Storage
Use the Jenkins Windows Azure Storage plugin, which is installed from the Azure Marketplace, to store build
artifacts that can be shared with other builds and tests. An Azure Storage account must be configured before this
plugin can be used by the Jenkins jobs.
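The plugin handles uploads from within Jenkins, but conceptually the operation is a straightforward blob upload. A minimal sketch with the azure-storage-blob Python SDK follows; the connection string, container, and artifact names are placeholders.

```python
# Minimal sketch: upload a build artifact to Azure Blob storage so other builds and tests can consume it.
# The connection string, container, and artifact path are placeholders.
import os
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
container = service.get_container_client("build-artifacts")

with open("target/app-1.4.0.jar", "rb") as data:
    container.upload_blob(name="releases/app-1.4.0.jar", data=data, overwrite=True)
```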
Jenkins Azure plugins
The solution template for Jenkins on Azure installs several Azure plugins. The Azure DevOps Team builds and
maintains the solution template and the following plugins, which work with other Jenkins offerings in Azure
Marketplace as well as any Jenkins primary set up on premises:
Azure AD plugin allows the Jenkins server to support SSO for users based on Azure AD.
Azure VM Agents plugin uses an Azure Resource Manager template to create Jenkins agents in Azure virtual
machines.
Azure Credentials plugin allows you to store Azure service principal credentials in Jenkins.
Windows Azure Storage plugin uploads build artifacts to, or downloads build dependencies from, Azure
Blob storage.
We also recommend reviewing the growing list of all available Azure plugins that work with Azure resources. To
see all the latest list, visit Jenkins Plugin Index and search for Azure. For example, the following plugins are
available for deployment:
Azure Container Agents helps you to run a container as an agent in Jenkins.
Kubernetes Continuous Deploy deploys resource configurations to a Kubernetes cluster.
Azure Container Service deploys configurations to Azure Container Service with Kubernetes, DC/OS with
Marathon, or Docker Swarm.
Azure Functions deploys your project to Azure Function.
Azure App Service deploys to Azure App Service.

Scalability considerations
Jenkins can scale to support very large workloads. For elastic builds, do not run builds on the Jenkins primary
server. Instead, offload build tasks to Jenkins agents, which can be elastically scaled in and out as needed. Consider
two options for scaling agents:
Use the Azure VM Agents plugin to create Jenkins agents that run in Azure VMs. This plugin enables elastic
scale-out for agents and can use distinct types of virtual machines. You can select a different base image
from Azure Marketplace or use a custom image. For details about how the Jenkins agents scale, see
Architecting for Scale in the Jenkins documentation.
Use the Azure Container Agents plugin to run a container as an agent in either Azure Container Service with
Kubernetes, or Azure Container Instances.
Virtual machines generally cost more to scale than containers. To use containers for scaling, however, your build
process must run with containers.
Also, use Azure Storage to share build artifacts that may be used in the next stage of the pipeline by other build
agents.
Scaling the Jenkins server
You can scale the Jenkins server VM up or down by changing the VM size. The solution template for Jenkins on
Azure specifies the DS2 v2 size (with two CPUs, 7 GB) by default. This size handles a small to medium team
workload. Change the VM size by choosing a different option when building out the server.
Selecting the correct server size depends on the size of the expected workload. The Jenkins community maintains a
selection guide to help identify the configuration that best meets your requirements. Azure offers many sizes for
Linux VMs to meet any requirements. For more information about scaling the Jenkins primary, refer to the Jenkins
community of best practices, which also includes details about scaling Jenkins.

Availability considerations
Availability in the context of a Jenkins server means being able to recover any state information associated with
your workflow, such as test results, libraries you have created, or other artifacts. Critical workflow state or artifacts
must be maintained to recover the workflow if the Jenkins server goes down. To assess your availability
requirements, consider two common metrics:
Recovery Time Objective (RTO) specifies how long you can go without Jenkins.
Recovery Point Objective (RPO) indicates how much data you can afford to lose if a disruption in service
affects Jenkins.
In practice, RTO and RPO imply redundancy and backup. Availability is not a question of hardware recovery — that
is part of Azure — but rather ensuring you maintain the state of your Jenkins server. Microsoft offers a service level
agreement (SLA) for single VM instances. If this SLA doesn't meet your uptime requirements, make sure you have
a plan for disaster recovery, or consider using a multi-primary Jenkins server deployment (not covered in this
document).
Consider using the disaster recovery scripts in step 7 of the deployment to create an Azure Storage account with
managed disks to store the Jenkins server state. If Jenkins goes down, it can be restored to the state stored in this
separate storage account.

Security considerations
Use the following approaches to help lock down security on a basic Jenkins server, since in its basic state, it is not
secure.
Set up a secure way to log into the Jenkins server. This architecture uses HTTP and has a public IP, but HTTP
is not secure by default. Consider setting up HTTPS on the Nginx server being used for a secure logon.

NOTE
When adding SSL to your server, create a network security group rule for the Jenkins subnet to open port 443. For
more information, see How to open ports to a virtual machine with the Azure portal.

Ensure that the Jenkins configuration prevents cross site request forgery (Manage Jenkins > Configure
Global Security). This is the default for Microsoft Jenkins Server.
Configure read-only access to the Jenkins dashboard by using the Matrix Authorization Strategy Plugin.
Install the Azure Credentials plugin to use Key Vault to handle secrets for the Azure assets, the agents in the
pipeline, and third-party components.
Use RBAC to restrict the access of the service principal to the minimum required to run the jobs. This helps
limit the scope of damage from a rogue job.
Jenkins jobs often require secrets to access Azure services that require authorization, such as Azure Container
Service. Use Key Vault along with the Azure Credential plugin to manage these secrets securely. Use Key Vault to
store service principal credentials, passwords, tokens, and other secrets.
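Inside Jenkins, the Azure Credentials plugin surfaces these secrets to jobs. For reference, retrieving a secret directly with the Azure SDK looks roughly like the following sketch; the vault URL and secret name are placeholders.

```python
# Minimal sketch: read a secret from Key Vault using the identity available to the job.
# The vault URL and secret name are placeholders.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

client = SecretClient(
    vault_url="https://contoso-jenkins-kv.vault.azure.net",
    credential=DefaultAzureCredential(),
)
service_principal_password = client.get_secret("jenkins-sp-password").value
```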
To get a central view of the security state of your Azure resources, use Azure Security Center. Security Center
monitors potential security issues and provides a comprehensive picture of the security health of your deployment.
Security Center is configured per Azure subscription. Enable security data collection as described in the Azure
Security Center quick start guide. When data collection is enabled, Security Center automatically scans any virtual
machines created under that subscription.
The Jenkins server has its own user management system, and the Jenkins community provides best practices for
securing a Jenkins instance on Azure. The solution template for Jenkins on Azure implements these best practices.

Manageability considerations
Use resource groups to organize the Azure resources that are deployed. Deploy production environments and
development/test environments in separate resource groups, so that you can monitor each environment's
resources and roll up billing costs by resource group. You can also delete resources as a set, which is very useful for
test deployments.
Azure provides several features for monitoring and diagnostics of the overall infrastructure. To monitor CPU usage,
this architecture deploys Azure Monitor. For example, you can use Azure Monitor to monitor CPU utilization, and
send a notification if CPU usage exceeds 80 percent. (High CPU usage indicates that you might want to scale up the
Jenkins server VM.) You can also notify a designated user if the VM fails or becomes unavailable.

Communities
Communities can answer questions and help you set up a successful deployment. Consider the following:
Jenkins Community Blog
Azure Forum
Stack Overflow Jenkins
For more best practices from the Jenkins community, visit Jenkins best practices.

Deploy the solution


To deploy this architecture, follow the steps below to install the solution template for Jenkins on Azure, and then run the scripts that set up monitoring and disaster recovery.
Prerequisites
This reference architecture requires an Azure subscription.
To create an Azure service principal, you must have admin rights to the Azure AD tenant that is associated with
the deployed Jenkins server.
These instructions assume that the Jenkins administrator is also an Azure user with at least Contributor
privileges.
Step 1: Deploy the Jenkins server
1. Open the Azure Marketplace image for Jenkins in your web browser and select GET IT NOW from the left
side of the page.
2. Review the pricing details and select Continue, then select Create to configure the Jenkins server in the
Azure portal.
For detailed instructions, see Create a Jenkins server on an Azure Linux VM from the Azure portal. For this
reference architecture, it is sufficient to get the server up and running with the admin logon. Then you can
provision it to use various other services.
Step 2: Set up SSO
This step is run by the Jenkins administrator, who must also have a user account in the subscription's Azure AD
directory and must be assigned the Contributor role.
Use the Azure AD Plugin from the Jenkins Update Center in the Jenkins server and follow the instructions to set up
SSO.
Step 3: Provision Jenkins server with Azure VM Agent plugin
This step is run by the Jenkins administrator to set up the Azure VM Agent plugin, which is already installed.
Follow these steps to configure the plugin. For a tutorial about setting up service principals for the plugin, see Scale
your Jenkins deployments to meet demand with Azure VM agents.
Step 4: Provision Jenkins server with Azure Storage
This step is run by the Jenkins administrator, who sets up the Windows Azure Storage Plugin, which is already
installed.
Follow these steps to configure the plugin.
Step 5: Provision Jenkins server with Azure Credential plugin
This step is run by the Jenkins administrator to set up the Azure Credential plugin, which is already installed.
Follow these steps to configure the plugin.
Step 6: Provision Jenkins server for monitoring by the Azure Monitor Service
To set up monitoring for your Jenkins server, follow the instructions in Create metric alerts in Azure Monitor for
Azure services.
Step 7: Provision Jenkins server with managed disks for disaster recovery
The Microsoft Jenkins product group has created disaster recovery scripts that build a managed disk used to save
the Jenkins state. If the server goes down, it can be restored to its latest state.
Download and run the disaster recovery scripts from GitHub.
You may wish to review the following Azure example scenario that demonstrates specific solutions using some of
the same technologies:
CI/CD pipeline for container-based workloads
SharePoint Farm for Development Testing
11/2/2020 • 2 minutes to read • Edit Online

Solution Idea

This solution provides a small-scale deployment of SharePoint that uses a single Azure Active Directory (Azure AD), SQL, and SharePoint resources. It addresses the capability to deliver agile development architecture to teams within your business using the latest supported platforms.

Architecture

Download an SVG of this architecture.

Data Flow
1. Create resource group for the storage, network, and virtual machine, plus other dependent elements.
2. Create a virtual network to host the virtual machines and load balancers for the deployment. Ensure the network has appropriate network security groups implemented to protect network traffic flow.
3. Create the storage accounts that will host the virtual machine images (VHDs).
4. Create the Active Directory installation, either by using a new virtual machine or by using Azure Active Directory Domain Services. If you use Azure Active Directory, also consider synchronizing identities to Azure AD with Azure AD Connect.
5. Install a supported version of SQL Server on an Azure VM, or deploy a pay-as-you-go instance of SQL Server.
6. Deploy SharePoint onto an Azure VM, or use a trial image from the gallery that already has SharePoint Server installed.
7. Create the SharePoint farm.
8. Set up an Azure external load balancer to direct incoming HTTPS traffic to the SharePoint server.
9. Use ExpressRoute or VPN Gateway for management access to the resource group.
10. On-premises users can access the SharePoint sites via the internet, ExpressRoute, or VPN Gateway.
11. External users can be granted access as required to the SharePoint sites for testing.

Components
Azure Resource Group: Container that holds related resources for an Azure solution
Virtual Network: Provision private networks, optionally connect to on-premises datacenters
Storage Accounts: Durable, highly available, and massively scalable cloud storage
Azure Active Directory: Synchronize on-premises directories and enable single sign-on
SharePoint Server: Microsoft's collaboration server product
Host enterprise SQL Server apps in the cloud
Load Balancer: Deliver high availability and network performance to your applications
Azure ExpressRoute: Dedicated private network fiber connections to Azure
VPN Gateway: Establish secure, cross-premises connectivity

Next steps
Azure Resource Group Documentation
Virtual Network Documentation
Storage Documentation
Active Directory Documentation
SharePoint Server Documentation
SQL Server Documentation
Load Balancer Documentation
ExpressRoute Documentation
VPN Gateway Documentation
Sharing location in real time using low-cost serverless
Azure services
11/2/2020 • 6 minutes to read • Edit Online

This scenario describes how to architect a solution that processes changes to underlying data within a web view
without the need for a page refresh using real-time services. Examples that use this scenario include real-time
tracking of products and goods, as well as social media solutions.
In this scenario, we look at how to set up a real-time messaging service to share live location of a food delivery
service transaction. This example can also be useful for users trying to build a real-time location sharing platform
for their web or mobile applications.
We'll use Azure SignalR Service configured in serverless mode, integrated with an Azure Functions app that's triggered by Azure Service Bus, all using .NET Core.

Potential use cases


These other use cases have similar design patterns:
Sharing real-time location with client devices.
Pushing notifications to users.
Updating timelines.
Creating chat rooms.

Architecture
Components
Service Bus, a highly reliable cloud messaging service between applications and services, even when one or
more is offline.
SignalR makes it easy to add real-time communications to your web application.
Azure Functions, an event-driven serverless compute platform that can also solve complex orchestration
problems.

Considerations
Here are some of the considerations taken in developing this scenario, including how to configure parameters in the Azure Service Bus connection string in the ServiceBusTrigger, and the two key SignalR concepts:
Hubs: Hubs can be compared to a video streaming service. You can subscribe to a hub to send messages to it and receive messages from it.
Targets: Targets are like radio channels. Everyone who is listening to the target channel is notified when there is a new message on it.
If you keep these two features of the SignalR platform in mind, it will be easy to get up and running quickly.
Availability, scalability, and security
You can achieve high availability of this solution by performing the following steps:
Regional pairing
Each Azure region is paired with another region within the same geography. In general, choose regions from the
same regional pair (for example, East US 2 and Central US). Benefits of doing so include:
If there is a broad outage, recovery of at least one region out of every pair is prioritized.
Planned Azure system updates are rolled out to paired regions sequentially to minimize possible downtime.
In most cases, regional pairs reside within the same geography to meet data residency requirements.
However, make sure that both regions support all of the Azure services needed for your application. See
Services by region. For more information about regional pairs, see Business continuity and disaster recovery
(BCDR): Azure Paired Regions.
Azure Front Door

Azure Front Door is a scalable and secure entry point for fast delivery of your global applications. Using priority
routing, it automatically fails over if the primary region becomes unavailable. A multi-region architecture can
provide higher availability than deploying to a single region. If a regional outage affects the primary region, you
can use Front Door to fail over to the secondary region. This architecture can also help if an individual subsystem
of the solution fails. Stop network and application layer attacks at the edge with Web Application Firewall and
DDoS Protection. Harden your service using Microsoft-managed rule sets and author your own rules for custom
protection of your app.
Front Door is a possible failure point in the system. If the service fails, clients cannot access your application during
the downtime. Review the Front Door service level agreement (SLA) and determine whether using Front Door
alone meets your business requirements for high availability. If not, consider adding another traffic management
solution as a fallback. If the Front Door service fails, change your canonical name (CNAME) records in DNS to point
to the other traffic management service. This step must be performed manually, and your application will be
unavailable until the DNS changes are propagated.
Pricing this scenario
Assuming your business has 1,000 orders in a day and needs to share location data for all of them concurrently, your estimated Azure usage for deploying this scenario will be close to $192 per month, based on pricing at the time of writing.

Service type              Estimated monthly cost
Azure Functions           $119.40
Azure SignalR Service     $48.97
Service Bus               $23.71
Total                     $192.08
Deploy this scenario
Azure Functions development
A serverless real-time application built with Azure Functions and Azure SignalR Service typically requires two
Azure Functions:
A "negotiate" function that the client calls to obtain a valid SignalR Service access token and service endpoint
URL.
One or more functions that send messages or manage group membership.
FunctionApp6
This is a function app that creates an Azure Function with Service Bus trigger with SignalR.
Negotiate.cs
This function is triggered by an HTTP request. It is used by client applications to get a token from the SignalR service, which clients can use to subscribe to a hub. The function should be named "negotiate." For more information, read this guide.
Message.cs
This function is triggered by a Service Bus trigger and has an output binding to the SignalR service. It pulls the message from the queue and passes it on to a SignalR hub.
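The sample uses .NET Core; as a rough illustration of the same pattern in Python, a Service Bus-triggered function can forward each queue message to a SignalR hub through an output binding. The binding names, hub name, target name, and message shape below are assumptions, and the bindings themselves would be declared in the function's function.json.

```python
# Rough Python illustration of Message.cs: Service Bus trigger in, SignalR output binding out.
# Assumes function.json declares a serviceBusTrigger binding named "msg" and a signalR output
# binding named "signalr_messages" for a hub such as "locations". All names are placeholders.
import json
import azure.functions as func

def main(msg: func.ServiceBusMessage, signalr_messages: func.Out[str]) -> None:
    payload = msg.get_body().decode("utf-8")  # e.g. a JSON document with the courier's coordinates

    # Broadcast to every client listening on the "newLocation" target of the hub.
    signalr_messages.set(json.dumps({
        "target": "newLocation",
        "arguments": [payload],
    }))
```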
Instructions
1. Make sure you have a Service Bus Queue provisioned on Azure.
2. Make sure you have a SignalR service provisioned in serverless mode on Azure.
3. Enter your connection strings (Service Bus & SignalR) in the local.settings.json file.
4. Enter the URL of the client application (SignalR client) in CORS. This guide provides the most recent syntax.
5. Enter your Service Bus queue name in the Service Bus trigger inside the Message.cs file.
Now, let's configure the client application to test. First, grab the example sources from here.
SignalR client
This is a simple .NET Core web application that subscribes to the hub created by FunctionApp6 and displays messages received on the Service Bus queue in real time. Although you can use FunctionApp6 with a mobile client, for the purpose of this repository we'll stick to the web client.
Instructions
1. Make sure FunctionApp6 is running first.
2. Copy the URL generated by the Negotiate function. It will look something like this: http://localhost:7071/api/
3. Paste the URL into chat.js inside signalR.HubConnectionBuilder().withUrl("YOUR_URL_HERE").build();
4. Run the application.
5. You'll see status connected when the web client successfully subscribes to the SignalR hub.
SendToQueue.js
This is a Node.js script that pushes a message to the Service Bus queue so that you can test the deployment you did above. (A Python equivalent is sketched after these instructions.)
Instructions
1. Install the node Azure Service Bus module (@azure/service-bus).
2. Enter your connection strings and queue name in the script.
3. Run the script.
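If you prefer Python for the test sender, an equivalent sketch with the azure-servicebus package is shown below; the connection string, queue name, and message body are placeholders.

```python
# Equivalent test sender in Python: push one message onto the Service Bus queue.
# Connection string, queue name, and message body are placeholders.
import json
import os
from azure.servicebus import ServiceBusClient, ServiceBusMessage

connection_string = os.environ["SERVICE_BUS_CONNECTION_STRING"]
queue_name = "locations"

with ServiceBusClient.from_connection_string(connection_string) as client:
    with client.get_queue_sender(queue_name) as sender:
        body = json.dumps({"orderId": "42", "lat": 47.64, "lon": -122.13})
        sender.send_messages(ServiceBusMessage(body))
```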

Next steps
You can take this scenario into your production environment; however, make sure your Azure services are set to scale. For instance, your Azure Service Bus namespace should use the Standard or Premium tier.
You can deploy the code to Azure Functions right from Visual Studio. Follow this guide to learn how to publish your
code to Azure Functions from Visual Studio.

Alternatives
Alternatives exist to address this scenario, including Pusher, which provides robust APIs for app developers building scalable real-time communication features. There is also PubNub, which makes it easy to add real-time capabilities to your apps without worrying about the infrastructure, so you can build apps that let your users engage in real time across mobile, browser, desktop, and server.
Pusher and PubNub are widely adopted platforms for real-time messaging, but for this scenario we do everything in Azure. SignalR is the natural choice here because it allows bi-directional communication between server and client. It's also an open-source tool, with 7.9 K GitHub stars and 2.2 K GitHub forks.
Here's a link to SignalR's open-source repository on GitHub.

Related resources
This article explains how to work with Azure Service Bus bindings in Azure Functions. Azure Functions supports
trigger and output bindings for Service Bus queues and topics.
This article explains how to authenticate and send real-time messages to clients connected to Azure SignalR
Service by using SignalR Service bindings in Azure Functions. Azure Functions supports input and output bindings
for SignalR Service.
Run containers in a hybrid environment
11/2/2020 • 7 minutes to read • Edit Online

This reference architecture illustrates how developers can create, manage, and monitor deployed containers in the
public cloud, across multiple clouds, and on-premises.

Download a Visio file of this architecture.


Typical uses for this architecture include:
Web applications with internal and external components that deploy both to the public cloud and on-premises
by using shared container images.
Modern deployment testing cycles with quality analysis, testing, development, or staging that's hosted on-
premises and in the public cloud.

Architecture
Microsoft Azure Container Registry (ACR). ACR is a service that creates a managed registry. ACR builds, stores, and manages container images and can store containerized machine learning models.
Azure Kubernetes Service (AKS). AKS is a managed service that offers a managed Kubernetes cluster with elastic scale-out functionality.
Azure Container Instances. Azure Container Instances runs containers on-demand in a serverless Microsoft Azure environment. Azure Container Instances is a low-friction method of running containers that doesn't require a full Docker host or Kubernetes installation.
Azure Cosmos DB. Azure Cosmos DB is a multiple model database that can serve data elastically at a massive scale. Azure Cosmos DB was designed for applications that are globally distributed in a multi-write model.
Azure Key Vault. Azure Key Vault is a hardware-backed credential management service that has tight integration with Microsoft identity services and compute resources.
Azure Policy. Azure Policy enforces standards and assesses compliance for targeted resources deployed to Azure.
Azure Private Link. Azure Private Link creates a private endpoint in your virtual network that you can use to communicate with Azure platform as a service (PaaS) without exposing your service to the public internet.
Azure Monitor. Azure Monitor is an all-encompassing suite of monitoring services for applications that deploy both in Azure and on-premises.
Azure Security Center. Azure Security Center is a unified security management and threat protection system for workloads across on-premises, multiple clouds, and Azure.
On-premises Kubernetes cluster. In this architecture, a local Kubernetes cluster is used to run multiple containers on-premises.

Recommendations
Azure Container Registry
ACR is an enterprise container registry that can implement common best practices by protecting images from
unauthorized access, replicating images across multiple geographies, preventing unnecessary ingress/egress, and
optimizing costs. Additionally, ACR supports turnkey geo-replication across multiple Azure regions, which helps you
minimize latency between ACR, your container hosts, and your development team.
ACR includes a suite of tasks, referred to as ACR Tasks, that can manage cloud-based container image building and
maintenance across a variety of operating systems. ACR Tasks can be triggered manually, by a change to source
control, by a change to the base container image, or on a fixed schedule. The following are scenarios in which you
could use ACR Tasks:
An Internet of Things (IoT) developer is building container images to run on ARM-based IoT devices. The
developer might be using a Linux or macOS operating system to develop the software, but they will need to
perform the build on an ARM platform.
A software as a service (SaaS) development team is building software on Windows computers that will run their
container images on Linux hosts. The team would like its build to be performed on a Linux host.
An open source project maintainer is building a container image that augments a well-known operating system
base image. The maintainer will likely want their container image to update every time the base image updates.

NOTE
ACR Tasks can standardize the build environment and perform continuous integration of your container images.

Azure Container Instances


Azure Container Instances is a low-friction, serverless compute environment for containerized applications. Azure
Container Instances is an excellent choice for container deployment because of its low management overhead and
quick startup times. Container images that are stored in ACR can deploy directly to Azure Container Instances
container groups.
In this reference, Azure Container Instances container groups are utilized as virtual nodes for an Azure Kubernetes
Service cluster. AKS uses virtual nodes to register a virtual pod with unlimited capacity and the ability to dispatch
pods by using Azure Container Instances container groups. This is ideal for scenarios where you want very fast
provisioning of individual pods and only want to pay for the execution time per second.
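As a sketch of deploying an ACR image directly to an Azure Container Instances container group with the Python management SDK, the snippet below creates a small Linux container group. The resource names, image, and sizing are placeholders, and the model and method names follow the track 2 azure-mgmt-containerinstance package as the author recalls them; verify them against the SDK version you use.

```python
# Sketch: run an image from ACR as an ACI container group. Names and sizes are placeholders.
# A private ACR image also needs image_registry_credentials on the container group; omitted here.
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerinstance import ContainerInstanceManagementClient
from azure.mgmt.containerinstance.models import (
    Container, ContainerGroup, ResourceRequests, ResourceRequirements,
)

subscription_id = "<subscription-id>"
client = ContainerInstanceManagementClient(DefaultAzureCredential(), subscription_id)

container = Container(
    name="web",
    image="contosoregistry.azurecr.io/web:1.0",
    resources=ResourceRequirements(requests=ResourceRequests(cpu=1.0, memory_in_gb=1.5)),
)

poller = client.container_groups.begin_create_or_update(
    resource_group_name="rg-hybrid-containers",
    container_group_name="web-on-demand",
    container_group=ContainerGroup(location="eastus2", containers=[container], os_type="Linux"),
)
container_group = poller.result()  # Billing is per second while the group runs.
```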

Scalability considerations
Customer-facing containerized web applications benefit from variable scales. You can use services such as Azure
Container Instances and AKS to dynamically scale out to meet anticipated or measured demand. Additionally,
you can use services such as Azure Functions and Azure App Service to run container images at scale.
Internal application usage is more predictable and can run on an existing Kubernetes cluster. If you're interested
in deploying Azure-managed services on-premises, consider:
Connecting a Kubernetes cluster to Azure Arc.
Deploying Functions or App Service in Azure Stack Hub.
Azure Cosmos DB automatically scales service resources to meet the storage needs of your application in an
elastic manner. For throughput, you can choose to pre-provision throughput or operate Azure Cosmos DB as a
serverless service. If your workload has variable or unpredictable demands, you can also choose to provision
your throughput using autoscale.

Availability considerations
Modern applications typically include a website, one or more HTTP APIs, and some connection to a data store.
Applications within a container image should be designed to be stateless for maximum horizontal scale and
availability. Any data should be stored in a separate service that has similar availability. For guidance on
designing an application that can scale to thousands of nodes, refer to the performance efficiency section of the
Azure Well-Architected Framework.
AKS has a reference architecture baseline that defines each of the Well-Architected Framework categories and
recommends an implementation in line with the category.
To reduce the impact of large pulls of container images, deploy ACR in a region that's closest to the development
team and the production compute services. Consider a geo-replicated ACR deployment for distributed teams
and distributed production containers.
Azure Cosmos DB is a database service that supports turnkey global distribution and supports automatic
failover across multiple regions. Azure Cosmos DB also has the ability to enable multiple region writes and
dynamically add or remove regions.

Manageability considerations
Consider using Azure Resource Manager templates to deploy Azure Container Instance container groups in a
repeatable fashion for multiple region deployments and large-scale orchestration. You can similarly use Azure
Resource Manager templates to deploy Azure Kubernetes Service, Azure Key Vault, and Azure Cosmos DB.
Consider utilizing role-based access control (RBAC) to prevent users from accidentally creating or deleting
container instances without permission.
Use Azure Monitor to monitor metrics and logs for both on-premises and remote containers, analyze the data
using queries, and create alerts for abnormal situations.
Use Azure Policy to implement enforcement of a set of rules for clusters and pods deployed to Kubernetes
Service or an Azure Arc-enabled Kubernetes cluster.

DevOps considerations
Use ACR Tasks to automate the build of container images on a schedule or when changes are made to the
source code. Additionally, consider using ACR Tasks to automatically update container images as base images
are patched and updated.
The AKS team has developed GitHub actions that can assist with implementing GitOps and facilitate
deployments from ACR to AKS clusters.
If your Kubernetes cluster is attached to Azure Arc, you can manage your Kubernetes cluster using GitOps. To
review best practices for connecting a hybrid Kubernetes cluster to Azure Arc, refer to the Azure Arc hybrid
management and deployment for Kubernetes clusters reference architecture.

Security considerations
Use Azure Private Link to communicate to and across services in your virtual network. This will route traffic
through specific subnets to reach the individual Azure services directly and protect your data from inadvertent
exposure to the public internet.

Cost considerations
Use the Azure pricing calculator to estimate costs.
If your development team and production instances are in a single region, consider placing the Container
Registry resource in the same region. This will allow you to minimize container push and pull latency and avoid
the additional costs associated with the Premium Azure Container Registry service tier.
Configuring ACR to use an Azure Virtual Network through an Azure Private Link service endpoint requires the
ACR instance to be deployed in the Premium tier.
AKS offers free cluster management. Billing is isolated to the compute, storage, and networking resources used
by AKS to host nodes. Refer to Azure Virtual Machine or Azure Container Instances pricing to review specific
pricing details for each compute service.
If you require a specific uptime service-level agreement (SLA), you can enable the uptime SLA optional feature
of AKS.
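As a sketch, the uptime SLA feature can be turned on for an existing cluster with the Azure CLI; the names are placeholders and the feature is billed per cluster.

az aks update \
  --resource-group my-container-rg \
  --name my-aks-cluster \
  --uptime-sla
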
Azure Container Instances resources are billed per second, based on an allocation of virtual CPU and memory
resources, to the container group. Allocating unnecessary compute resources can significantly increase the
costs required to run this architecture solution. Cost monitoring and optimization is a continuous process that
should be conducted at regular intervals throughout the lifetime of your deployment. For further guidance on
minimizing Azure Container Instances operational costs, refer to the cost optimization section of the Azure Well-
Architected Framework.
Building solutions for high availability using
Availability Zones

Microsoft Azure global infrastructure is designed and constructed at every layer to deliver the highest levels of
redundancy and resiliency to its customers. Azure infrastructure is composed of geographies, regions, and
Availability Zones, which limit the blast radius of a failure and therefore limit potential impact to customer
applications and data. The Azure Availability Zones construct was developed to provide a software and networking
solution to protect against datacenter failures and to provide increased high availability (HA) to our customers.
Availability Zones are unique physical locations within an Azure region. Each zone is made up of one or more
datacenters with independent power, cooling, and networking. The physical separation of Availability Zones within
a region limits the impact to applications and data from zone failures, such as large-scale flooding, major storms
and superstorms, and other events that could disrupt site access, safe passage, extended utilities uptime, and the
availability of resources. Availability Zones and their associated datacenters are designed such that if one zone is
compromised, the services, capacity, and availability are supported by the other Availability Zones in the region.
Availability Zones can be used to spread a solution across multiple zones within a region, allowing for an
application to continue functioning when one zone fails. With Availability Zones, Azure offers an industry-best 99.99%
Virtual Machine (VM) uptime service-level agreement (SLA). Zone-redundant services replicate your services and
data across Availability Zones to protect from single points of failure.

For additional information on Availability Zones, including service support by region and pricing, refer to What are
Availability Zones in Azure? in Microsoft Azure Documentation.

Delivering reliability in Azure


Designing solutions that continue to function in spite of failure is key to improving the reliability of a solution. In
cloud-based solutions, building to survive failure is a shared responsibility. This can be viewed at three levels: a
resilient foundation, resilient services, and resilient applications. The foundation is the Microsoft investment in the
platform, including Availability Zones. On top of this foundation are the Azure services that customers can enable
to support high availability, such as zone-redundant storage (ZRS), which replicates data across zones. The
customer builds applications upon the enabled services supported by the foundation. The applications should be
architected to support resiliency.

Your applications
Your app or workload architecture
Resilient services
Azure capabilities you enable as needed
Resilient foundation
Azure capabilities built into the platform

When architecting for resilience, all three layers (foundation, services, and applications) should be considered to
achieve the highest level of reliability. Since a solution can be made up of many components, each component
should be designed for reliability.

Zonal vs. zone-redundant architecture


An Availability Zone in an Azure region is a combination of a fault domain and an update domain. For example, if
you create three or more VMs across three zones in an Azure region, your VMs are effectively distributed across
three fault domains and three update domains. The Azure platform recognizes this distribution across update
domains to take care that VMs in different zones are not updated at the same time.
Azure services supporting Availability Zones fall into two categories: zonal and zone redundant. Customer
workloads can be categorized to utilize either architecture scenario to meet application performance and durability
requirements.
With zonal architecture, a resource can be deployed to a specific, self-selected Availability Zone to achieve more
stringent latency or performance requirements. Resiliency is self-architected by replicating applications and data to
one or more zones within the region. You can choose specific Availability Zones for synchronous replication,
providing high availability, or asynchronous replication, providing backup or cost advantage. You can pin
resources (for example, virtual machines, managed disks, or standard IP addresses) to a specific zone, allowing for increased resilience by spreading one or more instances of a resource across zones.
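A minimal sketch of pinning a resource to a zone with the Azure CLI, assuming placeholder names and image; you are prompted for administrator credentials when the command runs.

# Creates a VM (and its managed disk) pinned to Availability Zone 1.
az vm create \
  --resource-group my-app-rg \
  --name web-vm-z1 \
  --image Win2019Datacenter \
  --zone 1 \
  --admin-username azureuser
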
With zone-redundant architecture, the Azure platform automatically replicates the resource and data across zones.
Microsoft manages the delivery of high availability, since Azure automatically replicates and distributes instances
within the region.
A failure to a zone affects zonal and zone-redundant services differently. In the case of a zone failure, the zonal
services in the failed zone become unavailable until the zone has recovered. By architecting your solutions to use
replicated VMs in zones, you can protect your applications and data from a zone becoming unavailable, for example, due to a power outage. If one zone is compromised, replicated apps and data are instantly available in
another zone.
Zonal architecture applies to a specific resource, typically an infrastructure as a service (IaaS) resource, like a VM or
managed disk, as illustrated.
Examples include a zonal load balancer, VMs, managed disks, and VM scale sets.
In the illustration, each VM and load balancer (LB) is deployed to a specific zone.
With zone-redundant services, the distribution of the workload is a feature of the service and is handled by Azure.
Azure automatically replicates the resource across zones without requiring your intervention. ZRS, for example,
replicates the data across three zones so a zone failure does not impact the HA of the data.
The following illustration is of a zone-redundant load balancer.

Examples include a zone-redundant load balancer, Azure Application Gateway, Azure Service Bus, VPN gateway, zone-redundant storage, Azure ExpressRoute, Azure Event Hubs, and Azure Cosmos DB.
A few resources, like the load balancer and subnets, support both zonal and zone-redundant deployments. An
important consideration in HA is distributing the traffic effectively across resources in the different Availability
Zones. For information on how Availability Zones apply to the load balancer resources for both zonal and zone-
redundant resources, refer to Standard Load Balancer and Availability Zones.
The following is a summary of the zonal (Z) and zone-redundant (ZR) Azure services.
Analytics: Azure Data Explorer (ZR), Azure Event Hubs (ZR)
Compute: Linux virtual machines (Z), Windows virtual machines (Z), Virtual machine scale sets (Z, ZR), Azure App Service Environments (Z)
Containers: Azure Kubernetes Service (AKS) (Z), Azure Service Fabric (Z)
Databases: Azure SQL Database (ZR), Azure Cache for Redis (Z, ZR), Azure Cosmos DB (ZR)
DevOps: Azure DevOps (ZR)
Identity: Azure Active Directory Domain Services (ZR)
Integration: Azure Event Grid (ZR), Azure Service Bus (ZR)
Management and governance: Azure Traffic Manager (ZR)
Networking: Azure Load Balancer (Z, ZR), VPN gateway (ZR), Azure ExpressRoute (ZR), Azure Application Gateway (ZR), Azure Firewall (ZR), Azure Virtual WAN (ZR)
Security: Azure Active Directory Domain Services (ZR)
Storage: Azure Data Lake Storage (ZRS), Azure Blob storage (ZRS, ZR), Azure Managed Disks (Z)
Additional capabilities: Azure Premium Files (ZRS), Zone-redundant storage (ZRS), Standard IP address (ZR), Azure Traffic Analytics (ZR)

VMs supporting AZs: AV2-series, B-series, DSv2-series, DSv3-series, Dv2-series, Dv3-series, ESv3-series, Ev3-series,
F-series, FS-series, FSv2-series, M-series.
For a list of Azure services that support Availability Zones, per Azure region, refer to the Availability Zones
documentation.

SLA offered by Availability Zones


With Availability Zones, Azure offers an industry-best 99.99% VM uptime SLA. The full Azure SLA explains the
guaranteed availability of Azure as a whole.
The following diagram illustrates the different levels of HA offered by a single VM, Availability Sets, and Availability
Zones.
Using a VM workload as an example, a single VM has an SLA of 99.9%. This means the VM will be available 99.9%
of the time. Within a single datacenter, the use of Availability Sets can increase the level of SLA to 99.95% by
protecting a set of VMs, ensuring they will not all be on the same hardware. Within a region, VM workloads can be
distributed across Availability Zones to increase the SLA to 99.99%. For more information, refer to Availability
options for VMs in Azure.
Every organization has unique requirements, and you should design your applications to best meet your complex
business needs. Defining a target SLA will make it possible to evaluate whether the architecture meets your
business requirements. Some things to consider include:
What are the availability requirements?
How much downtime is acceptable?
How much will potential downtime cost your business?
How much should you invest in making the application highly available?
What are the data backup requirements?
What are the data replication requirements?
What are the monitoring requirements?
Does your application have specific latency requirements?
For additional guidance, refer to Microsoft Azure Well-Architected Framework define requirements.
Depending on the availability needs of an application, the cost and design complexity will vary. When building for a
VM workload, there will be a cost associated with each VM. For example, two VMs per zone across three active
zones will have a cost for six VMs. For pricing of VM workloads, refer to the Azure pricing calculator.
IaaS: Web application with relational database

Availability Zones are unique physical locations within an Azure region. Each zone is made up of one or more
datacenters with independent power, cooling, and networking. The physical separation of availability zones within a
region limits the impact to applications and data from zone failures, such as large-scale flooding, major storms and superstorms, and other events that could disrupt site access, safe passage, extended utilities uptime, and the availability
of resources. This reference architecture shows best practices for applying Availability Zones to a web application
and Microsoft SQL Server database hosted on virtual machines (VMs) known as a zonal deployment.
This approach is used in high availability (HA) scenarios with resiliency as the most significant concern. With HA
architecture there is a balance between high resilience, low latency, and cost. This architecture uses redundant
resources spread across zones to provide high resiliency. Traffic can be routed between zones to minimize the
impact of a zonal failure. If a zone does fail, resources in the remaining zones absorb the traffic until the failed zone recovers. This
provides a high level of resilience.
This architecture provides efficient use of resources, because most of the resources are actively used. All resources other than the passive SQL Server are used to handle requests. The passive SQL Server becomes active only if the current active SQL Server fails.
The zone-redundant application gateway and zone-redundant load balancer distribute the traffic to the available
resources. Cross-zone routing will have increased latency relative to traffic routed within the zone.

Download a Visio file of this architecture.

Architecture
The architecture uses resources spread across multiple zones to provide a high availability (HA) architecture for
hosting an Infrastructure as a Service (IaaS) web application and SQL Server database. A zone-redundant
application gateway routes traffic to virtual machines within the web tier. A zone-redundant load balancer routes
traffic from the virtual machines in the web tier to the active SQL Server. In the case of a zonal failure, the
application gateway will route to virtual machines in the other available zones.
If the active SQL Server becomes unavailable, either due to a zone failure or local failure, a passive SQL Server will
become the active SQL Server. The zone-redundant load balancer will detect the failover of SQL Server and route
traffic to the newly active SQL Server.
The following illustrates a failure of Zone 1.

The application gateway is zone-redundant and is not impacted by the failure of Zone 1. The application gateway
uses health probes to determine the available virtual machines. With Zone 1 unavailable, the zone-redundant
application gateway only routes traffic to the remaining two zones. The zone-redundant load balancer is also not
impacted by the failure of Zone 1 and uses health probes to determine the location of the active SQL Server. In this
example, the load balancer will detect that the active SQL Server is in Zone 3 and will then route traffic to the SQL
Server instance in Zone 3.
Spreading resources across Availability Zones also protects an application from planned maintenance. When VMs
are distributed across three Availability Zones, effectively they are spread across three update domains. The Azure
platform recognizes this distribution across update domains to ensure that VMs in different zones are not updated
at the same time.
By replicating VMs across Availability Zones, you can protect your applications and data from a zonal failure. This
allows Azure to support the industry-best 99.99% VM uptime service-level agreement (SLA). See Building solutions
for high availability using Availability Zones.
The architecture has the following components.
General
Resource group. Resource groups are used to group Azure resources so they can be managed by lifetime,
owner, or other criteria.
Availability Zones. Availability Zones are physical locations within an Azure region. Each zone consists of
one or more datacenters with independent power, cooling, and networking. By placing VMs across zones, the
application becomes resilient to failures within a zone.
Networking and load balancing
Virtual network and subnets. Every Azure VM is deployed into a virtual network (VNet) that can be
segmented into subnets. Create a separate subnet for each tier.
Azure Application Gateway. Azure Application Gateway is a layer 7 load balancer. In this architecture, a
zone-redundant application gateway routes HTTP requests to the web front end. Application Gateway also
provides a Web Application Firewall (WAF) that protects the application from common exploits and
vulnerabilities. The v2 SKU of Application Gateway supports cross-zone redundancy. A single Application
Gateway deployment can run multiple instances of the gateway. For production workloads, run at least two
instances. For more information, see Autoscaling and Zone-redundant Application Gateway v2 and How
does Application Gateway support high availability and scalability?.
Azure Load Balancer. Azure Load Balancer is a layer 4 load balancer. In this architecture, a zone-redundant
Azure Standard Load Balancer directs network traffic from the web tier to SQL Server. Because a zone-
redundant load balancer is not pinned to a specific zone, the application will continue to distribute the
network traffic in the case of a zonal failure. A zone-redundant load balancer is used to provide availability in
the case the active SQL Server becomes unavailable. The Standard SKU of Azure Load Balancer supports
cross-zone redundancy. For more information, see Standard Load Balancer and Availability Zones.
Network security groups (NSGs). A network security group is used to restrict network traffic within the
virtual network. In this architecture, the web tier only accepts traffic from the public IP endpoint. Additionally,
the database tier does not accept traffic from any subnet other than the web-tier subnet.
DDoS protection. The Azure platform provides protection against distributed denial of service (DDoS)
attacks. For additional protection, we recommend using DDoS Protection Standard, which has enhanced
DDoS mitigation features. See Security considerations.
Bastion. Azure Bastion provides secure and seamless Remote Desktop Protocol (RDP) and Secure Shell (SSH) access to the VMs within the VNet, while limiting the public IP addresses that those VMs expose. Azure Bastion is a cost-effective alternative to provisioning a separate VM to provide access to all VMs within the same virtual network.
Microsoft SQL Server
SQL Server Always On availability group. SQL Server Always On availability group provides high
availability at the data tier by enabling replication and failover. It uses Windows Server Failover Cluster
(WSFC) technology for failover.
Cloud Witness. A failover cluster requires more than half of its nodes to be running, which is known as
having quorum. If the cluster has just two nodes, a network partition could cause each node to think it's the
primary node. In that case, you need a witness to break ties and establish quorum. A witness is a resource
such as a shared disk that can act as a tie breaker to establish quorum. Cloud Witness is a type of witness
that uses Azure Blob Storage. The Azure Blob Storage must use Zone Redundant Storage (ZRS) to not be
impacted by zonal failure.
To learn more about the concept of quorum, see Understanding cluster and pool quorum. For more
information about Cloud Witness, see Deploy a Cloud Witness for a Failover Cluster.

Recommendations
Your requirements might differ from the architecture described here. Use these recommendations as a starting
point.
For recommendations on configuring the VMs, see Run a Windows VM on Azure.
For more information about designing virtual networks and subnets, see Plan and design Azure Virtual Networks.
Network security groups
Use NSG rules to restrict traffic between tiers. In the architecture shown earlier, only the web tier can communicate
directly with the database tier. To enforce this rule, the database tier should block all incoming traffic except for the
web-tier subnet.
Deny all inbound traffic from the virtual network. (Use the VIRTUAL_NETWORK tag in the rule.)
Allow inbound traffic from the web-tier subnet.
Allow inbound traffic from the database-tier subnet itself. This rule allows communication between the
database VMs, which is needed for database replication and failover.
Create the two allow rules with a higher priority than the deny rule, so they override it.
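A hedged sketch of the rules above, expressed with the Azure CLI; the resource group, NSG name, and subnet prefixes are placeholders for your own values. Lower priority numbers take precedence, so the allow rules (100, 110) override the deny rule (300).

# Deny all inbound traffic from the virtual network.
az network nsg rule create --resource-group my-app-rg --nsg-name db-tier-nsg \
  --name deny-vnet-inbound --priority 300 --direction Inbound --access Deny \
  --source-address-prefixes VirtualNetwork --destination-port-ranges '*' --protocol '*'

# Allow the web-tier subnet to reach SQL Server.
az network nsg rule create --resource-group my-app-rg --nsg-name db-tier-nsg \
  --name allow-web-tier --priority 100 --direction Inbound --access Allow \
  --source-address-prefixes 10.0.1.0/24 --destination-port-ranges 1433 --protocol Tcp

# Allow traffic within the database-tier subnet for replication and failover.
az network nsg rule create --resource-group my-app-rg --nsg-name db-tier-nsg \
  --name allow-db-subnet --priority 110 --direction Inbound --access Allow \
  --source-address-prefixes 10.0.2.0/24 --destination-port-ranges '*' --protocol '*'
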
SQL Server Always On availability groups
We recommend Always On availability groups for Microsoft SQL Server high availability. Other tiers connect to the
database through an availability group listener. The listener enables a SQL client to connect without knowing the
name of the physical instance of SQL Server. VMs that access the database must be joined to the domain. The client
(in this case, another tier) uses DNS to resolve the listener's virtual network name into IP addresses.
Configure the SQL Server Always On availability group as follows:
Create a Windows Server Failover Clustering (WSFC) cluster, a SQL Server Always On availability group, and
a primary replica. For more information, see Getting Started with Always On availability groups.
Create an internal load balancer with a static private IP address.
Create an availability group listener and map the listener's DNS name to the IP address of an internal load
balancer.
Create a load balancer rule for the SQL Server listening port (TCP port 1433 by default). The load balancer
rule must enable floating IP, also called Direct Server Return. This causes the VM to reply directly to the client,
which enables a direct connection to the primary replica.

NOTE
When floating IP is enabled, the front-end port number must be the same as the back-end port number in the load balancer
rule.
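The following sketch illustrates a load balancer rule with floating IP for the listener; the load balancer, frontend, backend pool, and probe names are assumptions, and port 59999 is only a common convention for the availability group health probe.

# Health probe used by the internal load balancer to find the primary replica.
az network lb probe create \
  --resource-group my-app-rg --lb-name sql-ilb \
  --name sql-ag-probe --protocol tcp --port 59999

# Rule for the listener port; floating IP (Direct Server Return) is enabled and
# the frontend and backend ports match.
az network lb rule create \
  --resource-group my-app-rg --lb-name sql-ilb \
  --name sql-listener --protocol Tcp \
  --frontend-port 1433 --backend-port 1433 \
  --frontend-ip-name sql-listener-fe --backend-pool-name sql-backend \
  --probe-name sql-ag-probe --floating-ip true
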

When a SQL client tries to connect, the load balancer routes the connection request to the primary replica. If there
is a failover to another replica, the load balancer automatically routes new requests to a new primary replica. For
more information, see Configure a load balancer for an availability group on Azure SQL Server VMs.
During a failover, existing client connections are closed. After the failover completes, new connections will be routed
to the new primary replica.
If your application makes significantly more reads than writes, you can offload some of the read-only queries to a
secondary replica. See Connect to a read-only replica.
Test your deployment by forcing a manual failover of the availability group.

Availability considerations
Availability Zones provide high resilience within a single region. If you need even higher availability, consider
replicating the application across two regions, using Azure Traffic Manager for failover. For more information, see
Run an N-tier application in multiple Azure regions for high availability.
Not all regions support Availability Zones, and not all VM sizes are supported in all zones. Run the following Azure
CLI command to find the supported zones for each VM size within a region:

az vm list-skus --resource-type virtualMachines --zone false --location <location> \
    --query "[].{Name:name, Zones:locationInfo[].zones[] | join(',',@)}" -o table


Virtual machine scale sets automatically use placement groups, which act as an implicit availability set. For more
information about placement groups, see Working with large virtual machine scale sets.
Health probes
Azure Application Gateway and Azure Load Balancer both use health probes to monitor the availability of VM
instances.
Application Gateway always uses an HTTP probe.
Load Balancer can test either HTTP or TCP. Generally, if a VM runs an HTTP server, use an HTTP probe.
Otherwise, use TCP.
If a probe can't reach an instance within a timeout period, the gateway or load balancer stops sending traffic to that
VM. The probe continues to check and will return the VM to the back-end pool if the VM becomes available again.
HTTP probes send an HTTP GET request to a specified path and listen for an HTTP 200 response. This path can be
the root path ("/"), or a health-monitoring endpoint that implements some custom logic to check the health of the
application. The endpoint must allow anonymous HTTP requests.
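As a hedged example, an HTTP probe against such an endpoint could be created as shown below; the load balancer name and the /health path are assumptions for your own application.

# HTTP probe that marks a VM healthy when /health returns HTTP 200.
az network lb probe create \
  --resource-group my-app-rg --lb-name web-lb \
  --name web-health --protocol Http --port 80 --path /health
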
For more information about health probes, see:
Load Balancer health probes
Application Gateway health monitoring overview
For considerations about designing a health probe endpoint, see Health Endpoint Monitoring pattern.

Cost considerations
Use the Azure pricing calculator to estimate costs. Here are some other considerations.
Virtual machine scale sets
Virtual machine scale sets are available on all Windows VM sizes. You are only charged for the Azure VMs you
deploy, and any additional underlying infrastructure resources consumed, such as storage and networking. There
are no incremental charges for the virtual machine scale sets service.
For single VMs pricing options, see Windows VMs pricing.
SQL Server
If you choose Azure SQL DBaaS, you can save on cost because you don't need to configure an Always On
availability group and domain controller machines. There are several deployment options starting from single
database up to managed instance, or elastic pools. For more information, see Azure SQL pricing.
For SQL server VMs pricing options, see SQL VMs pricing.
Azure Load Balancer
You are charged only for the number of configured load-balancing and outbound rules. Inbound NAT rules are free.
There is no hourly charge for the Standard Load Balancer when no rules are configured.
For more information, see the cost section in Azure Architecture Framework.
Azure Application Gateway
The Application Gateway should be provisioned with the v2 SKU and can span multiple Availability Zones. With the
v2 SKU, the pricing model is driven by consumption and has two components: Hourly fixed price and a
consumption-based cost.
For more information, see the pricing section in Autoscaling and Zone-redundant Application Gateway v2.
Security considerations
Virtual networks are a traffic isolation boundary in Azure. By default, VMs in one virtual network can't
communicate directly with VMs in a different virtual network. However, you can explicitly connect virtual networks
by using virtual network peering.
Network security groups
Use network security groups (NSGs) to restrict traffic to and from the internet. For more information, see Microsoft
cloud services and network security.
DMZ
Consider adding a network virtual appliance (NVA) to create a DMZ between the internet and the Azure virtual
network. NVA is a generic term for a virtual appliance that can perform network-related tasks, such as firewall,
packet inspection, auditing, and custom routing. For more information, see Network DMZ between Azure and an
on-premises datacenter.
Encryption
Encrypt sensitive data at rest and use Azure Key Vault to manage the database encryption keys. Key Vault can store
encryption keys in hardware security modules (HSMs). For more information, see Configure Azure Key Vault
Integration for SQL Server on Azure VMs. It's also recommended to store application secrets, such as database
connection strings, in Key Vault.
DDoS protection
The Azure platform provides basic DDoS protection by default. This basic protection is targeted at protecting the
Azure infrastructure. Although basic DDoS protection is automatically enabled, we recommend using DDoS
Protection Standard. Standard protection uses adaptive tuning, based on your application's network traffic patterns,
to detect threats. This allows it to apply mitigations against DDoS attacks that might go unnoticed by the
infrastructure-wide DDoS policies. Standard protection also provides alerting, telemetry, and analytics through
Azure Monitor. For more information, see Azure DDoS Protection: Best practices and reference architectures.

Next steps
Microsoft Learn module: Tour the N-tier architecture style
Microsoft FSLogix for the enterprise

This article provides insights on designing, sizing, and implementing a Microsoft FSLogix Profile Container solution
for large enterprises, and shows how to avoid performance problems in production. This article is an
extension of the Windows Virtual Desktop at enterprise scale article.
FSLogix is a set of solutions that enhance, enable, and simplify non-persistent Windows computing environments.
FSLogix solutions are appropriate for virtual environments in both public and private clouds. These solutions may
also be used to create more portable computing sessions when using physical devices.
For combining FSLogix with Windows Virtual Desktop as a desktop virtualization solution on Azure, store your
profiles on either Azure Files or Azure NetApp Files as described in Storage options for FSLogix profile containers
in Windows Virtual Desktop. This way you can leverage another Azure platform service that requires zero
infrastructure, and simplify management of your storage environment.

FSLogix filter driver architecture


It's common to copy a profile to and from the network when a user signs in and out of a remote environment. Because user profiles can often be large, sign-in and sign-out times can become unacceptable. FSLogix Containers
redirect user profiles to a network location. Profiles are placed in VHDx files and mounted at run time. Mounting
and using a profile on the network eliminates delays often associated with solutions that copy files.
The conceptual architecture diagram below shows how FSLogix works within the operating system. The Windows
Service agent needs to be installed in the VDI image. Once the installation is completed, two filter drivers are
injected into the operating system. Thereafter, you can set appropriate registry (or ADMX) entries, to place a VHDx
container on a file system or SMB share location (Azure Storage as shown in the diagram below).

Profile Container and Office Container


FSLogix Profile Container and Office Container are the solutions provided by Microsoft to store roaming user
profiles in Windows Virtual Desktop.
The Office Container is a subset of Profile Container. Although all of the benefits of the Office Container are also
available in the Profile Container, there are times when it may be beneficial to use them together. It's important to
completely understand the configuration process, especially when using them together.
Profile Container is used to redirect the full user profile. Profile Container is used in non-persistent, virtual
environments, such as Virtual Desktops. When using Profile Container, the entire user profile is captured, except for
data that is excluded.
There are several reasons why Profile Container and Office Container may be used together. For more information,
read the comparison of Profile Container vs. Office Container.

NOTE
The recommendation in Windows Virtual Desktop is to use Profile Container without Office Container for a simpler and
therefore more robust solution.

Multiple profile connections


Concurrent or multiple connections refers to a user connected to multiple sessions, in either multiple hosts or the
same host, concurrently using the same profile. This should not be confused with the term multi-session, which
refers to an operating system that allows multiple users to connect simultaneously.

NOTE
Concurrent or multiple connections are discouraged in Windows Virtual Desktop. The best practice is to create a different
profile location for each session (that is, per host pool).

Be aware of the following limitations of a multiple connection deployment:


Microsoft OneDrive does not support multiple connection environments using any profile-roaming technology.
Using OneDrive with multiple profile connections may cause data loss.
Microsoft Outlook has limited support for multiple connections.
End users must be educated on what to expect. Using read-only containers results in behavior that a user, without proper context, may perceive as data loss. Read Concurrent Connections with Profile Container
and Office Container for more information.
Performance requirements
In terms of overall profile size, limitations or quotas for FSLogix depend on the storage type used for the user
profile VHDx files, as well as the size limitations of the VHD/VHDx format.
For network bandwidth, depending on the type of utilization, plan for between 5 and 30 Mbps per user.
Additionally, the following table gives an example of how many resources an FSLogix profile needs to support each
user. Requirements can vary widely depending on the user, applications, and activity on each profile, so your actual
usage may vary significantly from what is listed here. The table uses a single user as an example. Use it to estimate
requirements for the total number of users in your environment. For example, you may need around 1,000 IOPS
(input/output operations per second) for 100 users, and around 5,000 IOPS during sign-in and sign-out, if a large
number of users sign in during a short period of time, creating a login storm.

RESOURCE                  REQUIREMENT

Steady state IOPS         10

Sign-in/sign-out IOPS     50

Storage options for FSLogix profile containers


Azure offers multiple storage solutions that you can use to store your FSLogix profile container. We recommend
storing FSLogix profile containers on Azure Files or Azure NetApp Files for most customer scenarios. The article
Storage options for FSLogix profile containers in Windows Virtual Desktop compares the different managed
storage solutions Azure offers for Windows Virtual Desktop FSLogix user profile containers.
Storage spaces direct (S2D) is supported in conjunction with FSLogix and Windows Virtual Desktop as well. It is a
self-managed storage solution that is out of scope for this article. Customers can get most value out of either
Azure Files or Azure NetApp Files while simplifying management of Windows Virtual Desktop.

Best practices
The following are general best practices for FSLogix profile containers.
For optimal performance, the storage solution and the FSLogix profile container should be in the same data-
center location.
Exclude the VHD(X) files for profile containers from antivirus scanning, to avoid performance bottlenecks.
We recommend using a separate profile container per host pool when a user has two active sessions.
Azure Files best practices
The following list describes some important things to keep in mind when using Azure Files.
An Azure Files storage account name cannot be longer than 15 characters.
Azure Files permissions should match permissions described in Requirements - Profile Containers.
The storage account containing the master image must be in the same region and subscription as the virtual
machines (VMs) being provisioned.
Private link for Azure storage could be used to enable a more secure data access as well as to improve the
network latency from your session hosts to your storage account. This is also beneficial in hybrid scenarios with
ExpressRoute connectivity.
You can pre-provision space on your Azure Files Premium share to proactively accommodate more IOPS for your users. This allows you to use more IOPS during the initial user logins (see the example after this list).
The Azure Files Premium tier has a built-in bursting mechanism that gives you three times the baseline IOPS for the first 60 minutes of a session.
Azure Files sync can be used to replicate existing profile containers into Azure Files easily.
With the FSLogix ObjectSpecific per-group setting, you can map different groups of users to different Azure Files storage accounts to accommodate more users. The IOPS limit of a single storage account is not a ceiling for the solution, because you can stack multiple storage accounts. This applies to both personal and pooled host pool scenarios. The architecture diagram below explains this in more detail.
You are able to use multiple storage accounts in one Azure virtual network (VNet). For example, you can assign
different network shares to specific groups of users in your environment using AD groups.
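A hedged sketch of pre-provisioning a premium file share for profile containers, as mentioned in the list above; the account and share names, region, and the 1 TiB quota are placeholders, and the provisioned size drives the baseline IOPS.

# Premium FileStorage account (name stays within the 15-character limit).
az storage account create \
  --resource-group wvd-profiles-rg --name wvdprofiles01 \
  --sku Premium_LRS --kind FileStorage --location westeurope

# Provisioned 1 TiB SMB share for FSLogix profile containers.
az storage share-rm create \
  --resource-group wvd-profiles-rg --storage-account wvdprofiles01 \
  --name fslogix-profiles --quota 1024
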
The table below shows how you can further optimize your WVD environment. Detailed information about Azure
Files performance targets (Standard and Premium) and pricing is available to help you fine-tune your file-share
solution.

WORKLOAD    EXAMPLES                               RECOMMENDED TIER

Light       Data entry                             Less than 200 concurrent active users: Standard. More than 200 concurrent active users: Premium.

Medium      Casual users, LoB apps                 Premium file shares

Heavy       Software engineers, content creation   Premium file shares

Power       CAD, 3D, machine learning              Premium file shares

Azure NetApp Files best practices


Azure NetApp Files is a proven managed storage solution for FSLogix profiles and Windows Virtual Desktop. Its low latency and high IOPS make it a good fit for enterprises at scale.
Currently, up to 1,000 IP connections per active VNet are possible. These are the open connections from VMs over the VNet to the share; that is, the limit is counted per VM and not per session. VNet peerings count toward this limit as well. The following subsections can help you proactively design your environment. Read Benefits of using
Azure NetApp Files with Windows Virtual Desktop for more information.
Pooled scenarios
If the sizing for the D32as_v4 VM is calculated based on the WVD Windows 10 multi-session user-per-vCPU recommendations for a light or medium workload, more than 120,000 users would fit within 1,000 virtual machines before approaching the 1,000 IP limit, as shown in the following figure.

Personal scenarios
Users are mapped to specific desktop pods. Each pod has just under 1,000 virtual machines, leaving room for IP
addresses propagating from the management VNet. Azure NetApp Files can easily handle 900+ personal desktops
per single-session host pool. The actual number of virtual machines is equal to 1,000 minus the number of
management hosts found in the Hub VNet.

Storage permissions
The following NTFS permissions are recommended. For correct and secure use, permissions must allow users to create and use their own profile, while not allowing access to other users' profiles. The
Profile Container storage permissions can also be found in the article Configure storage permissions for use with
Profile Containers and Office Containers.

USER ACCOUNT               FOLDER                                PERMISSIONS

Users                      This Folder Only                      Modify

Creator/Owner              Subfolders and Files Only             Modify

Administrator (optional)   This Folder, Subfolders, and Files    Full Control

Storage exclusions
We recommend keeping native profile folder locations in the FSLogix profile container. However, in some scenarios it can be beneficial to exclude folders, for example, folders that hold temporary or cached data.
The sections below describe the common FSLogix exclusions.
Teams exclusions
Exclude the following from the Teams caching folder, %appdata%/Microsoft/Teams. Excluding these items helps
reduce the user caching size to further optimize your non-persistent setup.
.txt files
Media-stack folder
meeting-addin\Cache (%appdata%\Microsoft\Teams\meeting-addin\Cache)
Check the FSLogix exclusions documentation to learn how to configure the Teams-specific exclusions above within
FSLogix Profile Container.
Antivirus exclusions
Make sure to configure the following antivirus exclusions for FSLogix Profile Container virtual hard drives, and verify this information with your security team.
Exclude files:
%ProgramFiles%\FSLogix\Apps\frxdrv.sys
%ProgramFiles%\FSLogix\Apps\frxdrvvt.sys
%ProgramFiles%\FSLogix\Apps\frxccd.sys
%TEMP%\*.VHD
%TEMP%\*.VHDX
%Windir%\TEMP\*.VHD
%Windir%\TEMP\*.VHDX
\\storageaccount.file.core.windows.net\share\*\*.VHD
\\storageaccount.file.core.windows.net\share\*\*.VHDX
Exclude processes:
%ProgramFiles%\FSLogix\Apps\frxccd.exe
%ProgramFiles%\FSLogix\Apps\frxccds.exe
%ProgramFiles%\FSLogix\Apps\frxsvc.exe

Using Cloud Cache


Cloud Cache is an add-on to FSLogix. It uses a local cache to service all reads from a redirected Profile or Office
Container, after the first read. Cloud Cache also allows the use of multiple remote locations, which are all
continuously updated during the user session, creating true real-time profile replication. Using Cloud Cache can
insulate users from short-term loss of connectivity to remote profile containers as the local cache is able to service
many profile operations. In case of a provider failure, Cloud Cache provides business continuity.
Because the local cache file will service most IO requests, the performance of the local cache file will determine the
user experience. It is critical that the storage used for this file be high-performing and highly available. Any storage
used for the local cache file should either be a physically attached storage, or have reliability and performance
characteristics that meet or exceed a high-performing physically attached storage.
Cloud Cache is only one of many options that may be considered for business continuity when using profile
containers. Cloud Cache provides real-time duplication of the user profile, that will actively fail-over if connectivity
to a Cloud Cache provider is lost.
There are a number of considerations when implementing Cloud Cache. Cloud Cache:
provides real-time profile high availability,
insulates users from short-term connectivity issues to remote Cloud Cache providers by servicing reads and writes from the local cache,
enables the use of Azure page blob storage via REST API,
can be an effective profile tool when configuring multi-geo environments,
requires high-performing and highly available host-attached (or equivalent) storage to support the local cache, and
is executed on the host, utilizing processor, memory, network, and storage resources.
When used as a solution for high availability, Cloud Cache requires multiple, full copies of the user profile, in
addition to the local cache file.
Because of the resource utilization, it may be more cost effective to consider alternate backup/disaster recovery
solutions for FSLogix profile containers. Cloud Cache is generally used when one of its features provides unique
value, such as real-time profile high availability. If an environment can be adequately serviced with an alternate
form of backup, it is often more economical than Cloud Cache.

Disaster recovery
Large multi-national Windows Virtual Desktop deployments could require availability of the users profile across
different regions. So it's important to have the same profile available in the Azure region of the host pool.
Azure Files offers the replication option of a storage account fail-over against the other region configured in
your storage account redundancy plan. This is only supported for the standard storage account type using
Geo-Redundant Storage (GRS). Other options are AzCopy or any other file copy mechanism such as
RoboCopy.
Azure NetApp Files offers cross-region replication. With this feature, you are able to replicate your FSLogix
file share to another region over the Azure backbone.

Backup and restore


Making regular backups of your Windows profiles can be beneficial in different situations. Scenarios such as a user accidentally removing data that needs to be restored, or profile or data corruption due to malware infections, are good reasons to keep your data safely stored on Azure.
Azure Files Premium tier integrates with Azure Backup and is supported in conjunction with FSLogix. Azure
NetApp Files offers a similar snapshot mechanism to make copies of your FSLogix profile containers.

Maintenance
As explained earlier in this article, FSLogix Profile Container works with virtual disks. The virtual hard disks are in
the VHD or VHDx file format; both are supported with this tool. By default, the disks created will be in a
Dynamically Expanding format rather than Fixed format. This can result in situations where the size of the disk is larger than the actual files inside the virtual disk. It's therefore common to perform maintenance by shrinking the virtual hard disks on a schedule, for example, monthly. This can be done with a script that reduces the size of your profile disks. The script can be designed to work at enterprise scale, reducing the size of thousands of disks in the shortest time possible. It can be run from any machine in your environment. It does not need to be
run from a file server hosting the disks. It does not need the Hyper-V role installed.
This tool is multi-threaded and will take advantage of multiple CPU cores on the machine from which you run the
script. It is not advised to run more than twice as many threads as there are available cores on your machine. You
could also use the number of threads to throttle the load on your storage.
Download the tool here.

NOTE
This script is not supported by the Microsoft product group. It is community driven. This script does not support reducing
the size of a Fixed file format.

Best practice settings for enterprises


The following settings are commonly used by our customers in their desktop virtualization environments, on top
of the default settings such as Enabled and VHDLocations.
SETTING                                VALUE    REASON

DeleteLocalProfileWhenVHDShouldApply   1        Avoids errors when signing in with an existing local profile. It removes the local profile first, if one already exists.

SizeInMBs                              30000    Specifies the size of a newly created VHD(X) in MB. The default value is 30000 MB (30 GB).

VolumeType                             VHDx     Provides more capabilities for PowerShell and maintenance.

FlipFlopProfileDirectoryName           1        Makes it easier to find a specific user's profile container folder on the network share.

Next steps
Read the following articles for more information:
Windows Virtual Desktop documentation.
FSLogix documentation.
Storage options for FSLogix profile containers in Windows Virtual Desktop.
Windows Virtual Desktop for the enterprise.
Manage configurations for Azure Arc enabled servers

This reference architecture illustrates how Azure Arc enables you to manage, govern, and secure servers across on-
premises, multiple cloud, and edge scenarios.

Download a Visio file of this architecture.


Typical uses for this architecture include:
Organize, govern, and inventory large groups of virtual machines (VMs) and servers across multiple
environments.
Enforce organization standards and assess compliance at scale for all your resources anywhere with Azure
Policy.
Easily deploy extensions to Azure VMs and Arc enabled servers.
Configure and enforce Azure Policy for VMs and servers hosted across multiple environments.

Architecture
The architecture consists of the following components:
Azure Arc . Azure Arc enables you to connect Azure to your Windows and Linux machines hosted outside of
Azure on your corporate network. When a server is connected to Azure, it becomes an Arc enabled server and is
treated as a resource in Azure. Each Arc enabled server has a Resource ID, a managed system identity, and is
managed as part of a resource group inside a subscription. Arc enabled servers benefit from standard Azure
constructs such as inventory, policy, tags, and Azure Lighthouse.
Azure VMs . Azure VMs include Windows or Linux VMs hosted in Azure.
Azure Policy Guest Configuration . Azure Policy Guest Configuration can audit operating systems and
machine configuration both for machines running in Azure and Arc enabled servers running on-premises or in
other clouds.
Azure Monitor . Azure Monitor enables you to track performance and events for systems running in Azure, on-
premises, or in other clouds.

Recommendations
The following recommendations apply for most scenarios. Follow these recommendations unless you have a
specific requirement that overrides them.
Connecting machines to Azure Arc
You can connect any other physical or virtual machine running Windows or Linux to Azure Arc. To use Azure Arc to
connect the machine to Azure, you need to install the Azure Connected Machine agent on each machine that you
plan to connect using Azure Arc.
Once configured, the Connected Machine agent sends a regular heartbeat message every five minutes to Azure.
If the heartbeat is not received, Azure assigns the machine an Offline status, which is reflected in the portal within
15 to 30 minutes. Upon receiving a subsequent heartbeat message from the Connected Machine agent, its status
will automatically change to Connected.
Manual installation
You can enable Azure Arc for servers for one or a few Windows or Linux machines in your environment by using
the Windows Admin Center tool set or by performing a set of steps manually. You can download the Windows
agent Windows Installer package from the Microsoft Download Center or the Linux agent from Microsoft's package
repository.
For more information, refer to Overview of Azure Arc for servers agent.
Script-based installation
You can perform automated agent installation by running a template script that you download from the Azure
portal. This script automates the download and installation of both agents.
This method requires that you have administrator permissions on the machine to install and configure the agent.
You can use the root account on Linux, or on Windows, you can join the local administrators group.
Connect machines at scale using a service principal
To connect the machines to Azure Arc for servers, you can use an Azure Active Directory service principal instead of
using your privileged identity to interactively connect the machine. A service principal is a special, limited
management identity that has the minimum permission necessary to connect machines to Azure using the
azcmagent command. Using a service principal is safer than using a higher privileged account like a Tenant
Administrator and follows access control security best practices. The service principal is only used during
onboarding. For more information, refer to Connect hybrid machines to Azure at scale.
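A hedged sketch of onboarding with a service principal, run on each machine after installing the Connected Machine agent; all IDs, the resource group, and the region are placeholders for your own values.

azcmagent connect \
  --service-principal-id "<appId>" \
  --service-principal-secret "<secret>" \
  --tenant-id "<tenantId>" \
  --subscription-id "<subscriptionId>" \
  --resource-group "arc-servers-rg" \
  --location "westeurope"
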
Installation using Windows PowerShell DSC
You can automate agent installation and configuration for a Windows computer by using Windows PowerShell
Desired State Configuration (DSC), Windows PowerShell, the AzureConnectedMachine DSC Module and a service
principal for onboarding. For more information, refer to How to install the Connected Machine agent using
Windows PowerShell DSC.
Manage VM extensions
Azure Arc for servers enables you to deploy Azure VM extensions to non-Azure Windows and Linux VMs, providing
a consistent extension management experience between Azure and non-Azure VMs. VM extensions allow you to:
Use Azure Automation State Configuration to centrally store configurations and maintain the desired state of Arc
enabled servers through the DSC VM extension.
Collect log data for analysis with Azure Monitor Logs enabled through the Log Analytics agent VM extension.
Analyze the performance of your Windows and Linux VMs using Azure Monitor and monitor their processes and
dependencies on other resources and external processes.
Download and execute scripts on Arc enabled servers using the Custom Script extension.
For more information, refer to Virtual machine extension management with Azure Arc for servers.
Implement Azure Policy Guest Configuration
Azure Policy Guest Configuration can audit settings inside a machine, both for machines running in Azure and Arc
enabled servers. For example, you can audit settings such as:
Operating system configuration
Application configuration or presence
Environment settings
There are several Azure Policy built-in definitions for Azure Arc. These policies provide auditing and configuration
settings for both Windows and Linux-based machines.
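A minimal sketch of assigning one of these definitions at resource-group scope; the assignment name, scope, and definition identifier are placeholders. Definitions that remediate (rather than audit) also require a managed identity and a location, which are omitted here.

az policy assignment create \
  --name audit-arc-guest-config \
  --scope "/subscriptions/<subscriptionId>/resourceGroups/arc-servers-rg" \
  --policy "<builtInPolicyDefinitionNameOrId>"
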
Implement Update Management
You can perform update management for Arc enabled servers. Update management in Azure Automation enables
you to manage operating system updates and quickly assess the status of available updates on all agent machines.
You can also manage the process of installing required updates for servers.
Implement Change Tracking and Inventory
You can use Azure Automation Change Tracking and Inventory for Arc enabled servers to determine what software
is installed in your environment. You can collect and observe inventory for software, files, Linux daemons, Windows
services, and Windows Registry keys on your computers. Tracking the configurations of your machines can help
you pinpoint operational issues across your environment and better understand the state of your machines.
Implement Azure Monitor
You can use Azure Monitor to monitor your VMs, virtual machine scale sets, and Azure Arc machines at scale. Azure
Monitor analyzes the performance and health of your Windows and Linux VMs and monitors their processes and
dependencies on other resources and external processes. It includes support for monitoring performance and
application dependencies for VMs that are hosted on-premises or in another cloud provider.
Implement Azure Sentinel
You can use Azure Sentinel to deliver intelligent security analytics and threat intelligence across the enterprise,
providing a single solution for alert detection, threat visibility, proactive hunting, and threat response. Azure
Sentinel is a scalable, cloud-native, security information event management (SIEM) and security orchestration
automated response (SOAR) solution that enables several scenarios including:
Collect data at cloud scale across all users, devices, applications, and infrastructure, both on-premises and in
multiple clouds.
Detect previously undetected threats and minimize false positives.
Investigate threats with artificial intelligence and hunt for suspicious activities at scale.
Respond to incidents rapidly with built-in orchestration and automation of common tasks.
Topology and network considerations
The Connected Machine agent for Linux and Windows communicates outbound securely to Azure Arc over TCP port
443. If the machine connects through a firewall or proxy server to communicate over the internet, review the
required URLs and service tags found on the Azure Arc Agent Networking configuration page.

Availability considerations
In most cases, the location you select when you create the installation script should be the Azure region
geographically closest to your machine's location. The rest of the data will be stored within the Azure geography
containing the region you specify, which might also affect your choice of region if you have data residency
requirements. If an outage affects the Azure region to which your machine is connected, the outage will not
affect the Arc enabled server, but management operations using Azure might not be able to complete. For
resilience in the event of a regional outage, if you have multiple locations which provide a geographically-
redundant service, it's best to connect the machines in each location to a different Azure region.
Ensure that Azure Arc is supported in your regions by checking supported regions.
Ensure that services referenced in the Architecture section are supported in the region to which Azure Arc is
deployed.

Manageability considerations
Consult the list of supported operating systems on the Azure Arc agent overview page.
Before configuring your machines with Azure Arc for servers, you should review the Azure Resource Manager
subscription limits and resource group limits to plan for the number of machines to be connected.

Security considerations
Appropriate role-based access control (RBAC) access should be managed for Arc enabled servers. To onboard
machines, you must be a member of the Azure Connected Machine Onboarding role. To read, modify, re-
onboard, and delete a machine, you must be a member of the Azure Connected Machine Resource
Administrator role.
You can use Azure Policy to manage security policies across your Azure VMs and Arc enabled servers including
implementing security policies in Azure Security Center. A security policy defines the desired configuration of
your workloads and helps ensure you're complying with the security requirements of your company or
regulators. Security Center policies are based on policy initiatives created in Azure Policy.

Cost considerations
Use the Azure pricing calculator to estimate costs.
Other considerations are described in the Principles of cost optimization section in the Microsoft Azure Well-
Architected Framework.
Azure Automation in a hybrid environment

This reference architecture illustrates how to extend automation to on-premises or other cloud providers. It
describes the services that must be deployed in Azure to provide automated management and configuration across
on-premises or other cloud providers. The same architecture can be applied on Azure virtual machines (VMs) that
reside behind a firewall, with outbound connectivity over TCP port 443.

Download a Visio file of this architecture.


Typical uses for this architecture include:
Automated management and configuration across Azure, on-premises, or other cloud providers.
Automation of Azure virtual machines (VMs) that reside behind a firewall, with outbound connectivity over the
443 TCP port.

Architecture
The architecture consists of the following components:
Log Analytics Workspace: A Log Analytics workspace is a data repository for log data collected from
resources that run in Azure, on-premises or in another cloud provider.
Automation Hybrid Worker solution: With this, you can create Hybrid Runbook Workers to run Azure
Automation runbooks on your Azure and non-Azure computers.
Automation Account: A cloud service that automates configuration and management across your Azure and
non-Azure environments.
Hybrid Runbook Worker: A computer that is configured with the Hybrid Runbook Worker feature and can
execute runbooks directly on the computer and against the resources in the local environment.
Hybrid Runbook Worker Group: Groups multiple Hybrid Runbook Workers for higher availability and scale to
run a set of runbooks.
Runbook: A collection of one or more linked activities that together automate a process or operation.
On-premises machines and VMs: On-premises computers and VMs with a Windows or Linux operating
system, hosted in a private local-area network.

Recommendations
The following recommendations apply to most scenarios. Follow these recommendations unless you have a specific
requirement that overrides them.
The following steps highlight the actual implementation:
Create a Log Analytics Workspace
Add an Automation Hybrid Worker Solution
Create an Automation Account
Link an Automation Account with Log Analytics Workspace
Deploy a Log Analytics agent and connect to a Log Analytics Workspace
Deploy a Hybrid Runbook Worker Group and Hybrid Runbook Worker on an on-premises Windows computer
(optional Linux VM)
Create a Runbook in Azure Automation
Create a Run As account for authentication (if applicable)
Deploy a Runbook on a Hybrid Runbook Worker Group
Create a Log Analytics Workspace
Before you create a Log Analytics Workspace, ensure that you have at least Log Analytics Contributor role
permissions. An Azure subscription can contain more than one Log Analytics Workspace, for data isolation or to
store data in a specific geographic location, but the Log Analytics agent can be configured to report to one Log
Analytics Workspace. For more information, review the Azure Monitor Log design guidance before you create the workspace.
Use the following steps to create a Log Analytics Workspace in the Azure portal (a scripted PowerShell alternative is shown after the steps):
1. Sign in to the Azure portal at https://portal.azure.com.
2. In the Azure portal, select Create a Resource. In Search the Marketplace, enter Log Analytics. As you
begin entering, the list filters based on your input. Select Log Analytics Workspaces.
3. Select Create , and then select choices for the following items:
a. Select a Subscription to link to by selecting from the drop-down list if the default selected is not
appropriate.
b. For Resource Group , use an existing resource group already set up or create a new one.
c. Provide a unique name for the new Log Analytics Workspace, such as HybridWorkspace-yourname.
d. Select the Location for your deployment.
e. Select Pricing Tier to proceed to further customization.
f. If you're creating a workspace in a subscription created after April 2, 2018, it'll automatically use the Per
GB pricing plan. The option to select a pricing tier won't be available. If you're creating a workspace for an
existing subscription created before April 2, 2018 or to a subscription that was tied to an existing
Enterprise Agreement enrollment, select your preferred pricing tier. For additional information about the
particular tiers, refer to Log Analytics Pricing Details.
g. Select Tags and optionally provide name/value for categorization of the resources.
h. Select Review + Create .
4. After providing the required information on the Log Analytics Workspace pane, select Create .
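As a scripted alternative to the portal steps, the following PowerShell sketch creates the resource group and workspace by using the Az.OperationalInsights module. The names, region, and pricing tier are examples only.

# Sketch: create a resource group and a Log Analytics Workspace.
New-AzResourceGroup -Name 'rg-hybrid-automation' -Location 'eastus'
New-AzOperationalInsightsWorkspace -ResourceGroupName 'rg-hybrid-automation' `
    -Name 'HybridWorkspace-yourname' `
    -Location 'eastus' `
    -Sku 'PerGB2018'
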
Add an Automation Hybrid Worker Solution
Next, prepare the Log Analytics Workspace with the necessary components required for the Hybrid Runbook
Worker. Use the following steps to add the Automation Hybrid Worker solution (see the PowerShell sketch after these steps for a scripted equivalent):
1. In the Azure portal, select Create a Resource .
2. In the Search the Marketplace, enter Automation Hybrid Worker. As you begin entering, the list filters based
on your input. Select Automation Hybrid Worker.
3. Select Create , and then select the Log Analytics Workspace that you created in the previous step. For
example, HybridWorkspace-yourname.
4. After providing the required information on the Automation Hybrid Worker pane, select Create .
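The same result can be achieved with PowerShell by enabling the Azure Automation solution, which contains the Hybrid Runbook Worker components, in the workspace. This is a sketch; the intelligence pack name "AzureAutomation" and the workspace name are assumptions to adjust for your environment.

# Sketch: enable the Azure Automation (Hybrid Worker) solution in the workspace.
Set-AzOperationalInsightsIntelligencePack -ResourceGroupName 'rg-hybrid-automation' `
    -WorkspaceName 'HybridWorkspace-yourname' `
    -IntelligencePackName 'AzureAutomation' `
    -Enabled $true
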
Create an Automation Account
When the Automation Hybrid Worker solution has been added to the Log Analytics Workspace, proceed with creating the
Azure Automation Account. It's important that you create the Automation Account in the same region and preferably
in the same resource group as the Log Analytics Workspace.
Use the following steps to create the Automation Account (a scripted equivalent follows the steps):
1. In the Azure portal, select Create a Resource .
2. In the Search the Marketplace, enter Automation. As you begin, the list filters based on your input. Select
Automation , and then select Create.
3. Select Create and then select choices for the following items:
Provide the Name for the Automation Account, such as hybrid-auto.
Select a Subscription to link to by selecting from the drop-down list if the default selected is not
appropriate.
For Resource Group , choose the same resource group in which you've created the Log Analytics
Workspace.
Select the Location to be the same as the Log Analytics Workspace.
Create Azure Run As account is optional. This only provides authentication with Azure to manage
Azure resources from Automation runbooks.
4. After providing the required information on the Add Automation Account pane, select Create .
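A scripted equivalent (a sketch; names and region are examples) creates the Automation Account in the same resource group and region as the workspace:

# Sketch: create the Automation Account alongside the Log Analytics Workspace.
New-AzAutomationAccount -ResourceGroupName 'rg-hybrid-automation' `
    -Name 'hybrid-auto' `
    -Location 'eastus'
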
Link an Automation Account with Log Analytics Workspace
Automation accounts use the components of Hybrid Runbook Worker that are deployed in Log Analytics
Workspace. Integrate those services before you deploy a Log Analytics agent on an on-premises machine. If you
plan to use the same Automation Account for Update Management and Change Tracking, you must map the Log
Analytics Workspace and Automation Account. Currently, mappings between Log Analytics Workspace and
Automation Account are supported in several regions. For further information, refer to Supported regions for
linked Log Analytics workspace.
Use the following steps to link the Automation Account with Log Analytics Workspace:
1. In the Azure portal, select All services, and then enter automation. As you begin entering, the list filters based
on your input. Select Automation Account, and then select your Automation Account created in the previous
step.
2. In the Automation Account pane, in the Update Management section, select Update Management.
3. In the Update Management pane, select choices for the following items:
Select a Subscription to link to by selecting from the drop-down list if the default selected is not
appropriate.
For Log Analytics workspace, select the Log Analytics Workspace that you created. For example,
HybridWorkspace-yourname.
4. After providing the required information in the Update Management pane, select Enable.
Deploy a Log Analytics agent and connect to a Log Analytics Workspace
Deploying a Hybrid Runbook Worker component is part of the deployment of Log Analytics Agent.
If you test the solution by using an Azure VM, install the Log Analytics agent and enroll the VM into an existing Log
Analytics Workspace by using the VM extension, which is available for both Linux and Windows. Alternatively, deploy
the agent by using Azure Automation Desired State Configuration (DSC), a PowerShell script, or a Resource Manager
template for VMs. For more information, refer to the article Connect Windows computers to Azure Monitor.
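For the Azure VM test scenario, the following sketch installs the Log Analytics agent by using the VM extension. The extension publisher and type shown are the ones used for the Windows agent; the resource names, workspace ID, and key are placeholders.

# Sketch: install the Log Analytics (Microsoft Monitoring Agent) VM extension on an Azure Windows VM.
Set-AzVMExtension -ResourceGroupName 'rg-hybrid-automation' `
    -VMName 'test-vm-01' `
    -Name 'MicrosoftMonitoringAgent' `
    -Publisher 'Microsoft.EnterpriseCloud.Monitoring' `
    -ExtensionType 'MicrosoftMonitoringAgent' `
    -TypeHandlerVersion '1.0' `
    -Location 'eastus' `
    -Settings @{ workspaceId = '<workspace-id>' } `
    -ProtectedSettings @{ workspaceKey = '<primary-key>' }
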
For non-Azure machines, deploy the agent on Windows and Linux computers, physical or virtual, by using a manual
or automated process.
For Windows machines, configure the agent to communicate with the Log Analytics Workspace by using the TLS 1.2 protocol.
The deployment procedure is explained in detail in the article Connect Windows computers to Azure Monitor.
The Log Analytics agent for Linux can be deployed:
Manually, by using a shell script bundle that contains Debian and Red Hat Package Manager (RPM) packages for
each of the agent components. This approach is recommended when the Linux machine doesn't have internet
connectivity and will communicate with the Log Analytics service through the Log Analytics gateway.
By using a wrapper script hosted on GitHub, when the computer has connectivity to the internet.
The Log Analytics agent must be configured to communicate with the Log Analytics Workspace by using the workspace ID
and key of the Log Analytics Workspace.
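You can retrieve the workspace ID and primary key with PowerShell instead of copying them from the portal. This is a sketch that assumes the Az.OperationalInsights module; in older module versions the key cmdlet may be named Get-AzOperationalInsightsWorkspaceSharedKeys.

# Sketch: read the workspace ID and primary key used to register the agent.
$workspace = Get-AzOperationalInsightsWorkspace -ResourceGroupName 'rg-hybrid-automation' -Name 'HybridWorkspace-yourname'
$keys = Get-AzOperationalInsightsWorkspaceSharedKey -ResourceGroupName 'rg-hybrid-automation' -Name 'HybridWorkspace-yourname'
$workspace.CustomerId    # the workspace ID
$keys.PrimarySharedKey   # the primary key
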
Use the following steps to deploy Log Analytics Agent and connect to Log Analytics Workspace:
1. In the Azure portal, search for and select Log Analytics Workspaces .
2. In your list of Log Analytics Workspaces, select the workspace you intend to configure the agent to report to.
3. Select Agents management .
4. Copy the Workspace ID and Primary Key, and paste them into your favorite editor.
5. In your Log Analytics Workspace, from the Windows Servers page you navigated to earlier, select the
appropriate Download Windows Agent version, depending on the processor architecture of the
Windows operating system.
6. Run Setup to install the agent on your computer.
7. On the Welcome page, select Next .
8. On the License Terms page, read the license, and then select I Agree .
9. On the Destination Folder page, change or keep the default installation folder, and then select Next .
10. On the Agent Setup Options page, choose to connect the agent to Azure Log Analytics, and then select Next .
11. On the Azure Log Analytics page, perform the following:
Paste the Workspace ID and Workspace Key (Primary Key) that you copied earlier. If the computer
should report to a Log Analytics Workspace in the Azure Government cloud, select Azure US Government
from the Azure Cloud drop-down list.
If the computer needs to communicate through a proxy server to the Log Analytics service, select
Advanced and provide the URL and port number of the proxy server. If your proxy server requires
authentication, enter the username and password to authenticate with the proxy server, and then select
Next .
12. Select Next when you've completed providing the necessary configuration settings.
Deploy Hybrid Runbook Worker Group and Hybrid Runbook Worker on an on-premises Windows machine
(optional Linux VM)
The Hybrid Runbook Worker role requires the Log Analytics agent for the supported operating system.
For the Windows operating system, refer to the Windows Hybrid Runbook Worker prerequisites.
For the Linux operating system, refer to the Linux Hybrid Runbook Worker prerequisites.
You can deploy the Hybrid Worker role on a Windows machine by using either automated or manual deployment.
For automated deployment, Microsoft provides the PowerShell script New-OnPremiseHybridWorker.ps1, which can
be downloaded from the PowerShell Gallery.
For manual deployment, the log analytics agent will download the necessary components that are required for the
Hybrid Runbook worker from the Log Analytics Workspace. The link that exists between the Log Analytics
Workspace and the Azure Automation account will push down the HybridRegistration PowerShell module, which
contains the Add-HybridRunbookWorker cmdlet.
Use the following procedure to manually deploy Hybrid Runbook Worker Group and Hybrid Runbook Worker on
an on-premises Windows machine:
1. In the Azure portal, search for and select Automation Account .
2. In your list of Automation Accounts, select the Automation Account you intend to configure the agent to report
to.
3. In the Account Settings section, select Keys .
4. Copy the URL and Primary access key, and paste them into your favorite editor.
5. On the Windows machine, open a PowerShell session in Administrator mode, and then execute the
following commands to import the module:

cd "C:\Program Files\Microsoft Monitoring Agent\Agent\AzureAutomation\\<version>\HybridRegistration"

Import-Module .\HybridRegistration.psd1

6. Now execute the Add-HybridRunbookWorker cmdlet by using the following syntax:

Add-HybridRunbookWorker -GroupName <String> -Url <Url> -Key <String>

NOTE
For the URL, use the previously recorded URL, and for the Key, use the previously copied Primary access key.

Create a Runbook in Azure Automation


To manage resources on a local computer or against resources in the local environment where the hybrid worker is
deployed, you must create a Runbook. Add a Runbook to Azure Automation by either creating a new one or
importing an existing one from a file or the Runbook Gallery.

NOTE
When the Hybrid Runbook Worker host machine reboots, any open runbook job restarts from the beginning, or from the last
checkpoint for PowerShell Workflow runbooks. This occurs a maximum of three times, after which the job is suspended.

Use the following steps to create or import a runbook in Azure Automation (a scripted example follows the steps):


1. In the Azure portal, search for and then select Automation Account .
2. In your list of Automation Accounts, select the Automation Account you intend to configure the agent to report
to.
3. In the Process Automation section, select Runbooks.
4. Select Create a runbook or Import a runbook to configure the automation task that will run on on-premises machines.
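As a scripted equivalent (a sketch), the following commands import a local PowerShell script as a runbook and publish it. The file and runbook names are examples; the runbook name is derived from the script file name.

# Sketch: import and publish a runbook from a local script file.
Import-AzAutomationRunbook -ResourceGroupName 'rg-hybrid-automation' `
    -AutomationAccountName 'hybrid-auto' `
    -Path '.\Restart-LocalService.ps1' `
    -Type PowerShell
Publish-AzAutomationRunbook -ResourceGroupName 'rg-hybrid-automation' `
    -AutomationAccountName 'hybrid-auto' `
    -Name 'Restart-LocalService'
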
Create a Run As account for authentication (if applicable)
Hybrid Runbook Workers on Azure VMs can use managed identities from Azure Active Directory to authenticate to
Azure resources.
By default, a runbook that creates jobs on a Hybrid Runbook Worker operates under the local System account on
Windows or the nxautomation account on Linux.
To access local resources by using different authentication, specify a Run As account for a Hybrid Runbook Worker
group. The Run As account is defined with a credential asset that has sufficient permission to access the local
resources.
Use the following steps to create a Run As account for authentication (a sketch for creating the credential asset with PowerShell follows the steps):
1. In the Azure portal, search for and then select Automation Account .
2. In your list of Automation Accounts, select the Automation Account you created previously.
3. In the Shared Resources section, select Credentials .
4. Select Add a credential to create a credential asset with access to local resources.
5. In the Automation account pane in the Process Automation section, select Hybrid Worker Groups , and then
select the specific group.
6. In the Hybrid Worker group settings, select Hybrid worker group settings .
7. Change the value of Run As from Default to Custom .
8. Select the Run As credential created before, and then select Save .
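The credential asset referenced in step 4 can also be created with PowerShell, as in this sketch (the credential name is an example; the prompt captures the local account user name and password):

# Sketch: create a credential asset for use as the Hybrid Worker group Run As account.
$cred = Get-Credential
New-AzAutomationCredential -ResourceGroupName 'rg-hybrid-automation' `
    -AutomationAccountName 'hybrid-auto' `
    -Name 'OnPremRunAsCredential' `
    -Value $cred
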
Deploy a Runbook on a Hybrid Runbook Worker Group
The final step is to deploy a runbook to execute on a Hybrid Runbook Worker Group. The runbook must be
published and started using one of the following methods:
Azure portal
PowerShell
Azure Automation API
Webhooks
Schedule
Respond to Azure Alert
From another Runbook
Refer to the following article Start a runbook in Azure Automation to determine the method to start a runbook in
Azure Automation.
Test the runbook while it's in a draft version, but be aware that a draft runbook still executes normally and performs
actions against any resources in the environment.
To test and deploy the runbook on a Hybrid Runbook Worker Group, use the following steps (a PowerShell command for starting the runbook on the worker group follows the steps):
1. In the Azure portal, search for and then select Automation Account .
2. In your list of Automation Accounts, select the Automation Account you have created previously.
3. In the Automation account pane in the Process Automation section, select Runbooks.
4. Select your runbook created before and select Edit.
5. In the edit runbook, select the Test Pane.
6. In the Test Pane, change the value of Run on from Azure to Hybrid Worker .
7. In the Choose Hybrid Worker group, select your group created in the previous step.
8. Start the test to observe the result of the runbook.
9. Close the Test Pane to return to the Edit section.
10. Select Publish to save the final version of the runbook.
11. In the Runbooks pane, select Link to schedule.
12. In the Schedule pane, create a new schedule or link an existing one to define when the runbook starts.
13. In the Schedule Runbook pane, select Parameters and Run Settings, and then change the value of Run On
from Azure to Hybrid Worker.
14. In the Choose Hybrid Worker group, select your group created in the previous step.
15. Confirm the choices by selecting OK to finish publishing the runbook on the Hybrid Runbook Worker.
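Alternatively, a published runbook can be started on the Hybrid Runbook Worker Group from PowerShell. This sketch assumes the example names used earlier:

# Sketch: start the runbook on the Hybrid Runbook Worker group instead of an Azure sandbox.
Start-AzAutomationRunbook -ResourceGroupName 'rg-hybrid-automation' `
    -AutomationAccountName 'hybrid-auto' `
    -Name 'Restart-LocalService' `
    -RunOn 'OnPremWorkerGroup'
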

Scalability considerations
The Log Analytics agent for Windows and Linux has minimal impact on machine performance.
Scale up your Hybrid Runbook Workers by running them on more powerful machines with more memory,
CPU, and IOPS.
You can meet increased demand for processing large numbers of jobs by organizing multiple Hybrid
Runbook Workers into Hybrid Worker Groups. Runbooks are executed on each Hybrid Runbook Worker by
using a queuing mechanism. The Hybrid Runbook Worker checks the Automation Account once every 30
seconds and picks up four jobs to execute.
Multiple Hybrid Worker Groups can execute runbook automation tasks by using different Run As accounts.

Availability considerations
Currently, mappings between Log Analytics Workspace and Automation Account are supported in several
regions. For further information, refer to Supported regions for linked Log Analytics workspace.
A Hybrid Runbook Worker Group with more than one machine configured with Hybrid Worker Role provides
high availability because runbooks will start only on servers that are running and healthy.

Manageability considerations
To accelerate deployment of the Log Analytics agent with the Hybrid Runbook Worker role on a Windows machine,
use the PowerShell script New-OnPremiseHybridWorker.ps1. The script performs the following steps:
Installs the necessary modules
Signs in with your Azure account
Verifies the existence of a specified resource group and Automation Account
Creates references to Automation account attributes
Creates an Azure Monitor Log Analytics Workspace if not specified
Enables the Azure Automation solution in the workspace
Downloads and installs the Log Analytics Agent for Windows
Registers the machine as a Hybrid Runbook Worker
Deployment of many agents across on-premises infrastructure can be orchestrated by using command-line scripts and
deployed by using Group Policy or System Center Configuration Manager.

Security considerations
Encryption of sensitive assets in Automation: An Azure Automation Account can contain sensitive assets such as
credentials, certificate, connection, and encrypted variables that might be used by the runbooks. Each secure
asset is encrypted by default using a Data Encryption key that is generated for each Automation Account. These
keys are encrypted and stored in Azure Automation with an Account Encryption Key (AEK) that can be stored in
the Key Vault for customers who want to manage encryption with their own keys. By default, AEK is encrypted
using Microsoft-managed Keys. Use the following guidelines to apply encryption of secure assets in Azure
Automation.
Runbook permission: By default, runbook permissions for a Hybrid Runbook Worker run in a system context on
the machine where they're deployed. A runbook provides its own authentication to local resources.
Authentication can be configured using managed identities for Azure resources or by specifying a Run As
account to provide a user context for all runbooks.
Network planning: Hybrid Runbook Worker requires outbound internet access over TCP port 443 to
communicate with Automation. For computers with restricted internet access, use Log Analytics gateway to
configure communication with Azure Automation and Azure Log Analytics Workspace.
Azure Security baseline for Automation: The Azure security baseline for Automation contains recommendations
on how to increase overall security configuration to protect your asset following best-practice guidance.

DevOps considerations
Azure Automation integrates with popular source control systems such as Azure DevOps and GitHub. With
source control, you can integrate an existing development environment that contains your scripts and custom
code that has been previously tested in an isolated environment.
For information on how to integrate Azure Automation with your source control environment, refer to Use
source control integration.

Cost considerations
Use the Azure pricing calculator to estimate costs. Pricing models for Azure Automation are explained here.
Azure Automation costs are priced for job execution per minute or for configuration management per node.
Every month, the first 500 minutes of process automation and configuration management on five nodes are
free.
A Log Analytics Workspace might generate additional costs related to the amount of log data stored in Azure
Log Analytics. The pricing model is based on consumption, with costs associated with data ingestion and data
retention. For ingesting data into Azure Log Analytics, use the Capacity Reservation or Pay-As-You-Go model,
which includes 5 gigabytes (GB) free per billing account per month. Data retention for the first 31 days is free of
charge.
Use the Azure pricing calculator to estimate costs. Pricing models for Log Analytics are explained here.
Using Azure file shares in a hybrid environment

This architecture shows how to include Azure file shares in your hybrid environment. Azure file shares are used as
serverless file shares. By integrating them with Active Directory Domain Services (AD DS), you can control and
limit access to AD DS users, and Azure file shares can replace traditional file servers.

Download a Visio file of this architecture.


Typical uses for this architecture include:
Replace or supplement on-premises file servers. All companies use file servers. Azure Files can completely
replace or supplement traditional on-premises file servers or network-attached storage devices. With Azure file
shares and AD DS authentication, you can migrate data to Azure Files and take advantage of high availability
and scalability while minimizing client changes.
Lift and shift. Azure Files makes it easy to "lift and shift" applications that expect a file share to store application
or user data to the cloud.
Backup and disaster recovery. You can use Azure file shares as storage for backups or for disaster recovery
to improve business continuity. You can use Azure file shares to back up your data from existing file servers
while preserving configured Windows discretionary access control lists. Data that's stored on Azure file shares
isn't affected by disasters that might affect on-premises locations.
Azure File Sync. With Azure File Sync, Azure file shares can replicate to Windows Server, either on-premises or
in the cloud, for performance and distributed caching of data where it's being used.

Architecture
The architecture consists of the following components:
Azure Active Directory tenant. This is an instance of Azure Active Directory (Azure AD) that is created by
your organization. It acts as a directory service for cloud applications by storing objects copied from the on-
premises Active Directory and provides identity services when accessing Azure file shares.
AD DS server. This is an on-premises directory and identity service. The AD DS directory is synchronized with
Azure AD to enable it to authenticate on-premises users.
Azure AD Connect sync server. This is an on-premises server that runs the Azure AD Connect sync service.
This service synchronizes information held in the on-premises Active Directory to Azure AD.
Virtual network gateway. This is an optional component used to send encrypted traffic between an Azure
virtual network and an on-premises location over the internet.
Azure file shares. Azure file shares provide storage for files and folders that you can access over Server
Message Block (SMB), Network File System (NFS), and Hypertext Transfer Protocol (HTTP) protocols. File shares
are deployed into Azure storage accounts.
Recovery Services vault. This is an optional component that provides Azure file shares backup.
Clients. These are AD DS member computers, from which users can access Azure file shares.

Recommendations
The following recommendations apply for most scenarios. Follow these recommendations unless you have a
specific requirement that overrides them.
Use general-purpose v2 (GPv2) or FileStorage storage accounts for Azure file shares
You can create an Azure file share in various kinds of storage accounts. Although general-purpose v1 (GPv1) and classic
storage accounts can contain Azure file shares, most new features of Azure Files are available only in GPv2 and
FileStorage storage accounts. An important difference between GPv2 and FileStorage storage accounts is that with
GPv2 storage accounts, data is stored on hard disk drive-based (HDD-based) hardware, while with FileStorage
storage accounts, data is stored on solid-state drive-based (SSD-based) hardware. For more information, refer to
Create an Azure file share.
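For illustration, the following PowerShell sketch creates one GPv2 account for standard (HDD-based) shares and one FileStorage account for premium (SSD-based) shares. Account names, resource group, and region are examples.

# Sketch: create storage accounts for standard and premium Azure file shares.
New-AzStorageAccount -ResourceGroupName 'rg-files' -Name 'stfilesstandard01' `
    -Location 'eastus' -Kind StorageV2 -SkuName Standard_LRS
New-AzStorageAccount -ResourceGroupName 'rg-files' -Name 'stfilespremium01' `
    -Location 'eastus' -Kind FileStorage -SkuName Premium_LRS
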
Create Azure file shares in storage accounts that contain only Azure file shares
Storage accounts allow you to use different storage services, such as Azure file shares, blob containers, and tables in
the same storage account. All storage services in a single storage account share the same storage account limits.
Mixing storage services in the same storage account makes it more difficult to troubleshoot performance issues.

NOTE
Deploy each Azure file share in its own separate storage account, if possible. If multiple Azure file shares are deployed into the
same storage account, they all share the storage account limits.

Use premium file shares for workloads that require high throughput
Premium file shares are deployed to FileStorage storage accounts and are stored on solid-state drive-based (SSD-
based) hardware. This makes them suitable for storing and accessing data that requires consistent performance,
high throughput, and low latency, such as databases. You can store other workloads, which are less sensitive to
performance variability, such as general-purpose file shares and dev/test environments, on standard file shares. For
more information, refer to How to create a premium Azure file share.
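As a sketch, a premium share with provisioned capacity can be created in the FileStorage account like this (names and quota are examples):

# Sketch: create a premium file share with 1 TiB of provisioned capacity.
New-AzRmStorageShare -ResourceGroupName 'rg-files' `
    -StorageAccountName 'stfilespremium01' `
    -Name 'sqlbackups' `
    -QuotaGiB 1024
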
Always require encryption when accessing Azure file shares
Always use encryption in transit when accessing data in Azure file shares (encryption in transit is enabled by
default). This means that Azure Files will only allow the connection if it is made with a protocol that uses encryption,
such as SMB 3.0. Clients that do not support SMB 3.0 will not be able to mount the Azure file share if encryption in
transit is required.
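The storage account's secure transfer setting is what enforces encryption in transit. The following sketch keeps it enabled (account and resource group names are examples):

# Sketch: enforce secure transfer so that only encrypted SMB connections are allowed.
Set-AzStorageAccount -ResourceGroupName 'rg-files' `
    -Name 'stfilesstandard01' `
    -EnableHttpsTrafficOnly $true
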
Use VPN if the port that SMB uses (port 445) is blocked
Many internet service providers block Transmission Control Protocol (TCP) port 445, which is used to access
Azure file shares. If unblocking TCP port 445 is not an option, you can access Azure file shares over an
ExpressRoute or virtual private network (VPN) connection (site-to-site or point-to-site) to avoid traffic blocking. For
more information, refer to Configure a Point-to-Site (P2S) VPN on Windows for use with Azure Files and Configure
a Site-to-Site VPN for use with Azure Files.
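Before choosing a connectivity option, you can check from a client whether outbound TCP 445 to the share's endpoint is open, as in this sketch (the account name is an example):

# Sketch: test whether the network path allows SMB traffic to the Azure Files endpoint.
Test-NetConnection -ComputerName 'stfilesstandard01.file.core.windows.net' -Port 445 |
    Select-Object ComputerName, TcpTestSucceeded
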
Consider using Azure File Sync with Azure file shares
Azure File Sync is a service that allows you to cache Azure file shares on an on-premises Windows Server file
server. With cloud tiering enabled, File Sync helps ensure that a file server always has free available space while
making more files available than a file server could store locally. If you have on-premises Windows Server file
servers, consider integrating file servers with Azure file shares by using Azure File Sync. For more information, refer
to Planning for an Azure File Sync deployment.

Scalability considerations
Azure file share size is limited to 100 tebibytes (TiB). There is no minimum file share size and no limit on the
number of Azure file shares.
The maximum size of a file in a file share is 1 TiB, and there is no limit on the number of files in a file share.
The maximum I/O operations per second (IOPS) rate is 10,000 IOPS per standard file share and 100,000 IOPS per
premium file share.
The maximum throughput for a single standard file share is up to 300 mebibytes per second (MiB/s), and up to 6,204
MiB/s for premium file shares.
IOPS and throughput limits are per Azure storage account and are shared between Azure file shares in the same
storage account.
For more information, refer to Azure Files scalability and performance targets.

Availability considerations
NOTE
An Azure storage account is the parent resource for Azure file shares. An Azure file share has the level of redundancy that is
provided by the storage account that contains the share.

Azure file shares currently support the following data redundancy options:
Locally redundant storage (LRS) . Data is copied synchronously three times within a single physical
location in the primary region. This protects against loss of data because of hardware faults, such as a bad
disk drive.
Zone-redundant storage (ZRS) . Data is copied synchronously across three Azure availability zones in
the primary region. Availability zones are unique physical locations within an Azure region. Each zone
consists of one or more datacenters equipped with independent power, cooling, and networking.
Geo-redundant storage (GRS) . Data is copied synchronously three times within a single physical
location in the primary region using LRS. Your data is then copied asynchronously to a single physical
location in the secondary region. Geo-redundant storage provides six copies of your data spread between
two Azure regions.
Geo-zone-redundant storage (GZRS) . Data is copied synchronously across three Azure availability
zones in the primary region using ZRS. Your data is then copied asynchronously to a single physical
location in the secondary region.
Premium file shares can be stored in locally redundant storage (LRS) and zone redundant storage (ZRS) only.
Standard file shares can be stored in LRS, ZRS, geo-redundant storage (GRS), and geo-zone-redundant storage
(GZRS). For more information, refer to Planning for an Azure Files deployment and Azure Storage redundancy.
Azure Files is a cloud service, and as with all cloud services, you must have internet connectivity to access Azure
file shares. A redundant internet connection solution is highly recommended to avoid disruptions.

Manageability considerations
You can manage Azure file shares by using the same tools as any other Azure service, including Azure portal,
Azure Command-Line Interface, and Azure PowerShell.
Azure file shares enforce standard Windows file permissions. You can configure directory or file-level
permissions by mounting an Azure file share and configuring permissions using File Explorer, Windows
icacls.exe command, or the Set-Acl Windows PowerShell cmdlet.
You can use Azure file share snapshot for creating a point-in-time, read-only copy of the Azure file share data.
You create a share snapshot at the file share level. You can then restore individual files in the Azure portal or in
File Explorer, where you can also restore a whole share. You can have up to 200 snapshots per share, which
enables you to restore files to different point-in time versions. If you delete a share, its snapshots are also
deleted. Share snapshots are incremental. Only the data that has changed after your most recent share snapshot
is saved. This minimizes the time required to create the share snapshot and saves on storage costs. Azure file
share snapshots are also used when you protect Azure file shares with Azure Backup. For more information,
refer to Overview of share snapshots for Azure Files.
You can prevent accidental deletion of Azure file shares by enabling soft delete for file shares. If a file share is
deleted while soft delete is enabled, the file share transitions to a soft-deleted state instead of being permanently
erased. You can configure the amount of time soft-deleted data is recoverable before it's permanently deleted
and restore the share anytime during this retention period. For more information, refer to Enable soft delete on
Azure file shares.

NOTE
Azure Backup enables soft delete for all file shares in the storage account when you configure backup for the first Azure file
share in the respective storage account.

NOTE
Both standard and premium file shares are billed on used capacity when soft deleted, rather than provisioned capacity.

Security considerations
Use AD DS authentication over SMB for accessing Azure file shares. This provides the same seamless single
sign-on (SSO) experience when accessing Azure file shares as when accessing on-premises file shares. For more
information, refer to How it works, and feature enablement steps. Your client needs to be domain joined to
AD DS, because authentication is still performed by an AD DS domain controller. In addition, you need to assign
both share-level and file/directory-level permissions to get access to the data. Share-level permission
assignment uses the RBAC model, while file/directory-level permissions are managed as Windows ACLs.

NOTE
Access to Azure file shares is always authenticated. Azure file shares don't support anonymous access. Besides
identity-based authentication over SMB, users can also authenticate to an Azure file share by using the storage
account access key or a shared access signature (SAS). A mount sketch that uses the account key follows at the
end of this section.

All data that is stored on Azure file share is encrypted at rest using Azure storage service encryption (SSE).
SSE works similarly to BitLocker Drive Encryption on Windows, where data is encrypted beneath the file
system level. By default, data stored in Azure Files is encrypted with Microsoft-managed keys. With
Microsoft-managed keys, Microsoft maintains the keys to encrypt/decrypt the data and manages rotating
them on a regular basis. You can also choose to manage your own keys, which gives you control over the
rotation process.
All Azure storage accounts have encryption in transit enabled by default. This means that all communication
with Azure file shares is encrypted. Clients that don't support encryption can't connect to Azure file shares. If
you disable encryption in transit, clients that run older operating systems, such as Windows Server 2008 R2
or older Linux, can also connect. In such instances, data is not encrypted in transit from Azure file shares.
By default, clients can connect to Azure file share from anywhere. You can limit from which networks clients
can connect to Azure file shares by configuring Firewall and virtual networks and private endpoint
connections. For more information, refer to Configure Azure Storage firewalls and virtual networks and
Configuring Azure Files network endpoints.
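The following sketch mounts an Azure file share over SMB by using the storage account key; with AD DS authentication enabled, domain users would instead map the drive with their own credentials. The account name, share name, and drive letter are examples.

# Sketch: map a drive to an Azure file share by using the storage account key.
$secureKey = ConvertTo-SecureString -String '<storage-account-key>' -AsPlainText -Force
$credential = New-Object System.Management.Automation.PSCredential('AZURE\stfilesstandard01', $secureKey)
New-PSDrive -Name Z -PSProvider FileSystem `
    -Root '\\stfilesstandard01.file.core.windows.net\sqlbackups' `
    -Credential $credential -Persist
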

Cost considerations
Azure Files has two storage tiers and two pricing models:
Standard storage: Uses HDD-based storage. There is no minimum file share size, and with standard
storage you pay only for used storage space. In addition, you need to pay for file operations, such as
enumerating a directory or reading a file.
Premium storage: Uses SSD-based storage. The minimum size for a premium file share is 100 gibibytes (GiB),
and you pay per provisioned storage space. When using premium storage, all file operations are free.
There are additional costs associated with file share snapshots and outbound data transfers (when transferring
data from Azure file shares, inbound data transfer is free). Data transfer costs depend on the actual amount of
transferred data and on the stock keeping unit (SKU) of your virtual network gateway (if a virtual network
gateway is used). Please refer to Azure Files Pricing and Azure Pricing calculator for the actual costs. Be aware
that the actual cost varies by Azure region and your individual contract. Contact a Microsoft sales representative
for additional information on pricing.
Azure Functions in a hybrid environment

This reference architecture illustrates multiple local branches of an organization that's spread out geographically.
Each location is using a Microsoft Azure function app that's configured with the Premium plan in a nearby cloud
region. The developers in this architecture are monitoring all the Azure function apps by using Azure Monitor as a
single pane of glass.

Download a Visio file of this architecture.


Typical uses for this architecture include:
Organizations with many physical locations that are connected to a virtual network in Azure to communicate
with Azure Functions.
High-growth workloads that are using Azure Functions locally and maintaining the option to use Azure for any
unexpected bursts in work.

Architecture
The architecture consists of the following components:
Azure Functions. Azure Functions is a serverless platform as a service (PaaS) in Azure that runs small, single-
task code without requiring new infrastructure to be spun up. The Azure Functions Premium plan adds the
ability to communicate with Azure Functions privately over a virtual network.
Azure Virtual Network. Azure virtual networks are private networks built on the Azure cloud platform so that
Azure resources can communicate with each other in a secure fashion. Azure virtual network service endpoints
ensure that Azure resources can only communicate over the secure virtual network backbone.
On-premises network. In this architecture, the organization has created a secure private network that
connects the various branches. This private network is connected to their Azure virtual networks by using a site-
to-site connection.
Developer workstations. In this architecture, individual developers might work on code for Azure Functions
entirely on the secure private network or from any remote location. In either scenario, developers have access to
Azure Monitor to query or observe metrics and logs for the function apps.
Recommendations
The following recommendations apply for most scenarios. Follow these recommendations unless you have a
specific requirement that overrides them.
Designing for a serverless architecture
Traditional enterprise applications trend toward a monolithic application architecture in which one code "solution"
runs the entire organization's business logic. With Azure Functions, the best practice is to design for a serverless
architecture in which individual functions perform single tasks. These single tasks are designed to run quickly and
integrate into larger workflows.
Serverless architecture on Azure Functions has many benefits, including:
Applications can automatically scale by individual business functions instead of scaling the entire solution. This
can help keep costs down by scaling only what's needed for each task to serve existing workloads.
Azure Functions provides declarative bindings for many Azure services, reducing the amount of code your team
needs to write, test, and maintain.
Individual functions can be reused, reducing the amount of repeated code that's necessary for large enterprise
solutions.
Running Azure Functions on-premises
You may choose to have Azure Functions run on-premises rather than in Azure; for example:
Your team might want to run Azure Functions within an existing on-premises Kubernetes installation.
In development, your team may find it easier to develop locally using the command line interface instead of the
in-portal editor.
Your functions will run locally with the toolset installed on on-premises VMs.
You can run Azure Functions on-premises in three ways:
Azure Functions Core Tools. Azure Functions Core Tools is a developer suite that typically installs from the node
package manager (npm). It allows developers to develop, debug, and test function apps at the command prompt
on a local computer (a short sketch follows this list).
Azure Functions Docker container image. You can use this container image as a base image for containers
that run Azure Functions on a Docker host or in Kubernetes.
Kubernetes. Azure Functions supports seamless event-driven scale within a Kubernetes cluster by using
Kubernetes-based Event Driven Autoscaling (KEDA). To review best practices for managing Azure Kubernetes
Service clusters and Azure Arc enabled Kubernetes clusters, review the Run containers in a hybrid environment
reference architecture.
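For local development with Azure Functions Core Tools, the following sketch scaffolds and runs a function app from a PowerShell command line. It assumes the Core Tools and a supported runtime are installed; the app, function, and template names are examples.

# Sketch: create and run a function app locally with Azure Functions Core Tools.
func init MyHybridFunctionApp --worker-runtime powershell
cd .\MyHybridFunctionApp
func new --name HttpPing --template "HTTP trigger"
func start
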
Network connectivity
Creating function apps by using the Premium plan opens up the possibility of highly secure cross-network
connectivity between Azure virtual networks, Azure and on-premises networks, and the networks for each on-
premises branch.
You should use either a site-to-site or an Azure ExpressRoute connection between Azure Virtual Network and on-
premises networks. This allows the on-premises branches to communicate with the function apps in Azure by using
their service endpoints.
Additionally, each virtual network in Azure should also use virtual network peering to enable communication
between individual function apps across regions.

Scalability considerations
Azure Functions code should be designed so that it can scale out endlessly. Consider race conditions, leased files,
and other workloads that might cause one function run to block another. Also consider writing all Azure
Functions code to be stateless and defensive in its design.
For function apps that use Azure Storage accounts in triggers or bindings, don't reuse the same account that's
used to store metadata about the function apps and their runs.

Availability considerations
Typically, Azure Functions on the consumption plan can scale down to zero instances. When a new event triggers
a function app, a new instance must be created with your code running on it. The latency that's associated with
this process is referred to as a cold start. The Azure Functions Premium plan offers the option to configure pre-
warmed instances that are ready for any new requests. You can configure the number of pre-warmed instances
up to the minimum number of instances in your scale-out configuration.
Consider having multiple Premium plans in multiple regions and using Azure Traffic Manager to route requests
appropriately.

Manageability considerations
Azure Functions must be in an empty subnet that's a different subnet than your other Azure resources. This
might require more planning when designing subnets for your virtual network.
Consider creating proxies for every on-premises resource that Azure Functions might need to access. This can
protect your application integrity against any unanticipated on-premises networking changes.
Use Azure Monitor to observe analytics and logs for Azure Functions across your entire solution.

DevOps considerations
Ideally, deployment operations should come from a single team (Dev or DevOps), not from each individual
branch. Consider using a modern workflow system like Azure Pipelines or GitHub Actions to deploy function
apps in a repeatable manner across all Azure regions and potentially on-premises.
Use your workflow system to automate the redeployment of code to Azure Functions as the code is updated and
tagged for release.
Use deployment slots to test Azure Functions prior to their final push to production.

Cost considerations
Use the Azure pricing calculator to estimate costs.
The Azure Functions Premium plan is required for Azure Virtual Network connectivity, private site access, service
endpoints, and pre-warmed instances.
The Azure Functions Premium plan bills on instances instead of consumption. The minimum of a single instance
ensures there will be at least some monthly bill even without runs. You can set a maximum instance count to
control costs for workloads that may burst in size.
Back up files and applications on Azure Stack Hub

This document describes the architecture and design considerations of a solution that delivers an optimized
approach to backup and restore of files and applications of VM-based user workloads hosted on Azure Stack Hub.

Download a Visio file of this architecture.


Backup and restore are essential components of any comprehensive business continuity and disaster recovery
strategy. Designing and implementing a consistent and reliable backup approach in a hybrid environment is
challenging, but can be considerably simplified through integration with services provided by Microsoft Azure. This
applies not only to the workloads running on traditional on-premises infrastructure, but also to those hosted by
third-party public and private cloud providers. However, benefits of integration with Azure cloud services are
particularly evident when the hybrid environments incorporate Azure Stack portfolio offerings, including the Azure
Stack Hub.
While one of the primary strengths of Azure Stack Hub is its support for the platform-as-a-service (PaaS) model, it
also helps customers to modernize their existing infrastructure-as-a-service (IaaS) workloads. Such workloads
might include file shares, Microsoft SQL Server databases, Microsoft SharePoint farms, and Microsoft Exchange
Server clusters. Migrating them to VMs running on hyperconverged, highly-resilient clusters, with administrative
and programming models consistent with Microsoft Azure, results in minimized management and maintenance
overhead.
For implementing backup of files and applications running on Azure Stack Hub VMs, Microsoft recommends a
hybrid approach that relies on a combination of cloud and on-premises components to deliver a scalable,
performant, resilient, secure, straightforward to manage, and cost-efficient backup solution. The central component
of this solution is Microsoft Azure Backup Server (MABS) v3, which is part of the Azure Backup offering. MABS
relies on Azure Stack Hub infrastructure for compute, network, and short-term storage resources, while it leverages
Azure-based storage to serve as the long-term backup store. This minimizes or eliminates the need to maintain
physical backup media such as tapes.

NOTE
MABS is based on Microsoft System Center Data Protection Manager (DPM) and provides similar functionality with just a few
differences. However, DPM is not supported for use with Azure Stack Hub.
Architecture of the proposed solution
Cloud components include the following services:
An Azure subscription that hosts all cloud resources included in this solution.
An Azure Active Directory (Azure AD) tenant associated with the Azure subscription, which provides
authentication of Azure AD security principals to authorize access to Azure resources.
An Azure Recovery Services vault in the Azure region closest to an on-premises datacenter that will host the
Azure Stack Hub deployment.
Depending on the criteria presented in this document, cloud components could also include the following services:
An Azure ExpressRoute circuit between the on-premises datacenter and the Azure region hosting the Azure
Recovery Services vault, configured with Microsoft peering to accommodate larger backup sizes.
The Azure Import/Export service for enabling MABS offline backups to Azure.

NOTE
As of 08/20, MABS offline backup to Azure by using Azure Data Box is in preview.

Depending on the use of the Azure Import/Export service for MABS offline backup to Azure, the solution might also
have an Azure Storage account in the same Azure region as the Recovery Services vault.
The on-premises components include the following services:
An Azure Stack Hub–integrated system in the connected deployment model, running the current update
(2002 as of August 2020), and located within the customer's on-premises datacenter.
A Windows Server 2016 or Windows Server 2019 VM hosted by the Azure Stack Hub–integrated system and
running MABS v3 Update Release (UR) 1.
Azure Stack Hub VMs with the MABS protection agent, which manages backups to and restores from the
MABS Azure Stack Hub VM. The MABS protection agent tracks changes to protected workloads, and transfers
the changes to the MABS data store. The protection agent also identifies data on its local computer that can
be protected, and plays a role in the recovery process.
A Microsoft Azure Recovery Services (MARS) agent installed on the server running MABS, providing
integration between MABS and Azure Recovery Services vault.

NOTE
MARS agent is also referred to as an Azure Backup agent.

Core functionality
The proposed solution supports the following functionality on Azure Stack Hub VMs running Windows Server
2019, 2016, 2012 R2, 2012, 2008 R2 SP1 (64-bit with Windows Management Framework 4.0), 2008 SP2 (64-bit
with Windows Management Framework 4.0), and Windows 10 (64-bit):
Backup and restore of New Technology File System (NTFS) and Resilient File System (ReFS) volumes, shares,
folders, files, and system state.
Backup and restore of SQL Server 2019, 2017, 2016 (with required service packs [SPs]), and 2014 (with
required SPs) instances and their databases.
Backup and restore of Exchange Server 2019 and Exchange Server 2016 servers and databases, including
standalone servers and databases in a database availability group (DAG).
Restore of individual mailboxes and mailbox databases in a DAG.
Backup and restore of SharePoint 2019 and SharePoint 2016 (with the latest SPs) farms and front-end web
server content.
Restore of SharePoint 2019 and SharePoint 2016 databases, web applications, files, list items, and search
components.

NOTE
To deploy Windows 10 client operating systems on Azure Stack Hub, you must have Windows per-user licensing or
have purchased it through a Qualified Multitenant Hoster (QMTH).

MABS implements the disk-to-disk-to-cloud (D2D2C) backup scheme, with the primary backup stored locally on the
server hosting the MABS installation. Local backups are subsequently copied to an Azure Recovery Services vault. The
local disk functions as short-term storage, while the vault provides long-term storage.

NOTE
Unlike DPM, MABS doesn't support tape backups.

At a high level, the backup process consists of the following four stages:
1. You install the MABS protection agent on a computer you want to protect, and add it to a protection group.
2. You set up protection for the computer or its app, including backup to MABS local disks for short-term storage
and to Azure for long-term storage. As part of the setup, you specify the backup schedule for both types of
backups.
3. The protected workload backs up to the local MABS disks according to the schedule you specified.
4. The local backup stored on MABS disks is backed up to the Azure Recovery Services vault by the MARS agent
running on the MABS server.

Prerequisites
Implementing the recommended solution is contingent on meeting the following prerequisites:
Access to an Azure subscription, with permissions sufficient to provision an Azure Recovery Services vault in
the Azure region closest to an on-premises datacenter hosting the Azure Stack Hub deployment.
An Active Directory Domain Services (AD DS) domain accessible from an Azure Stack Hub VM that will host
a MABS instance.
An Azure Stack Hub-hosted VM that will run a MABS instance, satisfying the prerequisites listed in Install
Azure Backup Server on Azure Stack and with outbound connectivity to URLs listed in DPM/MABS
networking support.

NOTE
Additional disk space and performance considerations for MABS are described in more detail later in this document.
NOTE
To validate whether the VM hosting MABS has connectivity to the Azure Backup service, you can use the Get-
DPMCloudConnection cmdlet (included in the Azure Backup Server PowerShell module).

NOTE
MABS also requires a local instance of SQL Server. For details regarding SQL Server requirements, refer to Install and
upgrade Azure Backup Server.

Availability considerations
Azure Stack Hub helps increase workload availability through resiliency inherent to its infrastructure. You can
further enhance the degree of this availability by designing and implementing solutions with the purpose of
extending the scope of workload protection. This is the added value that MABS provides. In the context of MABS
running on Azure Stack Hub, there are two aspects of availability that need to be explored in more detail:
Availability of MABS and its data stores
Availability of the point-in-time restore capability of MABS-protected workloads
You need to consider both of these when developing a backup strategy driven by recovery point objectives (RPOs)
and recovery time objectives (RTOs). RTO and RPO represent continuity requirements stipulated by individual
business functions within an organization. RPO designates a time period representing maximum acceptable data
loss following an incident that affected availability of that data. RTO designates the maximum acceptable duration of
time it can take to reinstate business functions following an incident that affected availability of these functions.

NOTE
In general, to address the RTO requirements for Azure Stack Hub workloads, you should account for recovery of the Azure
Stack infrastructure, user VMs, applications, and user data. In the context of this reference architecture document, we are
interested only in the last two of these components - applications and user data - although we also present considerations
regarding availability of the Modern Backup Storage (MBS) functionality.

MABS and its data stores availability is contingent on the availability of the VM hosting the MABS installation and its
local and cloud-based storage. Azure Stack Hub VMs are highly available by design. In case of a MABS failure, you
have the ability to restore Azure Backup–protected items from any other Azure Stack Hub VM hosting MABS. Note,
however, that for a server hosting MABS to recover backups performed by using MABS running on another server,
both servers must be registered with the same Azure Recovery Services vault.

NOTE
In general, you could deploy another instance of MABS and configure it to back up the primary MABS deployment, similar to
the primary-to-secondary protection, chaining, and cyclic protection configurations available when using DPM. However, this
approach is not supported with MABS and it would not yield meaningful availability advantages in the scenario described in
this reference architecture document.

The point-in-time restore capability of MABS-protected workloads is dependent to a large extent on the type of
data, its backups, and its protection policies. To understand these dependencies, it's necessary to explore these
concepts in more detail.
Data types
From the perspective of MABS, there are two data types to consider:
File data is data that typically resides on file servers (such as Microsoft Office files, text files, or media files), and
which needs to be protected as flat files.
Application data is data that exists on application servers (such as Exchange storage groups, SQL Server
databases, or SharePoint farms) and which requires MABS to be aware of the corresponding application
requirements.

NOTE
As an alternative to file data backup with MABS, it is possible to install the MARS agent directly on Azure Stack Hub VMs and
back up their local file system directly to an Azure Recovery Services vault. However, unlike MABS, this approach does not
provide centralized management and always relies on cloud-based storage for backups and restores.

Backup types
Whether you're protecting file data or application data, protection begins with creating a replica of the data source
in the local storage of a MABS instance. The replica is synchronized or updated at regular intervals according to the
settings that you configure. The method that MABS uses to synchronize the replica depends on the type of data
being protected. If a replica is identified as being inconsistent, MABS performs a consistency check, which is a block-
by-block verification of the replica against the data source.
For a file volume or share on a server, after the initial full backup, the MABS protection agent uses a volume filter
and change journal to determine which files have changed. It then performs a checksum procedure for these files to
synchronize only the changed blocks. During synchronization, these changes are transferred to MABS and then
applied to the replica, thereby synchronizing the replica with the data source.
If a replica becomes inconsistent with its data source, MABS generates an alert that specifies which computer and
which data sources are affected. To resolve the issue, you can repair the replica by initiating a synchronization with
consistency check on the replica. During a consistency check, MABS performs a block-by-block verification and
repairs the replica to return it to consistency with the data source. You can schedule a daily consistency check for
protection groups or initiate a consistency check manually.
At regular intervals that you can configure, MABS creates a recovery point for the protected data source. A recovery
point is a version of the data which can be recovered.
For application data, after the replica is created by MABS, changes to volume blocks that belong to application files
are tracked by the volume filter. How changes are transferred to the MABS server depends on the application and
the type of synchronization. The operation that is labeled synchronization in MABS Administrator Console is
analogous to an incremental backup, and it creates a transactionally consistent, accurate reflection of the application
data when combined with the replica.
During the type of synchronization that is labeled express full backup in MABS Administrator Console, a full Volume
Shadow Copy Service (VSS) snapshot is created, but only changed blocks are transferred to the MABS server.
Each express full backup creates a recovery point for application data. If the application supports incremental
backups, each synchronization also creates a recovery point. The synchronization process is application-dependent:
For Exchange data, synchronization transfers an incremental VSS snapshot using the Exchange VSS writer.
Recovery points are created for each synchronization and for each express full backup.
SQL Server databases that are log-shipped, that are in read-only mode, or that use the simple recovery model,
don't support incremental backup. Recovery points are created for each express full backup only. For all other
SQL Server databases, synchronization transfers a transaction log backup, with recovery points created for each
incremental synchronization and express full backup. The transaction log is a serial record of all the transactions
that have been performed against the database since the transaction log was last backed up.
SharePoint servers don't support incremental backup. Recovery points are created for each express full backup
only.
Incremental synchronizations require less time than performing an express full backup. However, the time required
to recover data increases as the number of synchronizations increases. This is because MABS must restore the last
full backup and then restore and apply all the incremental synchronizations up to the point in time selected for
recovery.
To enable faster recovery time, MABS regularly performs an express full backup, which updates the replica to
include the changed blocks. During the express full backup, MABS takes a snapshot of the replica before updating
the replica with the changed blocks. To enable more frequent RPOs and reduce the data loss window, MABS also
performs incremental synchronizations in the time between two express full backups.
As with file data protection, if a replica becomes inconsistent with its data source, MABS generates an alert that
specifies which server and which data sources are affected. To resolve the inconsistency, you can repair the replica
by initiating a synchronization with consistency check on the replica. During a consistency check, MABS performs a
block-by-block verification of the replica and repairs it to bring it back into consistency with the data sources. You
can schedule a daily consistency check for protection groups or initiate a consistency check manually.
Protection policies
A computer or its workload becomes protected when you install the MABS protection agent software on the
computer and add its data to a protection group. Protection groups are used to configure and manage protection of
data sources on computers. A protection group is a collection of data sources that share the same protection
configuration. The protection configuration is the collection of settings that are common to a protection group, such
as the protection group name, protection policy, storage allocations, and replica creation method.
MABS stores a separate replica of each protection group member in the storage pool. A protection group member
can contain such data sources as:
A volume, share, or folder on a file server or server cluster.
A storage group of an Exchange server or server cluster.
A database of an instance of SQL Server or server cluster.
You configure a protection policy for each protection group based on the recovery goals that you set up for that
protection group. Recovery goals represent data protection requirements corresponding to your organization's
RTOs and RPOs. Within MABS, they are defined based on a combination of the following parameters (a scripted example appears after this list and its notes):
Short-term retention range. This determines how long you want to retain backed up data on the local MABS
storage.
Synchronization and recovery point frequencies. This corresponds directly to data loss tolerance, which in
turn reflects your organization's RPOs. It also determines how often MABS should synchronize its local
replicas with protected data sources by collecting their data changes. You can set the synchronization
frequency to any interval between 15 minutes and 24 hours. You can also select to synchronize just before a
recovery point is created rather than on a specified time schedule.
Short-term recovery point schedule. This establishes how many recovery points should be created in the
local storage for the protection group. For file protection, you select the days and times for which you want
recovery points created. For data protection of applications that support incremental backups, the
synchronization frequency determines the recovery point schedule.
Express full backup schedule. For data protection of applications that don't support incremental backups and
do support express full backups, this determines the recovery point schedule.
Online backup schedule. This determines frequency of creating a copy of local backups in the Azure Recovery
Services vault that's associated with the local MABS instance. You can schedule it on a daily, weekly, monthly,
or yearly basis, with a maximum allowed frequency of two backups per day. MABS automatically creates a
recovery point for online backups using the most recent local replica, without transferring new data from the
protected data source.
NOTE
A Recovery Services vault supports up to 9,999 recovery points.

Online retention policy. This specifies the time period during which daily, weekly, monthly, and yearly
backups will be retained in the Azure Recovery Services vault associated with the local MABS instance.

NOTE
To protect the latest content of the data source online, create a new recovery point on the local disk before creating
an online recovery point.

NOTE
By default, Azure Recovery Services vault is geo-redundant, meaning that any backup copied to its storage is
automatically replicated to an Azure region that is part of a pre-defined region pair. You have the option to change
the replication settings to locally redundant if that's sufficient for your resiliency needs and if you need to minimize the
storage costs. However, you should consider keeping the default setting. In addition, keep in mind that this option
cannot be changed if the vault contains any protected items.

NOTE
For the listing of Azure region pairs, refer to Business continuity and disaster recovery (BCDR): Azure Paired Regions.

Scalability and performance considerations


When planning to deploy MABS on Azure Stack Hub, you need to consider the amount of processing, storage, and
network resources allocated to the VM (or VMs) hosting the deployment. Microsoft recommends allocating a
quad-core 2.33-gigahertz (GHz) CPU to account for MABS processing needs, and about 10 GB of disk space to
accommodate the installation binaries. Additional storage requirements can be divided into the following
categories:
Disk space for backups. The general recommendation for backup disk space is to allocate a storage pool disk
space that's equivalent to about 1.5 times the size of all data to be backed up. After the disks are attached to
the VM, MABS handles volume creation and disk space management automatically. The number of disks you can attach to a VM
depends on its size.

NOTE
You should not store backups locally for more than five days. Backups older than five days should be offloaded to the
Azure Recovery Services vault.

Disk space for MARS agent cache location. Consider using drive C on the VM hosting the MABS installation.
Disk space for local staging area during restores. Consider using the temporary drive D on the VM hosting
the MABS installation.
To provision storage for the VM hosting the MABS installation, use managed disks with Premium performance tier.
The expected performance characteristics are 2,300 I/O operations per second (IOPS) and 145 megabytes
(MB) per second of throughput per disk. (Note that, unlike Azure, Azure Stack Hub doesn't provide performance guarantees.)
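The following minimal sketch shows one way to add a 1 TB premium managed data disk to the VM hosting MABS by using the Az PowerShell module against the Azure Stack Hub user environment. The resource group, VM, and disk names are hypothetical, and you should confirm that your Azure Stack Hub API profile supports these cmdlets.

```powershell
# Assumes an existing connection to the Azure Stack Hub user environment (names are hypothetical).
$resourceGroup = "rg-mabs"
$vmName        = "vm-mabs01"
$location      = (Get-AzResourceGroup -Name $resourceGroup).Location

# Create an empty premium managed disk for the MABS storage pool.
$diskConfig = New-AzDiskConfig -Location $location -SkuName Premium_LRS `
    -CreateOption Empty -DiskSizeGB 1024
$disk = New-AzDisk -ResourceGroupName $resourceGroup -DiskName "mabs-pool-disk-01" -Disk $diskConfig

# Attach the disk to the MABS VM and apply the change.
$vm = Get-AzVM -ResourceGroupName $resourceGroup -Name $vmName
$vm = Add-AzVMDataDisk -VM $vm -Name "mabs-pool-disk-01" -ManagedDiskId $disk.Id `
    -Lun 1 -CreateOption Attach
Update-AzVM -ResourceGroupName $resourceGroup -VM $vm
```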
To obtain a more accurate estimate of storage required to accommodate Azure Stack Hub–based workload backups,
consider using the Azure Stack VM Size Calculator for MABS, which is available from Microsoft Downloads. The
calculator is implemented as a Microsoft Excel workbook with macros that derive optimum Azure Stack Hub sizing
information based on a number of parameters you provide. These parameters include:
Source details containing a list of VMs to be protected, including for each:
The size of protected data
The workload type (Windows Server, SharePoint, or SQL Server)
Data retention range, in days
Each workload type is by default associated with a predefined daily rate of changes (or churn). You have the option
to adjust these values if they don't reflect the actual utilization patterns in your environment.
The Azure Stack VM Size Calculator for MABS will use the information you specified to provide:
An estimated size of the Azure Stack Hub VM to host the installation of MABS.
An estimated amount of MABS disk space needed to host backed up data.
A total number of disks at 1 terabyte (TB) each.
IOPS available for MABS usage.
An estimated time to complete the initial backup (based on the total size of protected data and the IOPS
available for MABS usage).
An estimated time to complete daily backups (based on the total size of daily churn and the IOPS available for
MABS usage).

NOTE
Azure Stack VM Size Calculator for MABS was released in April 2018, which means that it doesn't take into account
optimizations incorporated into MABS v3 (including those included in UR1). However, it does include enhancements specific to
MBS, which was introduced in MABS v2 released in June 2017.

If you create a protection group by using the MABS graphical interface, whenever you add a data source to a
protection group, MABS will also automatically calculate local disk space allocation based on the short-term
recovery goals you specify. At that point, you have the option to decide how much space to allocate in the storage
pool to replicas and recovery points for each data source that you have selected for membership in the group. In
addition, you also need to ensure that there is sufficient space on local disks of protected servers for the change
journal. MABS provides default space allocations for the members of the protection group. For details regarding the
default space allocations for different MABS components, refer to Deploy protection groups documentation.
Consider using the default space allocations unless you are certain that they don't meet your needs. Overriding the
default allocations can result in allocation of too little or too much space. Allocation of too little space for the
recovery points can prevent MABS from storing enough recovery points to meet your retention range objectives.
Allocation of too much space wastes disk capacity. After creating a protection group, if you allocated too little space
for a data source, you can increase the allocations for the replica and recovery point volumes for each data source.
If you allocated too much space for the protection group, you can remove the data source from the protection
group and delete the replica. Then add the data source to the protection group with smaller allocations.
If you need to adjust the estimated sizing of the Azure Stack Hub VM hosting MABS after deployment to
accommodate changes in processing or storage requirements, you have three basic options to choose from (the first is illustrated in the sketch after this list):
Implement vertical scaling. This involves modifying the amount and type of processor, memory, and disk
resources of the Azure Stack Hub VMs hosting MABS.
Implement horizontal scaling. This involves provisioning or deprovisioning Azure Stack Hub VMs with MABS
installed to match processing demands of protected workloads.
Modify protection policies. This involves changing parameters of protection policies, including retention range,
recovery point schedule, and express full backup schedule.
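As a minimal sketch of the first option, the following resizes the Azure Stack Hub VM that hosts MABS by using the Az PowerShell module. The target size and resource names are hypothetical; confirm which VM sizes your Azure Stack Hub deployment offers before resizing.

```powershell
# Names and size are hypothetical; a resize restarts the VM, so schedule it outside backup windows.
$resourceGroup = "rg-mabs"
$vmName        = "vm-mabs01"

$vm = Get-AzVM -ResourceGroupName $resourceGroup -Name $vmName
$vm.HardwareProfile.VmSize = "Standard_DS3_v2"   # target size with more vCPUs and memory
Update-AzVM -ResourceGroupName $resourceGroup -VM $vm
```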
NOTE
It is important to keep in mind that MABS is subject to limits in regard to the number of recovery points, express full backups,
and incremental backups. For details regarding these limits, refer to Recovery process.

If you opt to automatically grow volumes, then MABS accounts for the increased backup volume as the production
data grows. Otherwise, MABS limits the backup storage to the size of data sources in the protection group.
From the networking standpoint, there are two main options to adjust available bandwidth:
Increase VM size. In the case of Azure Stack Hub VMs, their size determines maximum network bandwidth.
However, it's important to note that there are no bandwidth guarantees; instead, VMs are allowed to utilize the
amount of available bandwidth up to the limit determined by their size.
Increase throughput of uplink switches. Azure Stack Hub systems support a range of hardware switches that
offer several choices of uplink speeds. Each Azure Stack Hub cluster node has two uplinks to the top-of-rack
switches for fault tolerance. The system allocates half of the uplink capacity for critical infrastructure, and the
remainder is shared capacity for Azure Stack services and all user traffic. Systems deployed with faster speeds
have more bandwidth available for backup traffic.
While in general it's possible to segregate network traffic by attaching a second network adapter to a server, in the
case of Azure Stack Hub VMs, all VM traffic to the internet shares the same uplink. A second virtual network adapter
will not segregate traffic at the physical transport level.
To accommodate larger backup sizes, you could consider leveraging Azure ExpressRoute with Microsoft peering for
connections between Azure Stack Hub virtual networks and Azure Recovery Services vault. Azure ExpressRoute
extends on-premises networks into the Microsoft cloud over a private connection supplied by a connectivity
provider. You can purchase ExpressRoute circuits for a wide range of bandwidths, from 50 megabits per second
(Mbps) to 10 gigabits per second (Gbps).

NOTE
For details regarding implementing Azure ExpressRoute in Azure Stack Hub scenarios, refer to Connect Azure Stack Hub to
Azure using Azure ExpressRoute.

NOTE
MABS v3 leverages enhancements built into MBS, and optimizes network and storage usage by transferring only changed
data during consistency checks.

Manageability considerations
From the manageability standpoint, one of the primary considerations affecting your backup and restore strategy is
the configuration of protection groups and criteria you use to decide which protected workloads should belong to
the same protected groups. As described earlier in this document, a protection group is a collection of data sources
such as volumes, shares, or application data stores, which have common backup and restore settings. When
defining a protection group, you need to specify:
Data sources, such as servers and workloads you want to protect.
Back-up storage, including short-term and long-term protection settings.
Recovery points, which are points in time from which backed up data can be recovered.
Allocated disk space, which is the amount of disk space allocated for backups from the storage pool.
Initial replication, which is the method of an initial backup of data sources, which can involve its transfer online
(over a network) or offline (such as via Azure Import/Export service).
Consistency checks, which is the method of verifying integrity of backed up data.
When deciding which protected workloads should belong to the same protected groups, some of the most
common approaches include:
By computer. This combines all data sources for a computer into the same protection group.
By workload. This separates files and each application data type into different protection groups. However,
recovering a server hosting multiple workloads might require multiple restores from different protection
groups.
By RPO and RTO. This method groups data sources with similar RPOs. You control the RPO by setting the
synchronization frequency for the protection group, which determines the amount of potential data loss
(measured in time) during unexpected outages. In the scenario described in this reference architecture
document, you control the RTO by setting the retention period for the local short-term storage, because that
period determines how long backups can be restored from the local short-term storage rather than from
the cloud-based long-term storage. Restoring from the local short-term storage is faster.
By data characteristics. This approach accounts for frequency of data changes, data growth rate, or its storage
requirements as the criteria for groupings.
When naming protection groups, use a unique, meaningful name. The names can consist of any combination of
alphanumeric characters and can include spaces, but cannot exceed 64 characters in length.
In addition, when you create a protection group you must choose a method for creating the initial replica. The initial
replication copies all the data selected for protection to the server hosting MABS, and then copies it to the Azure
Recovery Services vault. (Each stage is also followed by a consistency check.) MABS can create replicas automatically
over the network, but you have the option to create replicas manually by backing up, transferring, and restoring
data offline.
When choosing a replica creation method, refer to the information available at Initial replication over the network,
including a table that provides estimates of how long MABS would take to create a replica automatically over the
network given different protected data sizes and network speeds.
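The replica creation method can also be set from the DPM PowerShell module. The following minimal sketch selects manual (offline) replica creation for a hypothetical protection group named FileServers; use -Now or -Later instead if you want MABS to replicate over the network.

```powershell
# Run in the DPM Management Shell on the MABS server (names shown are hypothetical).
$protectionGroup = Get-DPMProtectionGroup -DPMServerName "MABS01" |
    Where-Object { $_.FriendlyName -eq "FileServers" }
$editableGroup = Get-DPMModifiableProtectionGroup -ProtectionGroup $protectionGroup

# Use -Now for immediate network replication, -Later for a scheduled start,
# or -Manual when seeding the replica offline.
Set-DPMReplicaCreationMethod -ProtectionGroup $editableGroup -Manual
Set-DPMProtectionGroup -ProtectionGroup $editableGroup
```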
The offline-seeding process supports the Azure Import/Export service. The Azure Import/Export service provides
the ability to transfer data to an Azure Storage account by using SATA disks. This capability accommodates scenarios
in which customers have terabytes of initial backup data and a low-bandwidth network connection to Azure.
The offline-seeding workflow involves the following high-level steps:
1. A customer copies the initial backup data to one or more SATA disks by using the AzureOfflineBackupDiskPrep
tool.
2. The tool automatically generates an Azure Import job and an Azure AD app in the subscription hosting the target
Azure Storage account and Azure Recovery Services vault. The purpose of the app is to provide Azure Backup
with secure and scoped access to the Azure Import Service, required by the offline seeding process.
3. The customer ships SATA drives to the Azure datacenter hosting the target Azure Storage account.
4. Azure datacenter staff copies data from SATA disks to the Azure Storage account.
5. The workflow triggers a copy from the Azure Storage account to the Azure Recovery Services vault.
Testing restores
In addition to an optimally designed and implemented backup strategy, it is equally important to define, document,
and test the proper restore process for each type of protected workload. While MABS provides built-in consistency
checks that automatically verify integrity of backed up data, testing of restores should be part of routine operating
procedures, which validate that integrity based on the state of restored workloads. Results of such validation should
be available to workload owners.
In general, testing restores tends to be challenging because it requires an environment that closely resembles the
one hosting protected workloads. Azure Stack Hub, with its built-in DevOps and Infrastructure-as-Code capabilities,
considerably simplifies addressing this challenge.
Roles and responsibilities
Planning for and implementing backup and restore of Azure Stack Hub–based workloads typically involves
interaction among a number of stakeholders:
Azure Stack Hub operators. Azure Stack Hub operators manage Azure Stack Hub infrastructure, ensuring that
there are sufficient compute, storage, and network resources for implementing a comprehensive backup and
restore solution, and making these resources available to tenants. They also collaborate with application and
data owners to help determine the optimal approach to deploying their workloads to Azure Stack Hub.
Azure administrators. Azure administrators manage Azure resources necessary for implementing hybrid backup
solutions.
Azure AD administrators. Azure AD administrators manage Azure AD resources, including user and group
objects that are used to provision, configure, and manage Azure resources.
Azure Stack Hub tenant IT staff. These stakeholders design, implement, and manage MABS, including its backups
and restores.
Azure Stack Hub users. These users provide RPO and RTO requirements and submit requests to back up and
restore data and applications.

Security considerations
Managing user data and applications in hybrid scenarios warrants additional security considerations. These
considerations can be grouped into the following categories:
Backup encryption
Azure Recovery Services vault protection
MABS and Azure Backup enforce encryption of backups at rest and in transit:
Encryption at rest. One of the steps during MABS installation involves providing a passphrase, which is
subsequently used to encrypt all backups before they are uploaded to an Azure Recovery Services vault.
Decryption takes place only after backups are downloaded from that vault. The passphrase is available only to
the user who created the passphrase and the locally installed MARS agent. It's critical to ensure that the
passphrase is stored in a secure location, because it serves as the authorization mechanism when performing
cloud-based restores on a MABS server different from the one where the backups took place.
Encryption in transit. MABS v3 relies on Transport Layer Security (TLS) protocol version 1.2 to protect its
connections to Azure.
An Azure Recovery Services vault offers a number of mechanisms that further protect online backups (a scripted example follows this list), including:
Azure role-based access control (RBAC). RBAC allows for delegating and segregating responsibilities according
to the principle of least privilege. There are three Azure Backup-related built-in roles that restrict access to
backup management operations:
Backup Contributor. Provides access to create and manage backups, with the exception of deleting
Recovery Services vault and delegating access to others.
Backup Operator. Provides access equivalent to that of the Backup Contributor, with the exception of
removing backups and managing backup policies.
Backup Reader. Provides access to monitor backup management operations.
Azure Resource Locks. You have the option to create and assign read-only and delete locks to an Azure
Recovery Services vault to mitigate the risk of the vault being accidentally or maliciously changed or deleted.
Soft delete. Soft delete helps protect vault and backup data from accidental or malicious deletions. With soft
delete, if a user deletes a backup item, the corresponding data is retained for an additional 14 days, allowing for
its recovery with no data loss during that period. The additional 14-day retention of backup data in the soft
delete state doesn't incur any cost. Soft delete is enabled by default.
Protection of security-sensitive operations. The Azure Recovery Services vault automatically implements an
additional layer of authentication whenever a security-sensitive operation—such as changing a passphrase—is
attempted. This extra validation helps ensure that only authorized users are performing these operations.
Suspicious activity monitoring and alerts. Azure Backup provides built-in monitoring and alerting of security-
sensitive events related to Azure Backup operations. Backup Reports facilitate usage tracking, auditing of
backups and restores, and identifying key backup trends.
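The following minimal sketch applies several of these protections with the Az PowerShell module, assuming a vault named rsv-stackhub-backup in the resource group rg-backup and a hypothetical operator account; adapt the names and scope to your environment.

```powershell
# Names are hypothetical placeholders.
$vault = Get-AzRecoveryServicesVault -ResourceGroupName "rg-backup" -Name "rsv-stackhub-backup"

# Grant least-privilege access with a built-in Azure Backup role.
New-AzRoleAssignment -SignInName "backup.operator@contoso.com" `
    -RoleDefinitionName "Backup Operator" -Scope $vault.ID

# Add a delete lock to reduce the risk of accidental or malicious vault deletion.
New-AzResourceLock -LockName "rsv-do-not-delete" -LockLevel CanNotDelete `
    -ResourceGroupName "rg-backup" -ResourceName $vault.Name `
    -ResourceType "Microsoft.RecoveryServices/vaults"

# Confirm that soft delete is enabled (it is on by default).
Set-AzRecoveryServicesVaultProperty -VaultId $vault.ID -SoftDeleteFeatureState Enable
```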

DevOps considerations
While backup and restore are considered part of IT operations, there are some DevOps-specific considerations that
are worth incorporating into a comprehensive backup strategy. Azure Stack Hub facilitates automated deployment
of a variety of workloads, including VM-based applications and services. You can leverage this capability to
streamline MABS deployment to Azure Stack Hub VMs, which simplifies the initial setup in multitenant scenarios. By
combining Azure Resource Manager templates, VM extensions, and the DPM PowerShell module, it is possible to
automate configuration of MABS, including setup of its protection groups, retention settings, and backup schedules.
In keeping with DevOps best practices, store templates and scripts in source control and configure their
deployment through pipelines. These provisions will also help with minimizing recovery time in scenarios where
the infrastructure necessary to restore file and application data needs to be recreated.
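As a minimal sketch of that approach, the following deploys a source-controlled Resource Manager template to the Azure Stack Hub user environment from a pipeline step. The template and parameter file names, and the region name, are hypothetical; further MABS configuration could then be applied with a VM extension or the DPM PowerShell module.

```powershell
# Names and region are hypothetical; "local" is the default region name of an ASDK deployment.
New-AzResourceGroup -Name "rg-mabs" -Location "local" -Force

New-AzResourceGroupDeployment -ResourceGroupName "rg-mabs" `
    -TemplateFile ".\templates\mabs-vm.json" `
    -TemplateParameterFile ".\templates\mabs-vm.parameters.json"
```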

Cost considerations
When considering the cost of the backup solution described in this reference architecture document, remember to
account for both on-premises and cloud-based components. The pricing of on-premises components is determined
by the Azure Stack Hub pricing model. As with Azure, Azure Stack Hub offers a pay-as-you-use arrangement
available through enterprise agreements and the Cloud Solution Provider program. This arrangement includes a
monthly price per Windows Server VM. If you have the option to leverage existing Windows Server licenses, you
can significantly reduce that cost down to the base VM pricing. While MABS relies on SQL Server as its data store,
there is no licensing cost associated with running that SQL Server instance, provided it's used exclusively for MABS.
Azure-related charges are associated with use of the following resources:
Azure Backup. Pricing for Azure Backup is largely determined by the number of protected workloads, and the
size of backed-up data (before compression and encryption) for each. The cost is also affected by the choice
between locally redundant storage (LRS) and geo-redundant storage (GRS) for the replication of the Azure
Recovery Services vault content. For details, refer to Azure Backup pricing.
Azure ExpressRoute. Azure ExpressRoute pricing is based on one of two models:
Unlimited data. This is a monthly fee with all inbound and outbound data transfers included.
Metered data. This is a monthly fee with all inbound data transfers free of charge and outbound data
transfers charged per gigabyte.
Azure Import/Export. The cost for Azure Import/Export includes a flat, per-device fee for device handling.
Azure Storage. When using Azure Import/Export, standard Azure Storage rates and transaction fees apply.
Without ExpressRoute, you might have to account for increased bandwidth utilization of your Internet connections,
both during backups and restores. The cost will vary depending on a number of factors, including geographical
area, current bandwidth usage, and the choice of an Internet service provider.

Alternative solutions
The recommended solution described in this reference architecture document is understandably not the only way
to provide backup and restore functionality to user workloads running on Azure Stack Hub. Customers have other
options, including:
Local backup and restore by using the Windows Server Backup feature included in the Windows Server
operating system. Users can subsequently copy local backups to longer-term storage. This approach supports
application-consistent backups by relying on Windows VSS providers, but increases local disk space usage and
backup maintenance overhead.
Backup and restore by using Azure Backup with the locally installed MARS agent. This approach minimizes the
use of local disk space and automates the process of uploading backups to cloud-based storage. However, it
doesn't support application-consistent backups.
Backup and restore by using a backup solution installed in the same datacenter but outside of Azure Stack Hub.
This approach facilitates scenarios that involve an Azure Stack Hub disconnected deployment model.
Azure Stack Hub–level backup and restore by using disk snapshots. This approach requires that the VM being
backed up is stopped, which is typically not a viable option for business-critical workloads, but might be
acceptable in some scenarios.

Summary
In conclusion, Azure Stack Hub is a unique offering, which differs in many aspects from other virtualization
platforms. As such, it warrants special considerations in regard to business continuity strategies for workloads
running on its VMs. Leveraging Azure services helps simplify the design and implementation strategy. In this
architecture reference document, we explored using MABS for backup of file and application data on Azure Stack
Hub VMs in the connected deployment model. This approach allows customers to benefit from resiliency and
manageability of Azure Stack Hub, and from the hyperscale and global presence of the Azure cloud.
It's important to note that the backup solution described here focuses exclusively on file and application data on
Azure Stack Hub VMs. This is just a part of an overall business continuity strategy that should account for a variety
of other scenarios affecting workload availability. These could include localized hardware and software failures,
system outages, catastrophic events, and large-scale disasters.
Azure Automation Update Management
11/2/2020 • 17 minutes to read

This reference architecture illustrates how to design a hybrid update management solution to manage updates on
both Microsoft Azure and on-premises Windows and Linux computers.

Download a Visio file of this architecture.


Typical uses for this architecture include:
Managing updates across on-premises and in Azure using the Update Management component of Automation
Account.
Using scheduled deployments to orchestrate the installation of updates within a defined maintenance window.

Architecture
The architecture consists of the following components:
Log Analytics workspace - A Log Analytics workspace is a data repository for log data that's collected from
resources that run in Azure, on-premises, or in another cloud provider.
Automation Hybrid Worker solution - Create Hybrid Runbook Workers to run Azure Automation runbooks
on your Azure and non-Azure computers.
Automation account - This is a cloud service that automates configuration and management across your
Azure and non-Azure environments.
Hybrid Runbook Worker - This is a computer that's configured with the Hybrid Runbook Worker feature and
can run runbooks directly on the computer and against the resources in the local environment.
Hybrid Runbook Worker group - It's a group of Hybrid Runbook Workers used for high availability.
Runbook - This is a collection of one or more linked activities that together automate a process or operation.
On-premises computers and VMs - These are on-premises computers and VMs with Windows or Linux
operating systems that reside on-premises.
Azure VMs - Azure VMs include Windows or Linux VMs that are hosted in Azure.

Recommendations
The following recommendations apply for most scenarios. Follow them unless you have a specific requirement that
overrides them.
Update Management
Update Management is a configuration component of Automation. Windows and Linux computers, both in Azure
and on-premises, send assessment information about missing updates to the Log Analytics workspace. Azure
Automation then uses that information to create a schedule for automatic deployment of the missing updates.
The following steps highlight the actual implementation:
1. Create a Log Analytics workspace.
2. Create an Automation account.
3. Link the Automation account with the Log Analytics workspace.
4. Enable Update Management for Azure VMs.
5. Enable Update Management for non-Azure VMs.
Create a Log Analytics workspace
Before you create a Log Analytics workspace, ensure that you have at least Log Analytics Contributor role
permissions.
You can have more than one Log Analytics workspace for data isolation or for geographic location of data storage,
but the Log Analytics agent can be configured to report to one Log Analytics workspace. For more information,
review the Designing your Azure Monitor Logs deployment before you create the workspace.
Use the following procedure to create a Log Analytics workspace:
1. Sign in to the Azure portal at https://portal.azure.com.
2. In the Azure portal, select Create a resource .
3. In the Search the Marketplace box, enter Log Analytics . As you begin entering this text, the list filters based
on your input. Select Log Analytics workspaces .
4. Select Create , and then configure the following items:
a. Select a different Subscription in the drop-down list if the default selection isn't appropriate.
b. For the Resource Group , choose to use an existing resource group that's already set up or create a new
one.
c. Provide a unique name for the new Log Analytics workspace , such as HybridWorkspace-yourname
d. Select the Location for your deployment.
e. Select a pricing tier to proceed with further customization.
f. If you're creating a workspace in a subscription that was created after April 2, 2018, it'll automatically use
the Per GB pricing plan, and the option to select a pricing tier won't be available. If you're creating a
workspace for an existing subscription that was created before that date or for a subscription that was
tied to an existing Enterprise Agreement enrollment, select your preferred pricing tier. For more
information about the particular tiers, refer to Log Analytics Pricing details.
g. Select Tags and optionally provide a name and value for categorization of the resources.
h. Select Review + Create .
5. After providing the required information in the Log Analytics workspace pane, select Create .
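If you prefer to script this step, the following minimal sketch creates the same workspace with the Az PowerShell module; the resource group name and region are hypothetical.

```powershell
# Create the resource group and the Log Analytics workspace (names and region are hypothetical).
New-AzResourceGroup -Name "rg-hybrid-mgmt" -Location "eastus" -Force

New-AzOperationalInsightsWorkspace -ResourceGroupName "rg-hybrid-mgmt" `
    -Name "HybridWorkspace-yourname" -Location "eastus" -Sku "PerGB2018"
```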
Create an Automation account
After the Automation Hybrid Worker solution has been added to the Log Analytics workspace, proceed with
creation of the Automation account. Refer to Supported regions for linked Log Analytics workspace to select the
regions for Automation account and Log Analytics workspace. It's important that you create the Automation
account based on the region mapping document and preferably in the same resource group as the Log Analytics
workspace.
Use the following procedure to create an Automation account:
1. In the Azure portal, select Create a resource .
2. In the Search the Marketplace box, enter Automation . As you begin entering this text, the list filters based on
your input. Select Automation , and then select Create .
3. Select Create , and then configure the following items:
a. Provide the Name for the Automation account, such as hybrid-auto.
b. Select a different Subscription in the drop-down list if the default selection isn't appropriate.
c. For the Resource Group , choose the same resource group in which you want to create the automation
account.
d. Select the Location based on the region mapping document.
e. Create Azure Run As account is optional because this only provides authentication with Azure to
manage Azure resources from Automation runbooks.
4. After providing the required information in the Add Automation Account pane, select Create .
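The equivalent scripted step is shown in the following minimal sketch; the account name and region are hypothetical, and the region should be chosen according to the workspace-to-Automation region mapping referenced above.

```powershell
# Create the Automation account in the same resource group as the workspace
# (names and region are hypothetical).
New-AzAutomationAccount -ResourceGroupName "rg-hybrid-mgmt" `
    -Name "hybrid-auto" -Location "eastus2" -Plan "Basic"
```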
Link the Automation account with the Log Analytics workspace
Automation accounts use the Hybrid Runbook Worker components that deploy in the Log Analytics workspace. You
must integrate those services before you deploy a Log Analytics agent on an on-premises computer. Currently,
mappings between Log Analytics workspaces and Automation accounts are supported in several regions. For
further information, refer to Supported regions for linked Log Analytics workspace.
Use the following procedure to link an Automation account with a Log Analytics workspace:
1. In the Azure portal, select All services, and then enter automation. As you begin entering this text, the list
filters based on your input. Select Automation Account , and then select the Automation account that you
created earlier.
2. In the Automation Account pane, select Update Management in the Update Management section.
3. In the Update Management pane, configure the following items:
a. Select a different Subscription in the drop-down list if the default selection isn't appropriate.
b. For Log Analytics workspace , select your existing Log Analytics workspace; for example,
HybridWorkspace-yourname.
4. After providing the required information in the Update Management pane, select Enable .
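The link and the Update Management (Updates) solution can also be configured programmatically. The following is a minimal sketch, assuming the hypothetical names used earlier; the linked-service resource is created with the generic New-AzResource cmdlet, so verify the resource path and API version against the current documentation.

```powershell
# Names are hypothetical and match the earlier examples.
$workspace  = Get-AzOperationalInsightsWorkspace -ResourceGroupName "rg-hybrid-mgmt" -Name "HybridWorkspace-yourname"
$automation = Get-AzResource -ResourceGroupName "rg-hybrid-mgmt" `
    -ResourceType "Microsoft.Automation/automationAccounts" -Name "hybrid-auto"

# Create the linked service that ties the Automation account to the workspace.
New-AzResource -ResourceId "$($workspace.ResourceId)/linkedServices/Automation" `
    -Properties @{ resourceId = $automation.ResourceId } -Force

# Enable the Updates (Update Management) solution in the workspace.
Set-AzOperationalInsightsIntelligencePack -ResourceGroupName "rg-hybrid-mgmt" `
    -WorkspaceName "HybridWorkspace-yourname" -IntelligencePackName "Updates" -Enabled $true
```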
Enable Update Management for Azure VMs
Enable Update Management for Azure VMs by using the following tools:
Azure Resource Manager template. Microsoft provides a sample template that can automate creation of an
Azure Log Analytics workspace, creation of an Automation account, linking the Automation account to the Log
Analytics workspace, and enabling Update Management.
Update Management from the Azure portal. Use this method when you want to update multiple VMs that reside
in different regions.
Update Management from an Azure VM. This will configure updates for a selected VM.
Update Management from an Automation account. Use this method when you want to update both Azure and
non-Azure computers and VMs at the same time.
Update Management from a runbook. Use this method to enable Update Management as an automated
procedure combined with other automation activities.
Use the following procedure to enable Update Management for Azure VMs:
1. In the Azure portal, select All services, and then enter automation. As you begin entering this text, the list
filters based on your input. Select Automation Account , and then select the Automation account that you
created earlier.
2. In the Automation Account pane, select Update Management in the Update Management section.
3. In the Update Management pane, select Add Azure VMs , select one or more VMs that are ready for Update
Management, and then select Enable .
Deploy the Log Analytics agent and connect to a Log Analytics workspace
Deploying a Hybrid Runbook Worker component is part of the deployment of a Log Analytics agent.
If you test the solution by using an Azure VM, you can install the Log Analytics agent and enroll the VM in an
existing Log Analytics workspace by using a VM extension for both Linux and Windows. You can also deploy the
agent by using Azure Automation Desired State Configuration, a Windows PowerShell script, or by using a Resource
Manager template for VMs. For more information, refer to Connect Windows computers to Azure Monitor.
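For example, the following minimal sketch installs the Log Analytics agent on an existing Azure Windows VM by using the MicrosoftMonitoringAgent VM extension; the workspace, resource group, and VM names are hypothetical.

```powershell
# Names are hypothetical; the extension enrolls the VM in the workspace created earlier.
$workspace = Get-AzOperationalInsightsWorkspace -ResourceGroupName "rg-hybrid-mgmt" -Name "HybridWorkspace-yourname"
$keys      = Get-AzOperationalInsightsWorkspaceSharedKey -ResourceGroupName "rg-hybrid-mgmt" -Name "HybridWorkspace-yourname"
$vm        = Get-AzVM -ResourceGroupName "rg-workloads" -Name "vm-app01"

Set-AzVMExtension -ResourceGroupName $vm.ResourceGroupName -VMName $vm.Name -Location $vm.Location `
    -Name "MicrosoftMonitoringAgent" `
    -Publisher "Microsoft.EnterpriseCloud.Monitoring" `
    -ExtensionType "MicrosoftMonitoringAgent" `
    -TypeHandlerVersion "1.0" `
    -Settings @{ workspaceId = $workspace.CustomerId.ToString() } `
    -ProtectedSettings @{ workspaceKey = $keys.PrimarySharedKey }
```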
For non-Azure VMs, deploy the agent by using a manual or automated process both on physical Windows and
Linux computers or VMs that are in your environment.
For Windows computers, configure the agent to communicate with the Log Analytics service by using the Transport
Layer Security (TLS) 1.2 protocol. Refer to Connect Windows computers to Azure Monitor for a detailed explanation
of the deployment procedure.
The Log Analytics agent for Linux can be deployed:
Manually by using a shell script bundle that contains Debian and Red Hat Package Manager (RPM) packages for
each of the agent components. This is recommended when a Linux computer doesn't have internet connectivity
and will communicate with the Log Analytics service through the Log Analytics gateway.
By using a wrapper-script that's hosted on GitHub when the computer has connectivity to the internet.
The Log Analytics agent must be configured to communicate with a Log Analytics workspace by using the
workspace ID and key of the Log Analytics workspace.
Use the following procedure to deploy a Log Analytics agent and connect to a Log Analytics workspace:
1. In the Azure portal, search for and select Log Analytics workspaces .
2. In your list of Log Analytics workspaces, select the workspace that the agent uses for reporting.
3. Select Agents management .
4. Copy and paste the Workspace ID and Primary Key into your favorite editor.
5. In your Log Analytics workspace, from the Windows Servers page that you browsed to earlier, select the
appropriate Download Windows Agent version to download based on the processor architecture of the
Windows operating system.
6. Run Setup to install the agent on your computer.
7. On the Welcome page, select Next .
8. On the License Terms page, read the license, and then select I Agree .
9. On the Destination Folder page, change or keep the default installation folder, and then select Next .
10. On the Agent Setup Options page, choose to connect the agent to Azure Log Analytics, and then select Next .
11. On the Azure Log Analytics page, perform the following steps:
a. Paste the Workspace ID and Workspace Key (Primary Key) that you copied earlier. If the computer
reports to a Log Analytics workspace in a Microsoft Azure Government cloud, select Azure US
Government in the Azure Cloud drop-down list.
b. If the computer needs to communicate through a proxy server to the Log Analytics service, select
Advanced , and then provide the URL and port number of the proxy server. If your proxy server requires
authentication, enter the username and password to authenticate with the proxy server, and then select
Next .
12. Select Next after you finish providing the necessary configuration settings.
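For unattended deployments, the wizard steps above can be replaced with a silent, command-line installation. The following minimal sketch reflects the setup properties documented for the Windows Log Analytics agent at the time of writing; verify the property names against the current documentation, and replace the placeholder workspace ID and key.

```powershell
# Placeholder values; replace with your workspace ID and primary key.
$workspaceId  = "<workspace-id>"
$workspaceKey = "<workspace-primary-key>"

# Extract the downloaded agent package, then run a silent install that connects the
# agent to the workspace (property names as documented at the time of writing).
& ".\MMASetup-AMD64.exe" /c /t:"C:\Temp\MMA"
& "C:\Temp\MMA\setup.exe" /qn NOAPM=1 `
    ADD_OPINSIGHTS_WORKSPACE=1 `
    OPINSIGHTS_WORKSPACE_AZURE_CLOUD_TYPE=0 `
    OPINSIGHTS_WORKSPACE_ID="$workspaceId" `
    OPINSIGHTS_WORKSPACE_KEY="$workspaceKey" `
    AcceptEndUserLicenseAgreement=1
```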
Enable Update Management for non-Azure computers
Enabling Update Management on non-Azure computers has the following prerequisites:
Deploy the Log Analytics agent and connect to a Log Analytics workspace.
Previous procedures explain how to configure those prerequisites.
After installing the Log Analytics agent on an on-premises computer, enable Update Management in the Azure
portal by using the following procedure:
1. In the Azure portal, select All services, and then enter automation. As you begin entering this text, the list
filters based on your input. Select Automation Account , and then select the Automation account that you
created earlier.
2. In the Automation Account pane, select Update Management in the Update Management section.
3. In the Update Management pane, select Manage machines , and then select computers that are listed and
have been configured to send log data to the Log Analytics workspace.
4. Select Enable to finish the configuration of Update Management on non-Azure machines.
Each Windows computer managed by Update Management is listed in the Hybrid Worker groups pane as a
System Hybrid Worker group for the Automation account. Use these groups only for deploying updates, not for
targeting the groups with runbooks for automated tasks.

Manageability considerations
Manage updates for Azure VMs and non-Azure machines
Update assessment for all missing updates that both Azure VMs and non-Azure computers require is visible in the
Update Management section of your Automation account.
Schedule an update deployment by using the Azure portal or by using PowerShell, which creates schedule assets
that are linked to the Patch-MicrosoftOMSComputers runbook.
Use the following procedure to schedule a new update deployment:
1. In your Automation account, go to Update management under Update management , and then select
Schedule update deployment .
2. Under New update deployment , use the Name box to enter a unique name for your deployment.
3. Select the operating system to target for the update deployment.
4. In the Groups to update pane, define a query that combines subscription, resource groups, locations, and
tags to build a dynamic group of Azure VMs to include in your deployment. To learn more, refer to Use
dynamic groups with Update Management.
5. In the Machines to update pane, select a saved search, an imported group, or pick Machines from the
drop-down menu, and then select individual machines.
6. Use the Update classifications drop-down menu to specify update classifications for products.
7. Use the Include/exclude updates pane to select specific updates for deployment.
8. Select Schedule settings to define a time when the update deployment will run on computers.
9. Use the Recurrence box to specify if the deployment occurs once or uses a recurring schedule, and then
select OK .
10. In the Pre-scripts + Post-scripts (Preview) region, select the scripts to run before and after your
deployment. To learn more, refer to Manage pre-scripts and post-scripts.
11. Use the Maintenance window (minutes) box to specify the amount of time that's allowed for updates to
install.
12. Use the Reboot options box to specify the way to handle reboots during deployment.
13. When you finish configuring the deployment schedule, select Create .
Results of a completed update deployment are visible in the Update Management pane on the History tab.
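The PowerShell alternative mentioned above can look like the following minimal sketch, which creates a recurring schedule and a software update configuration for one Azure VM; the names, times, classifications, and resource IDs are hypothetical.

```powershell
# Hypothetical resource group and Automation account names from the earlier examples.
$params = @{
    ResourceGroupName     = "rg-hybrid-mgmt"
    AutomationAccountName = "hybrid-auto"
}

# A daily schedule flagged for use with software update configurations.
$schedule = New-AzAutomationSchedule @params -Name "nightly-patching" `
    -StartTime (Get-Date "22:00").AddDays(1) -DayInterval 1 -ForUpdateConfiguration

# Target one or more Azure VMs by resource ID and allow a two-hour maintenance window.
$vmIds = @(
    "/subscriptions/<subscription-id>/resourceGroups/rg-workloads/providers/Microsoft.Compute/virtualMachines/vm-app01"
)
New-AzAutomationSoftwareUpdateConfiguration @params -Schedule $schedule -Windows `
    -AzureVMResourceId $vmIds `
    -IncludedUpdateClassification Critical, Security `
    -Duration (New-TimeSpan -Hours 2)
```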
Configure Windows Update settings
Azure Update Management depends on Windows Update Client to download and install updates either from
Windows Update (default setting) or from Windows Server Update Server. Configure Windows Update Client
settings to connect to Windows Server Update Services (WSUS) by using:
Local Group Policy Editor
Group Policy
PowerShell
Directly editing the registry
For more information, refer to how to configure Windows Update settings.
Integrate Update Management with Microsoft Endpoint Configuration Manager
The Software Update Management cycle can integrate with Microsoft Endpoint Configuration Manager for
customers that are already using this product to manage PCs, servers, and mobile devices.
To integrate Software Update Management with Endpoint Configuration Manager, first integrate Endpoint
Configuration Manager with Azure Monitor logs and import the collections in the Log Analytics workspace.
For details, refer to Connect Configuration Manager to Azure Monitor.
To manage updates on local computers, configure them with:
The Endpoint Configuration Manager client.
The Log Analytics agent, which is configured to report to a Log Analytics workspace that's enabled for Update
Management.
Windows agents that are configured to communicate with WSUS or have access to Microsoft Update.
To manage updates on computers with Endpoint Configuration Manager, deploy the following roles on the Endpoint
Configuration Manager computer:
Management point. This site system role manages clients with a policy that contains configuration settings and
service location information.
Distribution point. This contains source files for clients.
Software update point. This is a role on the server that's hosting WSUS.
Manage software updates by using:
Endpoint Configuration Manager
Azure Automation
Partner updates on Windows machines can be deployed from a custom repository that System Center Updates
Publisher (SCUP) provides. SCUP can import custom updates either in standalone WSUS or integrated with
Endpoint Configuration Manager.
For more information, refer to Integrate Update Management with Windows Endpoint Configuration Manager.
Deploy the Log Analytics agent by using a PowerShell script
To accelerate deployment of the Log Analytics agent with the Hybrid Worker role running on a Windows computer,
use the New-OnPremiseHybridWorker.ps1 PowerShell script. The script:
Installs the necessary modules.
Signs in with your Azure account.
Verifies the existence of a specified resource group and Automation account.
Creates references to Automation account attributes.
Creates an Azure Monitor Log Analytics workspace if not specified.
Enables the Automation solution in the workspace.
Downloads and installs the Log Analytics agent for the Windows operating system.
Registers the machine as a Hybrid Runbook Worker.
Deploying many agents in an on-premises infrastructure can be orchestrated by using command-line scripts and by
using Group Policy or Endpoint Configuration Manager.
Use dynamic groups for Azure and non-Azure machines
Dynamic groups for Azure VMs filter VMs based on a combination of:
Subscriptions
Resource groups
Locations
Tags
Dynamic groups for non-Azure computers use saved searches to filter the computers for deployment of the update.
Saved searches, also known as computer groups, can be created by using:
A log query. Use a log query written in the Kusto query language (shared with Azure Data Explorer) to define a logical expression that filters the computers.
Active Directory Domain Services. A group is created in Log Analytics workspace for any members of an Active
Directory domain.
Endpoint Configuration Manager. Import computer collections from Endpoint Configuration Manager into a Log
Analytics workspace.
WSUS. Groups that are created in WSUS servers can be imported into a Log Analytics workspace.
For more information on how to create computer groups for filtering machines for update deployment, refer to
Computer groups in Azure Monitor log queries.

Scalability considerations
Azure Automation can process up to 1,000 computers per update deployment. If you expect to update more than
1,000 computers, you can split up the updates among multiple update schedules. Refer to Azure subscription and
service limits, quotas, and constraints.

Availability considerations
Currently, mappings between Log Analytics Workspace and Automation Account are supported in several
regions. For further information, refer to Supported regions for linked Log Analytics workspace.
Supported client types: Update assessment and patching is supported on Windows and Linux computers that
run in Azure or in your on-premises environment. Currently, the Windows client isn't officially supported. For a
list of the supported clients, refer to Supported client types.

Security considerations
Update Management permissions: The Update Management component of Automation and the Log Analytics
workspace component of Monitor can use role-based access control (RBAC) with built-in roles from Azure
Resource Manager. For segregation of duties, these roles can be assigned to different users, groups, and
security principals. For a list of the roles in Automation accounts, refer to Manage role permissions and security.
Encryption of sensitive assets in Automation: An Automation account can contain sensitive assets such as
credentials, certificates, and encrypted variables that runbooks might use. Each secure asset is encrypted by
default using a data encryption key that's generated for each Automation account. These keys are encrypted and
stored in Automation with an account encryption key that can be stored in the Azure Key Vault for customers
who want to manage encryption with their own keys. By default, an account encryption key is encrypted by
using Microsoft-managed keys. Use the following guidelines to apply encryption of secure assets in Azure
Automation.
Runbook permissions for a Hybrid Runbook Worker: By default, runbook permissions for a Hybrid Runbook
Worker run in a system context on the machine where they're deployed. A runbook provides its own
authentication to local resources. Authentication can be configured using managed identities for Azure resources
or by specifying a Run As account to provide a user context for all runbooks.
Network planning: Hybrid Runbook Worker requires outbound internet access over TCP port 443 to
communicate with Automation. For computers with restricted internet access, you can use the Log Analytics
gateway to configure communication with Automation and an Azure Log Analytics workspace.
Azure Security baseline for Automation: Azure security baseline for Automation contains recommendations
about how to increase overall security to protect your assets following best practice guidance.

DevOps considerations
You can schedule update deployment programmatically through the REST API. For more information, refer to
Software Update Configurations - Create.
Azure Automation allows integration with popular source control systems like Azure DevOps and GitHub. With
source control, you can integrate an existing development environment that contains your scripts and custom
code that has been previously tested in an isolated environment.
For more information about how to integrate Automation with your source control environment, refer to Use
source control integration.

Cost considerations
Use the Azure pricing calculator to estimate costs. For more information about Automation pricing models, refer
to Automation pricing.
Azure Automation costs are priced for job execution per minute or for configuration management per node.
Every month, the first 500 minutes of process automation and configuration management on five nodes are
free.
An Azure Log Analytics workspace might generate more costs related to the amount of log data that's stored in
Azure Log Analytics. The pricing is based on consumption, and the costs are associated with data ingestion and
data retention. For ingesting data into Azure Log Analytics, use the capacity reservation or pay-as-you-go model
that includes 5 gigabytes (GB) free a month for each billing account. Data retention for the first 31 days is free of
charge.
Use the Azure pricing calculator to estimate costs. For more information about Log Analytics pricing models,
refer to Azure Monitor pricing.
Deploy AI and machine learning at the edge by using
Azure Stack Edge
11/2/2020 • 5 minutes to read

This reference architecture illustrates how to use Microsoft Azure Stack Edge to extend the ability to perform rapid
machine learning inferencing from the cloud to on-premises or edge scenarios.

Download a Visio file of this architecture.


Typical uses for this architecture include:
Performing local, rapid machine learning inference against data as it’s ingested by organizations with a
significant on-premises hardware footprint.
Creating long-term research solutions where existing on-premises data is cleaned and used to generate a
model. The model is then used both on-premises and in the cloud; it's retrained regularly as new data arrives.
Building software applications that need to make inferences about users, both at a physical location and online.

Architecture
The architecture consists of the following components:
Azure Machine Learning . Machine Learning is a technology that lets you build, train, deploy, and manage
machine learning models in a cloud-based environment. These models can then deploy to various Azure
services, including (but not limited to) Azure Container Instances, Azure Kubernetes Service (AKS), and Azure
Functions.
Azure Container Registry. Container Registry is a service that creates and manages the Docker Registry.
Container Registry builds, stores, and manages Docker container images and can store containerized machine
learning models.
Azure Stack Edge . Azure Stack Edge is an edge computing device that's designed to perform machine learning
inferencing at the edge and preprocess data before transferring it to Azure. Azure Stack Edge includes compute
acceleration hardware that's designed to improve performance of any AI inferencing that occurs at the edge.
Local data . Local data references any data that's used in the training of the machine learning model. The data
can reside in any local storage solution, including Azure Arc deployments.

Recommendations
Ingesting, transforming, and transferring data stored locally
Azure Stack Edge can transform data sourced from local storage before transferring that data to Azure. This
transformation is performed by using an Azure IoT Edge device that's deployed on the Azure Stack Edge device.
These IoT Edge devices are associated with an Azure IoT Hub resource on the Azure cloud platform.
Each IoT Edge module is a Docker container that performs a specific task in an ingest, transform, and transfer
workflow. For example, an IoT Edge module can collect data from an Azure Stack Edge local share, transform the
data into a format that's ready for machine learning, and then transfer the transformed data to an Azure Stack Edge
cloud share. You can add custom or built-in modules to your IoT Edge device; you can also develop custom IoT
Edge modules to use in Azure Stack Edge.

NOTE
IoT Edge modules are registered as Docker container images in Container Registry.

In the Azure Stack Edge resource on the Azure cloud platform, the cloud share is backed by an Azure Blob storage
account resource. All data in the cloud share will automatically upload to the associated storage account. You can
verify the data transformation and transfer by either mounting the local or cloud share, or by traversing the Azure
Storage account.
Training and deploying a model
After preparing and storing data in Blob storage, you can create a Machine Learning dataset that connects to Azure
Storage. Creating a dataset ensures that you only keep a single copy of your data in storage that's directly
referenced by Machine Learning.
You can use the Machine Learning command-line interface (CLI), the R SDK, the Python SDK, designer, or Visual
Studio Code to build the scripts that are required to train your model.
After training and readying the model to deploy, you can deploy it to a variety of Azure services, including but not
limited to:
Azure Container Registry. You can deploy the models to a private Docker registry such as Azure Container
Registry, since they are Docker container images.
Azure Container Instances . You can deploy the model's Docker container image directly to a container
group.
Azure Kubernetes Service. You can use Azure Kubernetes Service (AKS) to automatically scale the model's
Docker container image for high-scale production deployments.
Azure Functions . You can package a model to run directly on a Functions instance.
Azure Machine Learning. You can use compute instances (managed, cloud-based development workstations)
for both training and inferencing of models. You can also similarly deploy the model to on-premises IoT Edge
and Azure Stack Edge devices.
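As a hedged illustration only, the following sketch uses the legacy Azure CLI machine learning extension (azure-cli-ml) to register a trained model and deploy it to Azure Container Instances. The workspace, resource group, model path, and configuration files are hypothetical, and the exact commands depend on the SDK and CLI versions you use.

```azurecli
# Assumes the (v1) azure-cli-ml extension is installed and a Machine Learning
# workspace named mlw-edge already exists in resource group rg-ml (hypothetical names).
az ml model register \
  --name edge-model \
  --model-path ./outputs/model.pkl \
  --workspace-name mlw-edge \
  --resource-group rg-ml

# Deploy the registered model as a web service on Azure Container Instances,
# using hypothetical inference and deployment configuration files.
az ml model deploy \
  --name edge-model-svc \
  --model edge-model:1 \
  --inference-config-file inferenceconfig.json \
  --deploy-config-file deploymentconfig.json \
  --workspace-name mlw-edge \
  --resource-group rg-ml
```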

NOTE
For this reference architecture, the model deploys to Azure Stack Edge to make the model available for inferencing on-
premises. The model also deploys to Container Registry to ensure that the model is available for inferencing across the
widest variety of Azure services.

Inferencing the newly deployed model


Azure Stack Edge can run machine learning models locally against data in your on-premises location quickly by
using its built-in compute acceleration hardware. This computation is performed entirely at the edge and can be a
way to get rapid insights from data by using hardware that's closer to the data source than a public cloud region.
Additionally, data can continue to transfer from Azure Stack Edge to Machine Learning for continuous retraining
and improvement by using a machine learning pipeline that's associated with the model that's already running
against data stored locally.

Availability considerations
Consider placing your Azure Stack Edge resource in the same Azure region where you will want to access it
from other Azure services. To optimize upload performance, consider placing your Azure Blob storage account
in the region where your appliance has the best network connection.
Consider Azure ExpressRoute for a stable, redundant connection between your device and Azure.

Manageability considerations
As an administrator, you can verify that the data source from local storage has transferred to the Azure Stack
Edge resource correctly by mounting the Server Message Block (SMB)/Network File System (NFS) file share or
connecting to the associated Blob storage account by using Azure Storage Explorer.
Use Machine Learning datasets to reference your data in Blob storage while training your model. This
eliminates the need to embed secrets, data paths, or connection strings in your training scripts.
In your Azure Machine Learning workspace, register version models so that you can easily keep track of the
differences between your models at different points in time. You can similarly mirror the versioning and
tracking metadata in the tags that you use for the Docker container images that deploy to Container Registry.

DevOps considerations
Review the MLOps lifecycle management approach for Machine Learning. For example, you can use GitHub or
Azure Pipelines to create a continuous integration process to automatically train and retrain a model either
when new data populates the dataset or a change is made to the training scripts.
The Azure Machine Learning workspace will automatically register and manage Docker container images for
machine learning models and IoT Edge modules.

Cost considerations
Use the Azure pricing calculator to estimate costs.
Azure Stack Edge pricing is calculated as a flat-rate monthly subscription with a one-time shipping fee.
Azure Machine Learning also deploys Container Registry, Blob storage, and Azure Key Vault services, which
incur additional costs.
Azure Machine Learning pricing includes charges for the virtual machines that are used for training the model
in the public cloud.
Design a hybrid Domain Name System solution with
Azure

This reference architecture illustrates how to design a hybrid Domain Name System (DNS) solution to resolve
names for workloads that are hosted on-premises and in Microsoft Azure. This architecture works for users and
other systems that are connecting from on-premises and the public internet.

Download a Visio file of this architecture.

Architecture
The architecture consists of the following components.
On-premises network . The on-premises network represents a single datacenter that's connected to Azure
over an Azure ExpressRoute or virtual private network (VPN) connection. In this scenario, the following
components make up the on-premises network:
DNS servers. These servers represent two servers with DNS service installed that are acting as
resolver/forwarder. These DNS servers are used for all computers in the on-premises network as DNS
servers. Records must be created on these servers for all endpoints in Azure and on-premises.
Gateway . The gateway represents either a VPN device or an ExpressRoute connection that's used to
connect to Azure.
Hub subscription . The hub subscription represents an Azure subscription that's used to host connectivity,
management, and networking resources that are shared across multiple Azure-hosted workloads. These
resources can be broken down into multiple subscriptions, as described in the enterprise-scale architecture.
NOTE
The hub virtual network can be substituted with a virtual wide area network (WAN) hub, in which case the DNS
servers would have to be hosted in a different Azure virtual network (VNet). In the Enterprise-scale architecture, that
VNet is maintained in its own subscription entitled the Identity subscription .

Azure Bastion subnet . The Azure Bastion service in the hub virtual network is used for remoting to virtual
machines (VMs) in the hub and spoke VNets from the public internet for maintenance purposes.
Private endpoint subnet . The private endpoint subnet hosts private endpoints for Azure-hosted
workloads in VNets that aren't peered to the hub. One advantage of this type of disconnected VNet is that
its IP address space can overlap with other IP addresses used in Azure and on-premises, because the VNet
isn't peered to them.
Gateway subnet . The gateway subnet hosts the Azure VPN, or ExpressRoute, gateway that's used to
provide connectivity back to the on-premises datacenter.
Shared services subnet . The shared services subnet hosts services that are shared among multiple
Azure workloads. In this scenario, this subnet hosts virtual machines running Windows or Linux that are
also used as DNS servers. These DNS servers host the same DNS zones as the on-premises servers.
Connected subscription . The connected subscription represents a collection of workloads that require a
virtual network and connectivity back to the on-premises network.
VNet peering . This is a peering connection back to the hub VNet. This connection allows connectivity
from the on-premises network to the spoke and back through the hub VNet.
Default subnet . The default subnet contains a sample workload.
web-vmss . This sample virtual machine scale set hosts a workload in Azure that can be accessed
from on-premises, Azure, and the public internet.
Load balancer . The load balancer provides access to a workload that a series of VMs host. The IP
address of this load balancer in the default subnet must be used to access the workload from
Azure and from the on-premises datacenter.
AppGateway subnet . This is the required subnet for the Azure Application Gateway service.
AppGateway . Application Gateway provides access to the sample workload in the default subnet
to users from the public internet.
wkld1-pip . This is the public IP address that's used to access the sample workload from the public
internet.
Disconnected subscription . The disconnected subscription represents a collection of workloads that don't
require connectivity back to the on-premises datacenter and that use the private link service.
PLSSubnet . The private link service subnet (PLSSubnet) contains one or more private link service
resources that provide connectivity to workloads hosted in the Connected subscription .
Default subnet . The default subnet contains a sample workload.
web-vmss . This sample virtual machine scale set hosts a workload in Azure that can be accessed
from on-premises, Azure, and the public internet.
Load balancer . The load balancer provides access to a workload that a series of VMs host. This
load balancer is connected to the private link service to provide access for users that are coming
from Azure and the on-premises datacenter.
AppGateway subnet . This is the required subnet for the Application Gateway service.
AppGateway . Application Gateway provides access to the sample workload in the default subnet
to users from the public internet.
wkld2-pip . This is the public IP address that's used to access the sample workload from the public
internet.
Azure Bastion subnet . The Azure Bastion service in the disconnected virtual network is used for
remoting to VMs in the hub and spoke VNets from the public internet for maintenance purposes.
Recommendations
The following recommendations apply for most scenarios. Follow these recommendations unless you have a
specific requirement that overrides them.

NOTE
For the following recommendations, we'll refer to Workload 1 as a connected workload and Workload 2 as a disconnected
workload. We'll also refer to users and systems that access those workloads as on-premises users , internet users , and
Azure systems .

Extend AD DS to Azure (optional)


Use integrated DNS zones in AD DS to host DNS records for your on-premises datacenter and Azure. In this
scenario, there are two sets of AD DS DNS servers: one on-premises and one in the hub VNet.
We recommend extending your AD DS domain to Azure. We also recommend configuring the hub and spoke VNets
to use the AD DS DNS servers in the hub VNet for all the VMs in Azure.

NOTE
This recommendation is applicable only for organizations that use Active Directory Integrated DNS zone for name resolution.
Others can consider implementing DNS servers that act as resolver/forwarder.

Configure split-brain DNS


You must ensure that split-brain DNS is in place to allow Azure systems, on-premises users, and internet users to
access workloads based on where they're connecting from.
For both connected and disconnected workloads, we recommend the following for DNS resolution:
Azure DNS zones for internet users.
DNS servers for on-premises users and Azure systems.
Azure private DNS zones for resolution between Azure VNets.
To better understand this split-brain recommendation, consider Workload 1 , for which we'll use the
wkld1.contoso.com fully qualified domain name (FQDN).
In this scenario, internet users must resolve that name to the public IP address that Application Gateway exposes
through Wkld1-pip . This is done by creating an address record (A record) in Azure DNS for the connected
subscription.
On-premises users must resolve the same name to the internal IP address for the load balancer in the connected
subscription. This is done by creating an A record in the DNS servers in the hub subscription.
Azure systems can resolve the same name to an internal IP address for the load balancer in the connected
subscription either by creating an A record in the DNS server in the hub subscription, or by using private DNS
zones. When you're using private DNS zones, either manually create an A record in the private DNS zone or enable
autoregistration.
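To make the split-brain setup concrete, the following Azure CLI sketch creates the two A records described above. The zone names, resource groups, and IP addresses are hypothetical; in the hub DNS servers (Windows Server DNS), you would create the corresponding record with your usual DNS management tools instead.

```azurecli
# Internet users: resolve wkld1.contoso.com to the public IP exposed by Application Gateway (hypothetical address).
az network dns record-set a add-record \
  --resource-group rg-connected \
  --zone-name contoso.com \
  --record-set-name wkld1 \
  --ipv4-address 203.0.113.10

# Azure systems: resolve the same name to the internal load balancer IP through an Azure private DNS zone.
az network private-dns record-set a add-record \
  --resource-group rg-connected \
  --zone-name contoso.com \
  --record-set-name wkld1 \
  --ipv4-address 10.1.2.4
```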
Use private DNS zones for a private link
In our scenario, Workload 2 is hosted in a disconnected subscription, and access to this workload for on-premises
users and connected Azure systems is possible through a private endpoint in the hub VNet. However, there's a third
connection possibility for this workload: Azure systems in the same VNet as Workload 2 .
To better understand the DNS recommendations for Workload 2 , we'll use the wkld2.contoso.com FQDN and
discuss the individual recommendations.
In this scenario, internet users must resolve that name to the public IP address that Application Gateway exposes
through Wkld2-pip . This is done by creating an A record in Azure DNS for the connected subscription.
On-premises users and Azure systems that are connected to the hub VNet and spoke VNets must resolve the same
name to the internal IP address for the private endpoint in the Hub VNet. This is done by creating an A record in the
DNS servers in the hub subscription.
Azure systems in the same VNet as Workload 2 must resolve the name to the IP address of the load balancer in
the disconnected subscription. This is done by using a private DNS zone in Azure DNS in that subscription.
Azure systems in different VNets can still resolve the IP address of Workload 2 if you link those VNets to the
private DNS zone that hosts the A record for Workload 2 .
Enable autoregistration
When you configure a VNet link with a private DNS zone, you can optionally enable autoregistration for all the
virtual machines in the linked VNet, as shown in the sketch that follows.
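As a minimal sketch with hypothetical resource names, the following Azure CLI command links a spoke VNet to a private DNS zone with autoregistration enabled:

```azurecli
# VM DNS records in spoke1-vnet are then created and updated automatically in the contoso.com private zone.
az network private-dns link vnet create \
  --resource-group rg-hub \
  --zone-name contoso.com \
  --name link-spoke1 \
  --virtual-network spoke1-vnet \
  --registration-enabled true
```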

NOTE
Autoregistration works only for virtual machines. For all other resources that are configured with IP address from the VNet,
you have to create DNS records manually in the private DNS zone.

If you're using AD DS DNS servers, Windows VMs and computers can use dynamic updates to keep their own
DNS records up to date in the AD DS DNS servers. We recommend enabling dynamic updates and configuring
the DNS servers to allow only secure updates.
Linux VMs don't support secure dynamic updates. For on-premises Linux computers, use Dynamic Host
Configuration Protocol (DHCP) to register DNS records with the AD DS DNS servers.
For Linux VMs in Azure, use an automated process.

Scalability considerations
Consider using at least two DNS servers per Azure region and per on-premises datacenter.
Notice how that's done in the preceding scenario, with DNS servers both on-premises and in the hub virtual network.

Availability considerations
Consider placement of DNS servers. As described in the scalability considerations section, DNS servers should
be placed close to the users and systems that need access to them.
Per Azure region. Each Azure region has its own hub VNet or vWAN hub. This is where your DNS servers
must be deployed.
Per on-premises datacenter. You should also have a pair of DNS servers per on-premises datacenter for
users and systems in those locations.
For isolated (disconnected) workloads, host a private DNS zone and a public DNS zone for each
subscription to manage split-brain DNS records.

Manageability considerations
Consider the need for DNS records for platform as a service (PaaS) services.
You also must consider DNS resolution for PaaS services that use a private endpoint. Use a private DNS zone for
that and use your DevOps pipeline to create records in the DNS servers.

Security considerations
If you require DNSSEC, note that Azure DNS currently doesn't support it.
For DNSSEC validation, deploy a custom DNS server and enable DNSSEC validation.

DevOps considerations
Automate the configuration of this architecture by using Azure Resource Manager templates to deploy and
configure all of the resources. Both private and public DNS zones support full management from the Azure CLI,
PowerShell, .NET, and the REST API.
If you're using a continuous integration and continuous development (CI/CD) pipeline to deploy and maintain
workloads in Azure and on-premises, you can also configure autoregistration of DNS records.

Cost considerations
Azure DNS zone costs are based on the number of DNS zones hosted in Azure and the number of DNS queries
that are received.
Use the Azure pricing calculator to estimate costs. Pricing models for Azure DNS are explained here.
The billing model for Azure ExpressRoute is based either on metered data (you're charged per gigabyte for
outbound data transfer) or on unlimited data (a monthly fee that includes all data transfer).
If you're using VPN instead of ExpressRoute, the cost depends on the SKU of the virtual network gateway
and is charged per hour.
Hybrid file services

This reference architecture illustrates how to use Azure File Sync and Azure Files to extend file services hosting
capabilities across cloud and on-premises file share resources.

Download a Visio file of this architecture.


Typical uses for this architecture include:
Hosting file shares that need to be accessible from cloud and on-premises environments.
Synchronizing data between multiple on-premises data stores with a single cloud-based source.

Architecture
The architecture consists of the following components:
Storage Account . An Azure Storage Account used to host Microsoft Azure file shares.
Azure Files . A serverless cloud file share that provides the cloud endpoint of a sync relationship with Azure File
Sync. Files in an Azure file share can be accessed directly with Server Message Block (SMB) or the FileREST
protocol.
Sync groups . Logical groupings of Azure file shares and servers running Windows Server. Sync groups are
deployed into a Storage Sync Service, which registers servers for use with Azure File Sync and contains the sync
group relationships.
Azure File Sync agent . This is installed on Windows Server machines to enable and configure sync with cloud
endpoints.
Windows Servers . On-premises or cloud-based Windows Server machines that host a file share that syncs
with an Azure file share.
Azure Active Directory . The Azure Active Directory (Azure AD) tenant used for identity synchronization across
Azure and on-premises environments.

Recommendations
The following recommendations apply for most scenarios. Follow these recommendations unless you have a
specific requirement that overrides them.
Azure Files usage and deployment
You store your files in the cloud in Azure file shares. You can use Azure file shares in two ways: by directly mounting
these serverless Azure file shares (SMB) or by caching Azure file shares on-premises using Azure File Sync. The
deployment option you choose changes what you need to consider as you plan for your deployment:
Direct mount of an Azure file share. Because Azure Files provides SMB access, you can mount Azure file shares
on-premises or in the cloud using the standard SMB client available in the Windows, macOS, and Linux
operating systems. Azure file shares are serverless, so deploying for production scenarios doesn't require
managing a file server or network-attached storage (NAS) device. This means you don't have to apply software
patches or swap out physical disks.
Cache Azure file share on-premises with Azure File Sync. Azure File Sync enables you to centralize your
organization's file shares in Azure Files, while keeping the flexibility, performance, and compatibility of an on-
premises file server. Azure File Sync transforms an on-premises (or cloud) Windows Server into a quick cache of
your Azure file share.
Deploy the Storage Sync Service
Begin Azure File Sync deployment by deploying a Storage Sync Service resource into a resource group of your
selected subscription. We recommend provisioning as few Storage Sync Service objects as possible. You'll create
a trust relationship between your servers and this resource, and a server can be registered to only one Storage Sync
Service. As a result, deploy only as many Storage Sync Services as you need to keep distinct groups of servers
separate. Keep in mind that servers from different Storage Sync Services can't sync with each other.
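As a hedged example with hypothetical names, and assuming the storagesync Azure CLI extension is available, a Storage Sync Service can be created like this; the equivalent Az.StorageSync PowerShell cmdlets work as well:

```azurecli
az extension add --name storagesync

# Create the Storage Sync Service that servers will register with.
az storagesync create \
  --name contoso-sync-service \
  --resource-group rg-files \
  --location eastus
```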
Registering Windows Server machines with the Azure File Sync agent
To enable the sync capability on Windows Server, you must install the Azure File Sync downloadable agent. The
Azure File Sync agent provides two main components:
FileSyncSvc.exe . The background Windows service that is responsible for monitoring changes on the server
endpoints and initiating sync sessions.
StorageSync.sys . A file system filter that enables cloud tiering and faster disaster recovery.

You can download the agent from the Azure File Sync Agent Download page at the Microsoft Download Center.
Operating system requirements
Azure File Sync is supported by the Windows Server versions in the following table.

| Version | Supported SKUs | Supported deployment options |
|---|---|---|
| Windows Server 2019 | Datacenter, Standard, and IoT | Full and Core |
| Windows Server 2016 | Datacenter, Standard, and Storage Server | Full and Core |
| Windows Server 2012 R2 | Datacenter, Standard, and Storage Server | Full and Core |

For more information, refer to Windows file server considerations.


Configuring sync groups and cloud endpoints
A sync group defines the sync topology for a set of files. Endpoints within a sync group are kept in sync with each
other. A sync group must contain one cloud endpoint, which represents an Azure file share, and one or more server
endpoints. A server endpoint represents a path on a registered server. A server can have server endpoints in
multiple sync groups. You can create as many sync groups as you need to appropriately describe your desired sync
topology.
A cloud endpoint is a pointer to an Azure file share. All server endpoints will sync with a cloud endpoint, making the
cloud endpoint the hub. The storage account for the Azure file share must be located in the same region as the
Storage Sync Service. The entirety of the Azure file share will be synced, with one exception: a special folder—
comparable to the hidden System Volume Information folder on an NT file system (NTFS) volume—will be
provisioned. This directory is called .SystemShareInformation , and it contains important sync metadata that will
not sync to other endpoints.
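The following sketch (hypothetical names, storagesync CLI extension assumed) creates a sync group and its cloud endpoint that points to an existing Azure file share; the exact parameter names may differ between extension versions.

```azurecli
# Create a sync group in the Storage Sync Service.
az storagesync sync-group create \
  --resource-group rg-files \
  --storage-sync-service contoso-sync-service \
  --name corp-share-sync

# Add the cloud endpoint: an Azure file share in a storage account in the same
# region as the Storage Sync Service.
az storagesync sync-group cloud-endpoint create \
  --resource-group rg-files \
  --storage-sync-service contoso-sync-service \
  --sync-group-name corp-share-sync \
  --name corp-share-cloud-endpoint \
  --storage-account contosofilestorage \
  --azure-file-share-name corp-share
```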
Configuring server endpoints
A server endpoint represents a specific location on a registered server, such as a folder on a server volume. A server
endpoint must be a path on a registered server (rather than a mounted share). To use cloud tiering, the server
endpoint path must be on a non-system volume. NAS is not supported.
Azure File Share to Windows file share relationships
You should deploy Azure file shares 1 to 1 with Windows file shares wherever possible. The server endpoint object
gives you a great degree of flexibility on how you set up the sync topology on the server-side of the sync
relationship. To simplify management, make the path of the server endpoint match the path of the Windows file
share.
Use as few Storage Sync Services as possible. This will simplify management when you have sync groups that
contain multiple server endpoints, because a Windows Server can only be registered to one Storage Sync Service at
a time.
Pay attention to a storage account's I/O operations per second (IOPS) limitations when deploying Azure file
shares. Ideally, you would map file shares 1:1 with storage accounts; however, this might not always be possible
because of various limits and restrictions, both from your organization and from Azure. When it's not possible to have
only one file share deployed in one storage account, ensure that your most active file shares aren't placed in the
same storage account together.
Topology recommendations: firewalls, edge networks, and proxy connectivity
Consider the following recommendations for solution topology.
Firewall and traffic filtering
Based on your organization's policy or unique regulatory requirements, you might require more restrictive
communication with Azure. Therefore, Azure File Sync provides several mechanisms for you to configure
networking. Based on your requirements, you can:
Tunnel sync and file upload/download traffic over your Azure ExpressRoute or Azure virtual private network
(VPN).
Make use of Azure Files and Azure networking features such as service endpoints and private endpoints.
Configure Azure File Sync to support your proxy in your environment.
Throttle network activity from Azure File Sync.
To learn more about Azure File Sync and networking, refer to Azure File Sync networking considerations.
Configuring proxy servers
Many organizations use a proxy server as an intermediary between resources inside their on-premises network and
resources outside their network, such as in Azure. Proxy servers are useful for many applications, such as network
isolation and security, and monitoring and logging. Azure File Sync can interoperate fully with a proxy server;
however, you must manually configure the proxy endpoint settings for your environment with Azure File Sync. This
must be done via Azure PowerShell using the Azure File Sync server cmdlets.
For more information on how to configure Azure File Sync with a proxy server, refer to Azure File Sync proxy and
firewall settings.

Scalability considerations
You should consider the type and performance of the storage account being used to host Azure file shares. All
storage resources that are deployed into a storage account share the limits that apply to that storage account. To
find out more about determining the current limits for a storage account, refer to Azure Files scalability and
performance targets.
There are two main types of storage accounts you will use for Azure Files deployments:
General purpose version 2 (GPv2) storage accounts. GPv2 storage accounts allow you to deploy Azure file
shares on standard/hard disk-based (HDD-based) hardware. In addition to storing Azure file shares, GPv2
storage accounts can store other storage resources such as blob containers, queues, or tables.
FileStorage storage accounts: FileStorage storage accounts allow you to deploy Azure file shares on
premium/solid-state disk-based (SSD-based) hardware. FileStorage accounts can only be used to store
Azure file shares; you cannot deploy any other storage resources (such as blob containers, queues, and
tables) in a FileStorage account.
By default, standard file shares can span only up to 5 tebibytes (TiB), although the share limit can be increased to
100 TiB. To do this, the large file share feature must be enabled at the storage-account level. Premium storage
accounts (FileStorage storage accounts) don't have the large file share feature flag, because all premium file
shares are already enabled for provisioning up to the full 100 TiB capacity. You can only enable large file shares
on locally redundant or zone-redundant standard storage accounts. After you have enabled the large file share
feature flag, you can't change the redundancy level to geo-redundant storage or geo-zone-redundant storage. To
enable large file shares on an existing storage account, enable the Large file share option in the
Configuration view of the associated storage account.
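As a hedged example with a hypothetical account name, the large file share feature can also be enabled from the Azure CLI:

```azurecli
# Only valid on locally redundant or zone-redundant standard (GPv2) storage accounts;
# once enabled, the account can't be converted to geo-redundant or geo-zone-redundant storage.
az storage account update \
  --name contosofilestorage \
  --resource-group rg-files \
  --enable-large-file-share
```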

Availability considerations
You should ensure that Azure File Sync is supported in the regions to which you want to deploy your solution.
For more information about this, refer to Azure file sync region availability.
You should ensure that the services referenced in the Architecture section are supported in the region to which
your hybrid file services architecture is deployed.
To protect the data in your Azure file shares against data loss or corruption, all Azure file shares store multiple
copies of each file as they are written. Depending on the requirements of your workload, you can select
additional degrees of redundancy.
Previous Versions is a Windows feature that enables you to utilize server-side Volume Shadow Copy Service
(VSS) snapshots of a volume to present restorable versions of a file to an SMB client. VSS snapshots and
Previous Versions work independently of Azure File Sync. However, cloud tiering must be set to a compatible
mode. Many Azure File Sync server endpoints can exist on the same volume; you have to enable this compatible
mode, by way of a PowerShell call, on each volume that has even one server endpoint where you plan to use, or
are already using, cloud tiering. For more information about Previous Versions and VSS, refer to Self-service
restore through Previous Versions and VSS (Volume Shadow Copy Service).

Manageability considerations
The Azure File Sync agent is updated on a regular basis to add new functionality and address issues. We
recommend you configure Microsoft Update to get updates for the Azure File Sync agent as soon as they're
available. For more information, refer to Azure File Sync agent update policy.
Azure Storage offers soft delete for file shares (preview) so that you can more easily recover your data when it's
mistakenly deleted by an application or other storage account user. To learn more about soft delete, refer to
Enable soft delete on Azure file shares.
Cloud tiering is an optional feature of Azure File Sync in which frequently accessed files are cached locally on the
server while all other files are tiered to Azure Files based on policy settings. When a file is tiered, the Azure File
Sync file system filter (StorageSync.sys) replaces the file locally with a pointer to the file in Azure Files. A tiered
file has both the offline attribute and the FILE_ATTRIBUTE_RECALL_ON_DATA_ACCESS attribute set in NTFS
so that third-party applications can securely identify tiered files. For more information, refer to Cloud Tiering
Overview.

Security considerations
Azure File Sync works with your standard Active Directory Domain Services (AD DS)–based identity without any
special setup beyond setting up Azure File Sync. When you use Azure File Sync, file access will typically go
through the Azure File Sync caching servers rather than through the Azure file share. Because the server
endpoints are located on Windows Server machines, the only requirement for identity integration is using
domain-joined Windows file servers to register with the Storage Sync Service. Azure File Sync will store access
control lists (ACLs) for the files in the Azure file share, and will replicate them to all server endpoints.
Even though changes made directly to the Azure file share will take longer to sync to the server endpoints in the
sync group, you might also want to ensure that you can enforce your AD DS permissions on your file share
directly in the cloud as well. To do this, you must domain join your storage account to your on-premises AD DS
domain, just as how your Windows file servers are domain joined. To learn more about domain joining your
storage account to a customer-owned AD DS instance, refer to Overview of Azure Files identity-based
authentication options for SMB access.
When using Azure File Sync, there are three different layers of encryption to consider:
Encryption at rest for data stored in Windows Server. There are two strategies for encrypting data on
Windows Server that work generally with Azure File Sync: encryption beneath the file system such that
the file system and all of the data written to it is encrypted, and encryption within the file format itself.
These methods are not mutually exclusive; they can be used together if desired since the purpose of
encryption is different.
Encryption in transit between the Azure File Sync agent and Azure. Azure File Sync agent communicates
with your Storage Sync Service and Azure file share using the Azure File Sync REST protocol and the
FileREST protocol, both of which always use HTTPS over port 443. Azure File Sync does not send
unencrypted requests over HTTP.
Encryption at rest for data stored in the Azure file share. All data stored in Azure Files is encrypted at rest
using Azure storage service encryption (SSE). Storage service encryption works similarly to BitLocker on
Windows: data is encrypted beneath the file system level. Because data is encrypted beneath the Azure
file share's file system, as it's encoded to disk, you don't have to have access to the underlying key on the
client to read or write to the Azure file share.

Cost considerations
Consult the Principles of cost optimization page in the Azure Well-Architected Framework for cost optimization
recommendations.
The Azure Storage Pricing page provides detailed pricing information based on account type, storage capacity,
replication, and transactions.
The Data Transfers Pricing Details provides detailed pricing information for data egress.
You can use the Azure Storage Pricing Calculator to help estimate your costs.
Hybrid availability and performance monitoring

This reference architecture illustrates how to design Azure Monitor to monitor performance and availability for
operating system (OS) workloads running in virtual machines (VMs) in Microsoft Azure, in on-premises
environments, and with third-party cloud providers.

Download a Visio file of this architecture.

Architecture explanation
The architecture consists of the following components:
On-premises main office - VM 1 . This is a web application with internet access and a public-facing webpage,
and both log analytics and dependency agents installed. For information about agents, refer to Log Analytics
agent overview and Overview of Azure Monitor agents, Dependency agent.
On-premises main office - VM 2 . This business-logic environment doesn't have internet access. It does,
however, have log analytics and dependency agents installed.
On-premises main office - VM 3 . This is a datastore without internet access, but with log analytics and
dependency agents installed.
On-premises main office - Log Analytics gateway . The Log Analytics gateway collects log and metric data
from the three on-premises VMs, and delivers them into the Log Analytics Workspace over Transmission Control
Protocol (TCP) on port 443.
On-premises main office - Firewall . Traffic to and from the on-premises environment is routed through the
firewall.
Gateway . The gateway provides connectivity to the branch office.
On-premises branch office - VM 4 . This is the business application that's running without internet access,
but with log analytics and dependency agents installed. The Log Analytics agent installed on the VM is
configured to transfer data directly to the Log Analytics workspace without the need for a Log Analytics gateway.
On-premises branch office - Gateway . This gateway connects the branch office to the on-premises main
office via a Virtual Private Network (VPN).
Third-party cloud provider - VM 5 . This is a web application with internet access, a public-facing webpage,
and both log analytics and dependency agents installed.
Third-party cloud provider - VM 6 . This is a datastore environment without internet access, but with both
log analytics and dependency agents installed. There is no direct connectivity from the third-party cloud
provider environments to the on-premises environments.
Azure - Virtual Machine Scale Sets . Azure Virtual Machine Scale Sets (VMSS) is a scale set running a
business application with the log analytics and diagnostic agents installed.
Azure - Application ser ver . This server has a single VM running a business application, with log analytics and
diagnostic agents installed.
Azure Monitor metrics . Data collected by Azure Monitor metrics is stored in a time-series database that is
optimized for analyzing time-stamped data. It also stores metrics sent from on-premises VMs and Azure VMs.
Azure Monitor - Log Analytics workspace . This workspace stores logs sent from on-premises VMs, Azure
VMs and VMs on third-party cloud providers. The workspace is an Azure resource where data is aggregated, and
it serves as an administrative boundary for accessing this data. Other Azure Monitor services then connect to
the Log Analytics workspace and use the data for various purposes. For more information, refer to Designing
your Azure Monitor Logs deployment.
Azure Monitor - Insights - Application Insights . Application Insights provides analyses of applications, and
insights into their use. In this example architecture, an availability ping test checks the availability of the on-
premises web application, with Alert rules enabled to provide notification of a failed test. For more information,
refer to What is Application Insights? and Monitor the availability of any website.
Azure Monitor - Insights - Azure Monitor for VMs . Azure Monitor for VMs monitors the performance and
health of your virtual machines and virtual machine scale sets, including their running processes and
dependencies on other resources. In this scenario it will provide insights into your virtual machines. For more
information refer to What is Azure Monitor for VMs?
Azure Monitor - Analysis . Log and metric data from the VMs are queried within Azure Monitor metrics, and
the Log Analytics workspace using the Kusto Query Language (KQL), to provide insights into the infrastructure,
topology, and resources. For more information refer to Kusto: Overview and Azure Monitor log query examples.
Azure Monitor - Visualizations . Azure Monitor uses visualization tools to review application and
infrastructure components and communications between services in Azure Monitor. Visualization tools include
Application Insights' Application Map , Azure Monitor for VMs Map , Workbooks , and various
Dashboard views available within Azure Monitor. For more information, refer to Use the Map feature of Azure
Monitor for VMs to understand application components, Create and share dashboards of Log Analytics data, and
Azure Monitor Workbooks.
Azure Monitor - Integrations . Azure Monitor integrates with a range of partner and third-party tools and
extensions which enhance and build upon existing Azure Monitor functionality such as Analysis and
Visualizations.
Azure Monitor - Actions - Alerts . Variations in metric and log data can indicate the occurrence of events.
Rules define the data variations that trigger alerts, provide notifications, and initiate remediation responses. In
this architecture, when an alert is triggered, automation runbooks automatically remediate the on-premises VMs
and Azure VMs. Webhook actions, service management integration, and other action types are also available.
For more information, refer to Create, view, and manage metric alerts using Azure Monitor and Create, view, and
manage log alerts using Azure Monitor.
Azure Monitor - Actions - Autoscale . Autoscale adds or removes VM instances as demand changes, which
maintains performance and increases cost effectiveness. In this architecture, Autoscale has conditions defined
around average CPU load (in percentage). When conditions are met, Azure Monitor Autoscale will adjust the
scale set according to demand. For more information, refer to Overview of autoscale in Microsoft Azure.

Recommendations
The following best practices are recommendations that apply for most scenarios. Follow these unless you have a
specific requirement that overrides them.
Log Analytics workspace
Consider the following recommendations when designing the Log Analytics workspace:
Place the workspace and resources in the same Azure region, if latency is an important factor.
Start with a single Log Analytics workspace, and increase the number of workspaces as the requirements evolve.
If you've geographically dispersed teams and resources, you might need one workspace per region.
Your workspace doesn't need to be in the same subscription as the resources you're running.
Alerts
For simple or more straightforward scenarios, you can use metrics to flag alerts rather than logs. Metrics:
Provide a count, or numerical value, for events such as CPU usage, available memory, or logical disk space.
Have low latency.
Offer greater granularity, for example per-second or per-minute intervals.
Notify you about an issue quickly.
If you want to collect some custom performance indicators or business-specific metrics to provide deeper insights,
you can do this using custom metrics. For more information, refer to Custom metrics in Azure Monitor (Preview).
Metrics alerts are not the answer in all situations, however. You might still want to use log-based alerts when you
require more customization or more powerful correlations.
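As a minimal, hypothetical sketch of a metric alert, the following Azure CLI command fires when average CPU on a VM exceeds 80 percent over a five-minute window; the subscription ID, resource names, and action group are placeholders.

```azurecli
az monitor metrics alert create \
  --name high-cpu-alert \
  --resource-group rg-monitoring \
  --scopes "/subscriptions/<subscription-id>/resourceGroups/rg-monitoring/providers/Microsoft.Compute/virtualMachines/app-vm-01" \
  --condition "avg Percentage CPU > 80" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action "/subscriptions/<subscription-id>/resourceGroups/rg-monitoring/providers/microsoft.insights/actionGroups/ops-team"
```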
Analysis and Diagnostics
Consider the following recommendations for analysis and diagnostics:
Use logs for deeper analysis. Logs can:
Provide verbose detail about events (compared to metrics).
Happen intermittently.
Facilitate deeper diagnostics after an initial metric flag.
You can customize log data collection (which is similar to metrics) using the HTTP Data Collector API to send log
data to a Log Analytics workspace. For more information, refer to Send log data to Azure Monitor with the HTTP
Data Collector API (public preview).
In addition, analyze your applications proactively with the Application Insights smart detection feature. Smart
detection applies Azure's machine learning capabilities, and statistical analysis, to detect issues such as performance
or failure anomalies, memory leaks, or general application degradation. For more information, refer to Smart
Detection in Application Insights.
Also, use Azure Monitor for VMs - Map to review connections between servers, processes, inbound and
outbound connection latency, and ports across any TCP-connected architecture. No configuration is required other
than installing an agent. With Azure Monitor for VMs - Map , you can interact and engage with your servers as
interconnected systems.
Log Analytics queries
Query the data within a Log Analytics workspace by using KQL to search for terms, specific events, or states to
identify trends and analyze patterns. Use the Query explorer to browse and select pre-written queries, modify
them, or create your own. You can run, save, share, and export queries from within a workspace, and pin your
favorite queries to a dashboard for reuse.
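As a hedged example (hypothetical workspace GUID, and assuming the log-analytics capability is available in your CLI version), you can also run KQL from the Azure CLI; the query below summarizes average processor time per computer over the last hour:

```azurecli
az monitor log-analytics query \
  --workspace "00000000-0000-0000-0000-000000000000" \
  --analytics-query "Perf | where ObjectName == 'Processor' and CounterName == '% Processor Time' | summarize avg(CounterValue) by Computer, bin(TimeGenerated, 5m)" \
  --timespan "PT1H"
```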
Agent installation
Install agents automatically and at scale, rather than individually, by using automation options such as Azure Policy,
Azure PowerShell, Resource Manager templates, or Desired State Configuration (DSC). For more information, refer
to Enable Azure Monitor for VMs by using Azure Policy, Enable Azure Monitor for VMs using Azure PowerShell, and
Enable Azure Monitor for VMs for a hybrid virtual machine - Desired State Configuration.
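For an individual Azure VM, the same agent can also be installed as a VM extension; the sketch below uses hypothetical names and placeholder workspace credentials, while Azure Policy or DSC remain the better options at scale.

```azurecli
# Installs the Log Analytics (Microsoft Monitoring Agent) extension on an existing Windows VM.
az vm extension set \
  --resource-group rg-monitoring \
  --vm-name app-vm-01 \
  --name MicrosoftMonitoringAgent \
  --publisher Microsoft.EnterpriseCloud.Monitoring \
  --settings '{"workspaceId": "<workspace-id>"}' \
  --protected-settings '{"workspaceKey": "<workspace-key>"}'
```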
Dashboard
For critical applications, create an Azure Dashboard view. Share or make your dashboard available on a shared
screen, in real time, to people who need critical application data.

Scalability considerations
The following are considerations for scaling your environment.
Automate installation and configuration of your resources and applications.
Large-scale geographically dispersed applications. Use Distributed Tracing within Application Insights to track
dependencies and calls across multiple application components, backend resources, and microservices
environments. With Distributed Tracing you can debug applications that call across process boundaries,
outside the local stack. (You don't need to enable Distributed Tracing; it's available automatically as part of
Application Insights.)
Two options for consuming distributed trace data are:
Transaction diagnostics experience. This is similar to a call stack with an added time dimension. The
transaction diagnostics experience provides visibility into one single transaction/request, and it's
helpful for finding the root cause of reliability issues and performance bottlenecks on a per-request
basis. For more information, refer to What is Distributed Tracing?.
Application map experience. This aggregates many transactions to demonstrate how systems
interact topologically, and provide average performance and error rates. For more information,
refer to Application Map: Triage Distributed Applications.

Availability considerations
The following are considerations for ensuring availability in your environment.
Availability tests. The URL ping test used in this architecture is the simplest outside-in availability test. However,
other options are available, such as:
Multi-step web test. Plays back recordings of sequenced web requests to test complex scenarios. Multiple-
step web tests are created in Microsoft Visual Studio Enterprise, and then uploaded to the portal for
execution.
Custom track availability tests. Use the TrackAvailability() method to send test results to Application
Insights.
Alerts. When you create an availability test in Application Insights, event alert notifications are enabled by
default. You can edit the alert rules by specifying the notification type and details, from Azure Monitor >
Alerts .
Manageability considerations
The following are considerations for making your environment more manageable.
Workbooks. Use workbooks to help perform further analysis, and create rich reports. Workbooks combine
text, log queries, metrics, and parameters into interactive reports. Team members with access to the same Azure
resources can edit workbooks. For more information, refer to Create interactive reports Azure Monitor for VMs
with workbooks.
Partner integrations. Integrate Azure Monitor with partner and third-party tools to assist with analysis,
visualization, alerts, or service management and DevOps pipelines. For more information, refer to Azure Monitor
partner integrations.
Integrate Azure Monitor with Microsoft System Center. Integrate Azure Monitor with the System Center product
suite. For more information, refer to Connect Operations Manager to Azure Monitor.
Send data to Azure Event Hub. For integrating Azure Monitor with visualization and external monitoring tools,
refer to Stream Azure monitoring data to an event hub or external partner.
Log Analytics gateway. For smaller environments such as the branch office in this architecture, use the agent to
transfer data into the Log Analytics workspace, rather than into a gateway. For more information, refer to
Establish connectivity to Azure Log Analytics.

Security considerations
The following are considerations for making your environment more secure.
Log Analytics workspace. Access modes are defined as one of the following:
Workspace context. All logs that the workspace has permission to access can be queried. This is a vertical
access approach. For example, a security team might need access to all resource data from the top down.
Resource context. Only logs for specific resources can be queried. For example, an application team can be
granted access to logs for the particular resource they're working on.
Secure data in transit to Log Analytics. Data in transit is secured by using a minimum of Transport Layer Security
(TLS) 1.2. You don't need to enable this explicitly. For more information, refer to Log Analytics data security.
Secure data at rest in Log Analytics. Data at rest in Log Analytics is secured, as per Azure Storage, using 256-bit
Advanced Encryption Standard (AES) encryption by default.
Smart Detection. Use Smart Detection in Application Insights to analyze the telemetry generated by your
application, and to detect security issues. For more information, refer to Application security detection pack
(preview).
Integrate Azure Monitor with Security Information and Event Management (SIEM) tools. Route your monitoring
data to an event hub with Azure Monitor to integrate external SIEM and monitoring tools. For more information,
refer to Stream Azure monitoring data to an event hub or external partner.

DevOps considerations
The following are considerations for integrating your environment with DevOps processes and solutions.
Application Insights. Integrate Application Insights into DevOps pipelines to help make performance and
usability improvements. Application Insights can detect performance anomalies automatically, and it connects to
a variety of development tools such as Azure DevOps services and GitHub.
Application Instrumentation. Instrument applications by modifying application code to enable telemetry with
Application Insights. The following methods are ways to instrument applications:
At runtime. Instrumenting your web application on the server at runtime is ideal for applications that are
deployed already, as it avoids having to update code. Suitable scenarios include:
Microsoft ASP.NET or ASP.NET Core applications hosted on Azure Web Apps
ASP.NET applications hosted in Microsoft Internet Information Services (IIS) on an Azure VM or
Azure VMSS
ASP.NET applications hosted in IIS on an on-premises VM
Java-based Azure Functions
Node.JS apps on Linux App Services
Microservices hosted on AKS
At development time. Add Application Insights to your code to customize telemetry collection and send
additional data. Supported languages and platforms include:
ASP.NET applications
ASP.NET Core applications
.NET Console applications
Java
Node.js
Python
Use IT Service Management Connector (ITSMC) to connect to external IT service management (ITSM) tools.
ITSMC connects Azure to supported ITSM products and services, where issue-related work items typically reside.
For more information, refer to Connect Azure to ITSM tools using IT Service Management Connector.

Cost considerations
The following are considerations for controlling and managing costs in your environment.
Azure Monitor. Azure Monitor costs are consumption-based, often referred to as pay as you go.
Log Analytics. You pay for data ingestion and data retention . You can estimate and forecast the number of
VMs, and the amount of data (in gigabytes) you expect to collect from each VM. A typical Azure VM consumes
between 1 and 3 gigabytes (GB) of data each month. If you're evaluating data usage with Azure Monitor logs, use
the data statistics from your own environment and obtain a discount with Capacity reservations .
Application Insights. This is billed according to the volume of telemetry data your application sends, and the
number of web tests you run.
Metric queries. Metric queries are billed by the number of calls made.
Alerts. Alerts are billed based on the type, and number, of signals monitored.
Notifications. Notifications are billed according to the type, and number, of notifications you send.
Azure Monitor. The Usage and estimated costs section of Azure Monitor estimates your monthly costs based
on the previous 31 days of usage.
For more information, refer to Azure Monitor pricing and Pricing calculator.
Hybrid Security Monitoring using Azure Security
Center and Azure Sentinel

This reference architecture illustrates how to use Azure Security Center and Azure Sentinel to monitor the security
configuration and telemetry of on-premises and Azure operating system workloads. This includes Azure Stack.

Download a Visio file of this architecture.


Typical uses for this architecture include:
Best practices for integrating on-premises security and telemetry monitoring with Azure-based workloads
How to integrate Azure Security Center with Azure Stack
How to integrate Azure Security Center with Azure Sentinel

Architecture
The architecture consists of the following components:
Azure Security Center . This is an advanced, unified security-management platform that Microsoft offers to all
Azure subscribers. Security Center is segmented as a cloud security posture management (CSPM) and cloud
workload protection platform (CWPP). CWPP is defined by workload-centric security protection solutions, which
are typically agent-based. Azure Security Center provides threat protection for Azure workloads, both on-
premises and in other clouds, including Windows and Linux virtual machines (VMs), containers, databases, and
Internet of Things (IoT). When activated, the Log Analytics agent deploys automatically into Azure Virtual
Machines. For on-premises Windows and Linux servers and VMs, you can manually deploy the agent, use your
organization's deployment tool, such as Microsoft Endpoint Configuration Manager, or utilize scripted deployment
methods. Security Center begins assessing the security state of all your VMs, networks, applications, and data.
Azure Sentinel . A cloud-native security information and event management (SIEM) and security
orchestration, automation, and response (SOAR) solution that uses advanced AI and security analytics to help you
detect, hunt for, prevent, and respond to threats across your enterprise.
Azure Stack . A portfolio of products that extend Azure services and capabilities to your environment of
choice, from the datacenter to edge locations and remote offices. Azure Stack integrated systems typically use
racks of four to sixteen servers, built by trusted hardware partners and delivered straight to your
datacenter.
Azure Monitor . Collects monitoring telemetry from a variety of on-premises and Azure sources. Management
tools, such as those in Azure Security Center and Azure Automation, also push log data to Azure Monitor.
Log Analytics workspace . Azure Monitor stores log data in a Log Analytics workspace, which is a container
that includes data and configuration information.
Log Analytics agent . The Log Analytics agent collects monitoring data from the guest operating system and
VM workloads in Azure, other cloud providers, and on-premises. The Log Analytics Agent supports Proxy
configuration and, typically in this scenario, a Microsoft Operations Management Suite (OMS) Gateway acts as
proxy.
On-premises network . This is the firewall configured to support HTTPS egress from defined systems.
On-premises Windows and Linux systems . Systems with the Log Analytics Agent installed.
Azure Windows and Linux VMs . Systems on which the Azure Security Center monitoring agent is installed.

Recommendations
The following recommendations apply for most scenarios. Follow these recommendations unless you have a
specific requirement that overrides them.
Azure Security Center upgrade
This reference architecture uses Azure Security Center to monitor on-premises systems, Azure VMs, Azure
Monitor resources, and even VMs hosted by other cloud providers. To support that functionality, the standard fee-
based tier of Azure Security Center is needed. We recommend that you use the 30-day free trial to validate your
requirements.
Details about Azure Security Center pricing can be found here.
Customized Log Analytics Workspace
Azure Sentinel needs access to a Log Analytics workspace. In this scenario, you can’t use the default ASC Log
Analytics workspace with Azure Sentinel. You’ll need to create a customized workspace. Data retention for a
customized workspace is based on the workspace pricing tier, and you can find pricing models for Monitor Logs
here.

NOTE
Azure Sentinel can run on workspaces in any general availability (GA) region of Log Analytics except the China and Germany
(Sovereign) regions. Data that Azure Sentinel generates, such as incidents, bookmarks, and alert rules, which may contain
some customer data sourced from these workspaces, is saved either in Europe (for Europe-based workspaces), in Australia (for
Australia-based workspaces), or in the East US (for workspaces located in any other region).

Scalability considerations
The Log Analytics Agent for Windows and Linux is designed to have very minimal impact on the performance of
VMs or physical systems.
The Azure Security Center operational process won't interfere with your normal operational procedures. Instead, it
passively monitors your deployments and provides recommendations based on the security policies you enable.

Manageability considerations
Azure Security Center roles
Security Center assesses your resources’ configuration to identify security issues and vulnerabilities, and displays
information related to a resource when you are assigned the role of owner, contributor, or reader for the
subscription or resource group to which a resource belongs.
In addition to these roles, there are two specific Security Center roles:
Security Reader . A user that belongs to this role has read only rights to Security Center. The user can
observe recommendations, alerts, a security policy, and security states, but can’t make changes.
Security Admin . A user that belongs to this role has the same rights as the Security Reader, and also can
update security policies, and dismiss alerts and recommendations. Typically, these are users that manage the
workload.
The security roles, Security Reader and Security Admin , have access only in Security Center. The security
roles don’t have access to other Azure service areas, such as storage, web, mobile, or IoT.
Azure Sentinel subscription
To enable Azure Sentinel, you need contributor permissions to the subscription in which the Azure Sentinel
workspace resides.
To use Azure Sentinel, you need contributor or reader permissions on the resource group to which the
workspace belongs.
Azure Sentinel is a paid service. For more information, refer to Azure Sentinel pricing.

Security considerations
A security policy defines the set of controls that are recommended for resources within a specified subscription.
In Azure Security Center, you define policies for your Azure subscriptions according to your company's security
requirements and the type of applications or data sensitivity for each subscription.
The security policies that you enable in Azure Security Center drive security recommendations and monitoring. To
learn more about security policies, refer to Strengthen your security policy with Azure Security Center. You can
assign security policies in Azure Security Center only at the management group or subscription level.

NOTE
Part one of the reference architecture details how to enable Azure Security Center to monitor Azure resources, on-premises
systems, and Azure Stack systems.

Deploy the solution


Create a Log Analytics workspace in Azure Portal
1. Sign into the Azure portal as a user with Security Admin privileges.
2. In the Azure portal, select All services . In the list of resources, enter Log Analytics . As you begin entering, the
list filters based on your input. Select Log Analytics workspaces .
3. Select Add on the Log Analytics page.
4. Provide a name for the new Log Analytics workspace, such as ASC-SentinelWorkspace . This name must be
globally unique across all Azure Monitor subscriptions.
5. Select a subscription by selecting from the drop-down list if the default selection is not appropriate.
6. For Resource Group , choose to use an existing resource group or create a new one.
7. For Location , select an available geolocation.
8. Select OK to complete the configuration.
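If you prefer to script the workspace creation instead of using the portal steps above, a minimal sketch with the Az PowerShell module follows. The resource group name, region, and pricing tier shown here are example values that you should replace with your own.

```powershell
# Sketch: create the Log Analytics workspace used in this architecture with Azure PowerShell.
# Resource group name and region are example values; adjust them to your environment.
Connect-AzAccount

New-AzResourceGroup -Name 'ASC-Sentinel-RG' -Location 'eastus'

New-AzOperationalInsightsWorkspace `
    -ResourceGroupName 'ASC-Sentinel-RG' `
    -Name 'ASC-SentinelWorkspace' `
    -Location 'eastus' `
    -Sku 'PerGB2018'
```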
Enable Security Center
While you're still signed into the Azure portal as a user with Security Admin privileges, select Security Center in
the panel. Security Center - Overview opens:

Security Center automatically enables the Free tier for any of the Azure subscriptions not previously onboarded by
you or another subscription user.
Upgrade to the Standard tier

IMPORTANT
This reference architecture uses the 30-day free trial of Security Center Standard tier.

1. On the Security Center main menu, select Getting Started .


2. Select the Upgrade Now button. Security Center lists your subscriptions and workspaces that are eligible for
use in the Standard tier.
3. You can select eligible workspaces and subscriptions to start your trial. Select the previously created workspace,
ASC-SentinelWorkspace, from the drop-down menu.
4. In the Security Center main menu, select Start trial .
5. The Install Agents dialog box should display.
6. Select the Install Agents button. The Security Center - Coverage blade displays and you should observe
your selected subscription in the Standard coverage tab.
You've now enabled automatic provisioning and Security Center will install the Log Analytics Agent for Windows
(HealthService.exe) and the omsagent for Linux on all supported Azure VMs and any new ones that you create.
You can turn off this policy and manually manage it, although we strongly recommend automatic provisioning.
To learn more about the specific Security Center features available in Windows and Linux, refer to Feature coverage
for machines.
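If you want to script the upgrade rather than use the portal, a hedged sketch with the Az.Security PowerShell module follows. The resource type name shown (VirtualMachines) is an example; review the names that Get-AzSecurityPricing returns in your subscription before changing tiers.

```powershell
# Sketch: review and upgrade Security Center pricing tiers with Azure PowerShell (Az.Security module).
# Run this in the context of the subscription you want to upgrade.
Get-AzSecurityPricing | Select-Object Name, PricingTier

# Example: enable the Standard tier for the VirtualMachines resource type.
Set-AzSecurityPricing -Name 'VirtualMachines' -PricingTier 'Standard'
```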
Enable Azure Security Center monitoring of on-premises Windows computers
1. In the Azure portal, on the Security Center - Overview blade, select the Get Started tab.
2. Select Configure under Add new non-Azure computers . A list of your Log Analytics workspaces displays,
and should include the ASC-SentinelWorkspace .
3. Select this workspace. The Direct Agent blade opens with a link for downloading a Windows agent and keys for
your workspace identification (ID) to use when you configure the agent.
4. Select the Download Windows Agent link applicable to your computer processor type to download the setup
file.
5. To the right of Workspace ID , select Copy , and then paste the ID into Notepad.
6. To the right of Primary Key , select Copy , and then paste the key into Notepad.
Install the Windows agent
To install the agent on the targeted computers, follow these steps.
1. Copy the file to the target computer and then Run Setup .
2. On the Welcome page, select Next .
3. On the License Terms page, read the license and then select I Agree .
4. On the Destination Folder page, change or keep the default installation folder and then select Next .
5. On the Agent Setup Options page, choose to connect the agent to Azure Log Analytics and then select Next .
6. On the Azure Log Analytics page, paste the Workspace ID and Workspace Key (Primary Key) that you
copied into Notepad in the previous procedure.
7. If the computer should report to a Log Analytics workspace in Azure Government cloud, select Azure US
Government from the Azure Cloud drop-down list. If the computer needs to communicate through a proxy
server to the Log Analytics service, select Advanced , and then provide the proxy server's URL and port number.
8. After you provide the necessary configuration settings, select Next .
9. On the Ready to Install page, review your choices and then select Install .
10. On the Configuration completed successfully page, select Finish .
When complete, the Log Analytics agent appears in Windows Control Panel, and you can review your configuration
and verify that the agent is connected.
For further information about installing and configuring the agent, refer to Install Log Analytics agent on Windows
computers.
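If you need to install the agent on many on-premises Windows computers, the wizard steps above can also be run silently. The following sketch is based on the documented command-line options for the Log Analytics agent installer; the extraction path is an example, and you should verify the property names against the installer version you download.

```powershell
# Sketch: silent install of the Log Analytics agent for Windows.
# Replace <workspace-id> and <workspace-key> with the values copied from the Direct Agent blade.
$workspaceId  = '<workspace-id>'
$workspaceKey = '<workspace-key>'

# Extract the downloaded bundle, then run the embedded setup silently.
Start-Process -FilePath '.\MMASetup-AMD64.exe' -ArgumentList '/c', '/t:C:\MMA' -Wait
Start-Process -FilePath 'C:\MMA\setup.exe' -Wait -ArgumentList @(
    '/qn', 'NOAPM=1', 'ADD_OPINSIGHTS_WORKSPACE=1',
    'OPINSIGHTS_WORKSPACE_AZURE_CLOUD_TYPE=0',
    "OPINSIGHTS_WORKSPACE_ID=$workspaceId",
    "OPINSIGHTS_WORKSPACE_KEY=$workspaceKey",
    'AcceptEndUserLicenseAgreement=1'
)
```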
The Log Analytics agent service collects event and performance data and executes tasks and other workflows
defined in a management pack. Security Center extends its cloud workload protection platforms by integrating with
Microsoft Defender Advanced Threat Protection (ATP) for Servers . Together, they provide comprehensive
endpoint detection and response (EDR) capabilities.
For more information about Microsoft Defender ATP, refer to Onboard servers to the Microsoft Defender ATP
service.
Enable Azure Security Center monitoring of on-premises Linux computers
1. Return to the Getting Started tab as previously described.
2. Select Configure under Add new non-Azure computers . A list of your Log Analytics workspaces displays.
The list should include the ASC-SentinelWorkspace that you created.
3. On the Direct Agent blade under DOWNLOAD AND ONBOARD AGENT FOR LINUX, select copy to copy
the wget command.
4. Open Notepad and then paste this command. Save this file to a location that you can access from your Linux
computer.

NOTE
On Unix and Linux operating systems, wget is a tool for non-interactive file downloading from the web. It supports HTTP,
HTTPS, FTP, and FTPS, and it can work through proxies.

The Linux agent uses the Linux Audit Daemon framework. Security Center integrates functionalities from this
framework within the Log Analytics agent, which enables audit records to be collected, enriched, and aggregated
into events by using the Log Analytics Agent for Linux. Security Center continuously adds new analytics that use
Linux signals to detect malicious behaviors on cloud and on-premises Linux machines.
For a list of the Linux alerts, refer to the Reference table of alerts.
Install the Linux agent
To install the agent on the targeted Linux computers, follow these steps:
1. On your Linux computer, open the file that you previously saved. Select and copy the entire content, open a
terminal console, and then paste the command.
2. Once the installation finishes, you can validate that the omsagent is installed by running the pgrep command.
The command will return the omsagent process identifier (PID). You can find the logs for the agent at:
/var/opt/microsoft/omsagent/"workspace id"/log/ .
It can take up to 30 minutes for the new Linux computer to display in Security Center.
Enable Azure Security Center monitoring of Azure Stack VMs
After you onboard your Azure subscription, you can enable Security Center to protect your VMs running on Azure
Stack by adding the Azure Monitor, Update and Configuration Management VM extension from the Azure
Stack marketplace. To do this:
1. Return to the Getting Started tab as previously described.
2. Select Configure under Add new non-Azure computers . A list of your Log Analytics workspaces displays,
and it should include the ASC-SentinelWorkspace that you created.
3. On the Direct Agent blade there is a link for downloading the agent and keys for your workspace ID to use
during agent configuration. You don’t need to download the agent manually. It’ll be installed as a VM extension
in the following steps.
4. To the right of Workspace ID , select Copy , and then paste the ID into Notepad.
5. To the right of Primary Key , select Copy , and then paste the key into Notepad.
Enable ASC monitoring of Azure Stack VMs
Azure Security Center uses the Azure Monitor, Update and Configuration Management VM extension
bundled with Azure Stack. To enable the Azure Monitor, Update and Configuration Management extension,
follow these steps:
1. In a new browser tab, sign into your Azure Stack portal.
2. Refer to the Virtual machines page, and then select the virtual machine that you want to protect with Security
Center.
3. Select Extensions . The list of VM extensions installed on this VM displays.
4. Select the Add tab. The New Resource menu blade opens and displays the list of available VM extensions.
5. Select the Azure Monitor, Update and Configuration Management extension and then select Create . The
Install extension configuration blade opens.
6. On the Install extension configuration blade, paste the Workspace ID and Workspace Key (Primary Key)
that you copied into Notepad in the previous procedure.
7. When you finish providing the necessary configuration settings, select OK .
8. Once the extension installation completes, its status will display as Provisioning Succeeded . It might take up
to one hour for the VM to appear in the Security Center portal.
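If you prefer to script the extension deployment instead of using the Azure Stack portal, the following sketch shows the general approach with Azure PowerShell from a session that is connected to your Azure Stack environment. The resource group, VM name, and type handler version are example values; the publisher, extension type, and settings keys are those used by the Log Analytics (Microsoft Monitoring Agent) VM extension.

```powershell
# Sketch: add the monitoring extension to an Azure Stack VM with Azure PowerShell.
# Assumes you're already signed in to your Azure Stack environment and the VM exists.
$publicSettings    = @{ workspaceId  = '<workspace-id>' }
$protectedSettings = @{ workspaceKey = '<workspace-key>' }

Set-AzVMExtension `
    -ResourceGroupName '<resource-group>' `
    -VMName '<vm-name>' `
    -Name 'MicrosoftMonitoringAgent' `
    -Publisher 'Microsoft.EnterpriseCloud.Monitoring' `
    -ExtensionType 'MicrosoftMonitoringAgent' `
    -TypeHandlerVersion '1.0' `
    -Settings $publicSettings `
    -ProtectedSettings $protectedSettings `
    -Location '<azure-stack-region>'
```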
For more information about installing and configuring the agent for Windows, refer to Install the agent using setup
wizard.
For troubleshooting issues for the Linux agent, refer to How to troubleshoot issues with the Log Analytics agent for
Linux.
Now you can monitor your Azure VMs and non-Azure computers in one place. The Security Center Compute blade provides
you with an overview of all VMs and computers, along with recommendations. Each column represents one set of
recommendations, and the color represents the VMs or computers and the current security state for that
recommendation. Security Center also provides any detections for these computers in security alerts.

There are two types of icons represented on the Compute blade:

Non-Azure computer

Azure computer

NOTE
Part two of the reference architecture will connect alerts from Azure Security Center and stream them into Azure Sentinel.

The role of Azure Sentinel is to ingest data from different data sources and perform data correlation across these
data sources. Azure Sentinel leverages machine learning and AI to make threat hunting, alert detection, and threat
responses smarter.
To onboard Azure Sentinel, you need to enable it, and then connect your data sources. Azure Sentinel comes with a
number of connectors for Microsoft solutions, which are available out of the box and provide real-time integration,
including Azure Security Center, Microsoft Threat Protection solutions, Microsoft 365 sources (including Office
365), Azure Active Directory (Azure AD), Azure ATP, Microsoft Cloud App Security, and more. Additionally, there are
built-in connectors to the broader security ecosystem for non-Microsoft solutions. You can also use Common Event
Format, syslog, or the Representational State Transfer API to connect your data sources with Azure Sentinel.
Requirements for integrating Azure Sentinel with Azure Security Center
1. A Microsoft Azure Subscription
2. A Log Analytics workspace that isn't the default workspace created when you enable Azure Security Center.
3. Azure Security Center with Security Center Standard tier enabled.
All three requirements should be in place if you worked through the previous section.
Global prerequisites
To enable Azure Sentinel, you need contributor permissions to the subscription in which the Azure Sentinel
workspace resides.
To use Azure Sentinel, you need contributor or reader permissions on the resource group to which the
workspace belongs.
You might need additional permissions to connect specific data sources. You don't need additional permissions
to connect to ASC.
Azure Sentinel is a paid service. For more information, refer to Azure Sentinel pricing.
Enable Azure Sentinel
1. Sign into the Azure portal with a user that has contributor rights for ASC-Sentinelworkspace .
2. Search for and select Azure Sentinel .

3. Select Add .
4. On the Azure Sentinel blade, select ASC-Sentinelworkspace .
5. In Azure Sentinel, select Data connectors from the navigation menu.
6. From the data connectors gallery, select Azure Security Center , and select the Open connector page button.

7. Under Configuration , select Connect next to those subscriptions for which you want alerts to stream into
Azure Sentinel. The Connect button will be available only if you have the required permissions and the ASC
Standard tier subscription.
8. You should now observe the Connection Status as Connecting . After connecting, it will switch to Connected .
9. After confirming the connectivity, you can close ASC Data Connector settings and refresh the page to observe
alerts in Azure Sentinel. It might take some time for the logs to start syncing with Azure Sentinel. After you
connect, you'll observe a data summary in the Data received graph and the connectivity status of the data types.
10. You can select whether you want the alerts from Azure Security Center to automatically generate incidents in
Azure Sentinel. Under Create incidents , select Enabled to turn on the default analytics rule that automatically
creates incidents from alerts. You can then edit this rule under Analytics , in the Active rules tab.
11. To use the relevant schema in Log Analytics for the Azure Security Center alerts, search for SecurityAlert .
One advantage of using Azure Sentinel as your SIEM is that it provides data correlation across multiple sources,
which enables you to have an end-to-end visibility of your organization’s security-related events.

NOTE
To learn how to increase visibility in your data and identify potential threats, refer to Azure playbooks on TechNet Gallery,
which has a collection of resources including a lab in which you can simulate attacks. You should not use this lab in a
production environment.

To learn more about Azure Sentinel, refer to the following articles:


Quickstart: Get started with Azure Sentinel
Tutorial: Detect threats out-of-the-box

Cost considerations
As previously described, costs beyond your Azure subscription might include:
1. Azure Security Center Standard tier. For more information, refer to Security Center pricing.
2. Azure Monitor workspace offers granularity of billing. For more information, refer to Manage Usage and
Costs with Azure Monitor Logs.
3. Azure Sentinel is a paid service. For more information, refer to Azure Sentinel pricing.

References
Azure Monitor
Azure Security Center
Azure Security Center Cloud Smart Alert Correlation
Azure Security Center Connect Data
Azure Security Center Coverage
Azure Security Center Endpoint Protection
Azure Security Center FAQ
Azure Security Center Planning
Azure Security Center Secure Score
Azure Security Center Security Alerts
Azure Security Center Security Policies
Azure Security Center Security Recommendations
Azure Security Center Supported Platforms
Azure Security Center Threat Protection
Azure Security Center Tutorial
Azure Sentinel
Azure Sentinel Analytics
Azure Sentinel Attack Detection
Azure Sentinel Connect Windows Firewall
Azure Sentinel Connect Windows Security Events
Azure Sentinel Data Sources
Azure Sentinel Hunting
Azure Sentinel Investigate
Azure Sentinel Monitor
Azure Sentinel Overview
Azure Sentinel Permissions
Azure Sentinel Quickstart
Azure Stack
Azure Stack Automate Onboarding PowerShell
Azure Stack Hub
Manage hybrid Azure workloads using Windows
Admin Center

This reference architecture illustrates how to design a hybrid Windows Admin Center solution to manage
workloads that are hosted on-premises and in Microsoft Azure. This architecture includes two scenarios:
Windows Admin Center deployed to a virtual machine (VM) in Azure.
Windows Admin Center deployed to a server (physical or virtual) on-premises.
The first diagram illustrates Windows Admin Center deployed to a VM in Azure.

The second diagram illustrates Windows Admin Center deployed on-premises.


Download a Visio file of these architectures.
Typical uses for this architecture include:
Organizations that want to manage individual Windows Server instances, Hyper-Converged Infrastructure or
Hyper-V VMs that run on-premises and in Microsoft Azure.
Organizations that want to manage Windows Server instances hosted by other cloud providers.

Architecture
The architecture consists of the following components:
On-premises corporate network . A private local area network that runs within an organization.
On-premises corporate firewall . An organizational firewall configured to allow users access to the Windows
Admin Center gateway when the gateway is deployed on-premises.
Windows VM . A Windows VM installed with the Windows Admin Center gateway that hosts web services with
which users can connect.
Azure AD app . An app used for all points of Azure integration in Windows Admin Center, including Azure
Active Directory (Azure AD) authentication to the gateway.
Managed nodes . Nodes managed by Windows Admin Center, which might include physical servers running
Azure Stack or Windows Server, and virtual machines running Windows Server.
VPN or ExpressRoute . A Site-to-Site (S2S) VPN or ExpressRoute to manage nodes on-premises when
Windows Admin Center is deployed in Azure.
Azure network adapter . An adapter for a Point-to-Site (P2S) VPN to Azure when Windows Admin Center is
deployed on-premises. Windows Admin Center can automatically deploy the adapter and also create an Azure
gateway.
Domain Name System (DNS) . A DNS record that users reference to connect to the Windows Admin Center
gateway.

Recommendations
The following recommendations apply for most scenarios. Follow these recommendations unless you have a
specific requirement that overrides them.
Installation types
You have multiple options when deploying Windows Admin Center, including installing on a Windows 10 PC, a
Windows server, or an Azure VM. The deployment type you choose will depend on your business requirements.
The on-premises installation types are:

Option 1: Local Client. Deploy Windows Admin Center on a Windows 10 client. This is ideal for quick
deployments, testing, and improvised or small-scale scenarios.
Option 2: Gateway Server. Install on a designated gateway server running Windows Server 2016, Windows
Server 2019, or later, and access from any client browser with connectivity to the gateway server.
Option 3: Managed Server. Install directly on a managed server running Windows Server 2016, Windows
Server 2019, or later to enable the server to manage itself or a cluster in which the managed server is a member
node. This is great for distributed branch office scenarios.
Option 4: Failover Cluster. Deploy in a failover cluster to enable high availability of the gateway service. This
is great for production environments to ensure management service resiliency.
Because Windows Admin Center has no cloud service dependencies, some organizations might choose to deploy
Windows Admin Center to an on-premises system for managing on-premises computers and then enable hybrid
services over time. However, many organizations prefer to take advantage of service offerings in Azure and other
cloud platforms that are integrated with Windows Admin Center.
Deploying Windows Admin Center on an Azure VM provides gateway functionality similar to Windows Server 2016
and Windows Server 2019. However, Azure also enables additional features for Azure VMs. Refer to Availability
considerations for more information about the benefits of deploying Windows Admin Center to Azure VMs.
Automated deployment
Microsoft provides an .msi file that you use to deploy Windows Admin Center to an on-premises VM. The .msi file
installs Windows Admin Center and automatically generates a self-signed certificate or associates an existing
certificate based on your preference.

TIP
For more information about using the .msi file to deploy Windows Admin Center, refer to Install Windows Admin Center.

When implementing Windows Admin Center on an Azure VM, Microsoft provides the
Deploy-WindowsAdminCenterAzVM.ps1 script to automatically deploy all required resources. In this scenario, the script deploys a new
Azure VM with Windows Admin Center installed and opens port 443 on the public IP address. The script can also
deploy Windows Admin Center to an existing Azure VM. When running the script, you can use variables to specify
the resource group, virtual network name, VM name, location, and many other options.

IMPORTANT
Prior to deployment, upload the certificate to an Azure Key Vault. Alternatively, you can use the Azure portal to generate a
certificate.
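If you already have a certificate, the following minimal sketch uploads it to a key vault with the Az.KeyVault module. The vault name, certificate name, and file path are placeholders.

```powershell
# Sketch: import an existing PFX certificate into Azure Key Vault before deployment.
# Vault name, certificate name, and file path are example values.
$pfxPassword = Read-Host -AsSecureString -Prompt 'PFX password'

Import-AzKeyVaultCertificate `
    -VaultName 'wac-keyvault' `
    -Name 'wac-gateway-cert' `
    -FilePath 'C:\certs\wac-gateway.pfx' `
    -Password $pfxPassword
```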

TIP
For more information about using the Deploy-WindowsAdminCenterAzVM.ps1 script to deploy Windows Admin
Center to an Azure VM, refer to Deploy Windows Admin Center in Azure.

TIP
For more information about the manual steps required to deploy Windows Admin Center to an Azure VM, refer to Deploy
manually on an existing Azure virtual machine.

Topology recommendations
While you can implement Windows Admin Center on any existing or new VM that's on-premises or hosted in
Azure, Microsoft recommends deploying Windows Admin Center gateway on a Windows Server 2019 Azure VM.
The automated deployment option discussed earlier allows you to implement Windows Admin Center on an Azure
VM more quickly while still safeguarding your environment. Deploying to an Azure VM instead of an on-premises
VM also minimizes the configuration changes required for your on-premises firewall and network.

IMPORTANT
Managing nodes on-premises requires a connection (such as S2S VPN or ExpressRoute) from Azure to on-premises.

When deployed on-premises, you can use Windows Admin Center to build the hybrid connection to Azure. The
hybrid connection, called an Azure network adapter, is a P2S VPN to Azure that allows Windows Admin Center to
access hybrid cloud features. This process also automatically creates an Azure gateway for the VPN connection. For
more information, refer to About Point-to-Site VPN.

IMPORTANT
You must have an Azure virtual network in Azure before creating an Azure network adapter.

You should also configure the Windows Admin Center gateway for high availability. This will help ensure that
management of systems and services remain available during unexpected failures. Azure provides multiple service
offerings for these scenarios regardless of whether Windows Admin Center gateway is deployed to an on-premises
or Azure VM. Refer to Availability Considerations for more information.
Certificate recommendations
Windows Admin Center gateway is a web-based tool that uses a Secure Sockets Layer (SSL) certificate to encrypt
web traffic for administrators who are using the gateway. Windows Admin Center allows you to use either an SSL
certificate created by a trusted third-party certification authority (CA) or a self-signed certificate. You will
receive a warning when trying to connect from a browser if you choose the latter option. As a result, we
recommend that you only use self-signed certificates for test environments.
TIP
To learn how other Microsoft customers have used Windows Admin Center to improve their productivity and reduce costs,
refer to Windows Admin Center Case Studies.

Administrative considerations
Deploying Windows Admin Center on a local Windows 10 client is great for quick-start tests, and ad-hoc or small-
scale scenarios. However, for production scenarios with multiple administrators, we recommend Windows Server.
When deployed on Windows Server, Windows Admin Center provides a centralized management hub for your
server environment with additional capabilities such as smart card authentication, conditional access, and multi-
factor authentication. For more information about controlling access to Windows Admin Center, refer to User
Access Options with Windows Admin Center.

Availability considerations
You can help ensure high availability of the Windows Admin Center gateway service by deploying it in an
active/passive model on a failover cluster. In this scenario, only one instance of the Windows Admin Center
gateway service is active. If one of the nodes in the cluster fails, the Windows Admin Center gateway service
seamlessly fails over to another node.
Caution

Deploying Windows Admin Center on a Windows 10 client computer doesn't provide high availability
because gateway functionality isn't included with the Windows 10 deployment. Deploy the Windows Admin
Center gateway to Windows Server 2016 or Windows Server 2019.
To help ensure high availability of Windows Admin Center data in case of a node failure, configure a Cluster
Shared Volume (CSV) into which Windows Admin Center will store persistent data that all the nodes in the
cluster access. You can deploy the failover cluster for Windows Admin Center gateway by using an
automated deployment script. The Install-WindowsAdminCenterHA.ps1 script automatically deploys the
failover cluster, configures IP addresses for the cluster service, installs Windows Admin Center gateway,
configures the port number for the gateway service, and configures the web service with the appropriate
certificate.

TIP
For more information about using the automated deployment script, refer to Deploy Windows Admin Center with
high availability.

Deploying Windows Admin Center gateway to an Azure VM provides additional high-availability options. For
example, by using availability sets you can provide VM redundancy and availability within an Azure
datacenter by distributing the VMs across multiple hardware nodes. You can also use availability zones in
Azure, which provide datacenter fault tolerance. Availability zones are unique physical locations that span
datacenters within an Azure region. This helps ensure that Azure VMs have independent power, cooling, and
networking. For more information on high availability, refer to Availability options for virtual machines in
Azure and High availability and disaster recovery for IaaS apps.

Manageability considerations
Access to the Windows Admin Center management tool set depends on the installation type. For example, in a
local client scenario, you would open Windows Admin Center from the Start menu and connect to it from a
client web browser by accessing https://localhost:6516 . In other scenarios where you deploy the Windows
Admin Center gateway to a Windows Server, you can connect to the Windows Admin Center gateway from a
client web browser by accessing the server name URL, such as https://servername.contoso.com , or the URL of
the cluster name.
When deployed on Windows Server, Windows Admin Center provides a centralized point of management for
your server environment. Windows Admin Center provides management tools for many common scenarios and
tools, including:
Certificate management
Event Viewer
File Explorer
Network settings
Remote Desktop Connection
Windows PowerShell
Firewall management
Registry editing
Enabling and disabling roles and features
Managing installed apps and devices
Managing scheduled tasks
Configuring local users and groups
Managing Windows services
Managing storage
Managing Windows Update
For a complete list of server management capabilities, refer to Manage Servers with Windows Admin Center.

IMPORTANT
Windows Admin Center doesn't replace all Remote Server Administration Tools (RSAT), such as Internet Information Services
(IIS), because there is no equivalent management capability for it in Windows Admin Center.

NOTE
Windows Admin Center provides a subset of Server Manager features for managing Windows 10 client PCs.

Windows Admin Center provides support for managing failover clusters and cluster resources, managing
Hyper-V VMs and virtual switches, and managing Windows Server Storage Replica. In addition to using
Windows Admin Center to manage and monitor an existing Hyper-Converged Infrastructure, you can also
use Windows Admin Center to deploy a new Hyper-Converged Infrastructure. By using a multistage
workflow, Windows Admin Center guides you through installing features, configuring networking, creating a
cluster, and deploying Storage Spaces Direct and Software-Defined Networking. For more information, refer
to Manage Hyper-Converged Infrastructure with Windows Admin Center.

NOTE
You can use Windows Admin Center to manage Microsoft Hyper-V Server 2016 or Microsoft Hyper-V Server 2019
(the free Microsoft virtualization product).

The Azure hybrid services tool in Windows Admin Center provides a centralized location for all the
integrated Azure services. You can use Azure hybrid services to protect VMs in Azure and on-premises with
cloud-based backup and disaster recovery. You can also extend on-premises capacity with storage and
compute in Azure, and you can simplify network connectivity to Azure. With the help of cloud-intelligent
Azure management services, you can centralize monitoring, governance, configuration, and security across
your applications, network, and infrastructure.

TIP
For more information on Azure integration, refer to Connecting Windows Server to Azure hybrid services.

IMPORTANT
The Azure AD app is required for Azure integration in Windows Admin Center.

Security considerations
Deploying Windows Admin Center provides your organization with a centralized management interface for your
server environment. By controlling access to Windows Admin Center, you can improve the security of your
management landscape.
Windows Admin Center provides multiple features to help secure your management platform. For starters,
Windows Admin Center gateway authentication can use local groups, Active Directory Domain Services (AD DS),
and cloud-based Azure AD. You can also enforce smart card authentication by specifying an additional required
group for smart card-based security groups. And, by requiring Azure AD authentication for the gateway, you can
use other Azure AD security features such as conditional access and Azure Multi-Factor Authentication.
Access to the Windows Admin Center gateway doesn't imply access to the target servers that are made visible
by the gateway. To manage a target server, users must connect with credentials that grant administrator
privileges on the servers they want to manage. However, some users might not need administrative access to
perform their responsibilities. In this scenario, you can use role-based access control (RBAC) in Windows Admin
Center to provide these users with limited access to servers rather than granting them full administrative access.

NOTE
If you deployed Local Administrator Password Solution (LAPS) in your environment, use LAPS credentials through
Windows Admin Center to authenticate with a target server.

In Windows Admin Center, RBAC works by configuring each managed server with a Windows PowerShell Just
Enough Administration endpoint. This endpoint defines the roles, including which parts of the system each role
can manage and which users are assigned to the roles. When a user connects to the endpoint, RBAC creates a
temporary local administrator account to manage the system on their behalf and automatically removes the
account when the user stops managing the server through the Windows Admin Center. For more information,
refer to Just Enough Administration.
Windows Admin Center also provides visibility into the management actions performed in your environment.
Windows Admin Center records events by logging actions to the Microsoft-ServerManagementExperience
event channel in the event log of the managed server. This allows you to audit actions that administrators
perform, and helps you troubleshoot Windows Admin Center issues and review usage metrics. For more
information on auditing, refer to Use event logging in Windows Admin Center to gain insight into management
activities and track gateway usage.
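For example, you could review recent management actions recorded on a managed server with a query similar to the following sketch. The log name follows the event channel mentioned above; verify it in Event Viewer under Applications and Services Logs if the query returns no results.

```powershell
# Sketch: review recent Windows Admin Center management actions logged on a managed server.
Get-WinEvent -LogName 'Microsoft-ServerManagementExperience' -MaxEvents 50 |
    Select-Object TimeCreated, Id, Message |
    Format-Table -AutoSize -Wrap
```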

DevOps considerations
Windows Admin Center was built for extensibility so that Microsoft and other developers can build tools and
solutions beyond the current offerings. Windows Admin Center provides an extensible platform in which each
connection type and tool is an extension that you can install, uninstall, and update individually. Microsoft offers a
software development kit (SDK) that enables developers to build their own tools for Windows Admin Center.
You can build Windows Admin Center extensions using modern web technologies—including HTML5, CSS,
Angular, TypeScript, and jQuery—to manage target servers via Windows PowerShell or Windows Management
Instrumentation. You can also manage target servers, services, and devices over different protocols such as
Representational State Transfer (REST) by building a Windows Admin Center gateway plugin.

TIP
For more information about using the SDK to develop extensions for Windows Admin Center, refer to Extensions for
Windows Admin Center.

Cost considerations
For on-premises deployments, Windows Admin Center costs nothing beyond the cost of the Windows operating
system, which requires valid Windows Server or Windows 10 licenses.
For Azure deployments, plan for costs associated with deploying Windows Admin Center to an Azure VM. Some
of these costs might include the VM, storage, static or public IP addresses (if enabled), and any networking
components required for integration with on-premises environments. In addition, you might need to plan for the
cost of purchasing non-Microsoft CA certificates.
Azure Arc hybrid management and deployment for
Kubernetes clusters

This reference architecture demonstrates how Azure Arc extends Kubernetes cluster management and configuration
across customer data centers, edge locations, and multiple cloud environments. You can use Azure Arc to register
Kubernetes clusters hosted outside of Microsoft Azure, and use Azure tools to manage these clusters alongside
clusters hosted in Azure Kubernetes Service (AKS).

Download a Visio file of this architecture.


Typical uses for this architecture include:
Managing on-premises Kubernetes clusters alongside clusters hosted in AKS for inventory, grouping, and
tagging.
Monitoring Kubernetes clusters across hybrid environments using Azure Monitor.
Deploying and enforcing policies for Kubernetes clusters across hybrid environments using Azure Policy.
Deploying and enforcing GitOps using Azure Policy.

Architecture
The architecture consists of the following components:
Azure Arc . Attach and configure Kubernetes clusters inside or outside of Azure by using Azure Arc enabled
Kubernetes. When a Kubernetes cluster is attached to Azure Arc, it is assigned an Azure Resource Manager ID
and a managed identity.
Azure Kubernetes Service . Host Kubernetes clusters in Azure, reducing the complexity and operational
overhead of Kubernetes cluster management.
On-premises Kubernetes clusters . Attach Cloud Native Computing Foundation (CNCF)-certified Kubernetes
clusters hosted in on-premises or third-party cloud environments.
Azure Policy . Deploy and manage policies for Arc enabled Kubernetes clusters.
Azure Monitor . Observe and monitor Arc enabled Kubernetes clusters.

Recommendations
The following sections are recommendations that apply for most scenarios. Follow these recommendations unless
you have a specific requirement that overrides them.
Cluster registration
You can register any CNCF Kubernetes cluster that is running. You'll need a kubeconfig file to access the cluster
and cluster-admin role on the cluster for deploying Arc-enabled Kubernetes agents. You'll use Azure Command-Line
Interface (Azure CLI) to perform cluster registration tasks. The user or service principal used with the az login and
az connectedk8s connect commands must have the Read and Write permissions on the
Microsoft.Kubernetes/connectedClusters resource type. The Kubernetes Cluster - Azure Arc Onboarding role has
these permissions and can be used for role assignments on either the user principal or the service principal. Helm 3
is required for onboarding the cluster using the connectedk8s extension. Azure CLI version 2.3 or later is required to
install the Azure Arc-enabled Kubernetes command-line interface extensions.
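A minimal registration sequence, run from a session that has Azure CLI, Helm 3, and a kubeconfig for the target cluster available, might look like the following sketch. The resource group, cluster name, and region are example values.

```powershell
# Sketch: register an existing CNCF-conformant cluster with Azure Arc by using Azure CLI.
# Assumes the current kubeconfig context points at the cluster and you have cluster-admin rights on it.
az login
az extension add --name connectedk8s

# Register the required resource providers once per subscription (this can take several minutes).
az provider register --namespace Microsoft.Kubernetes
az provider register --namespace Microsoft.KubernetesConfiguration

az group create --name 'arc-clusters-rg' --location 'eastus'
az connectedk8s connect --name 'onprem-cluster-01' --resource-group 'arc-clusters-rg'
```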
Azure Arc agents for Kubernetes
Azure Arc enabled Kubernetes consists of a few agents (also referred to as operators) that run in the cluster
deployed to the azure-arc namespace:
deployment.apps/config-agent . Watches the connected cluster for source control configuration resources
applied on the cluster, and updates the compliance state.
deployment.apps/controller-manager . An operator of operators that orchestrates interactions between
Azure Arc components.
deployment.apps/metrics-agent . Collects metrics from other Arc agents to ensure that these agents are
exhibiting optimal performance.
deployment.apps/cluster-metadata-operator . Gathers cluster metadata, cluster version, node count, and
Azure Arc agent version.
deployment.apps/resource-sync-agent . Syncs the previously mentioned cluster metadata to Azure.
deployment.apps/clusteridentityoperator . Maintains the Managed Service Identity (MSI) certificate used by
other agents to communicate with Azure.
deployment.apps/flux-logs-agent . Collects logs from the flux operators deployed as a part of source control
configuration.
For more information, refer to Connect an Azure Arc-enabled Kubernetes cluster.
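After onboarding, you can confirm that these agents are running by inspecting the azure-arc namespace, for example:

```powershell
# Sketch: verify that the Azure Arc agents listed above are deployed and healthy.
kubectl get deployments,pods --namespace azure-arc
```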
Monitoring clusters using Azure Monitor for containers
Monitoring your containers is critical. Azure Monitor for containers provides a rich monitoring experience for the
AKS and AKS engine clusters. You can also configure Azure Monitor for containers to monitor Azure Arc enabled
Kubernetes clusters hosted outside of Azure. This provides comprehensive monitoring of your Kubernetes clusters
across Azure, on-premises, and third-party cloud environments.
Azure Monitor for containers can provide you with performance visibility by collecting memory and processor
metrics from controllers, nodes, and containers available in Kubernetes through the Metrics application
programming interface (API). Container logs are also collected. After you enable monitoring from Kubernetes
clusters, metrics and logs are automatically collected for you through a containerized version of the Log Analytics
agent. Metrics are written to the metrics store and log data is written to the logs store associated with your Log
Analytics workspace. For more information about Azure Monitor for containers, refer to Azure Monitor for
containers overview.
You can enable Azure Monitor for containers for one or more existing deployments of Kubernetes by using either a
PowerShell or a Bash script.
To enable monitoring for Arc enabled Kubernetes clusters, refer to Enable monitoring of Azure Arc enabled
Kubernetes cluster
Using Azure Policy to apply cluster configuration
Use Azure Policy to enforce that each Microsoft.Kubernetes/connectedClusters resource or GitOps-enabled
Microsoft.ContainerService/managedClusters resource has specific
Microsoft.KubernetesConfiguration/sourceControlConfigurations applied on it. For example, you can apply
a baseline configuration to one or more clusters, or deploy specific applications to multiple clusters. To use Azure
Policy, select a definition from the Azure Policy built-in definitions for Azure Arc enabled Kubernetes, and then create
a policy assignment.
When creating the policy assignment, set the scope to an Azure resource group or subscription. Also set the
parameters for the sourceControlConfiguration that will be created. When the assignment is created, the Policy
engine will identify all connectedCluster or managedCluster resources that are located within the scope, and
then apply the sourceControlConfiguration to each.
If you are using multiple GitHub repos for each cluster (for example, one repo for the central IT/cluster operator and
other repos for application teams), activate this by using multiple policy assignments, with each policy assignment
configured to use a different Git repo.
For more information, refer to Use Azure Policy to apply cluster configurations at scale.
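As a sketch of what such an assignment might look like with the Az PowerShell module, the following example looks up a built-in definition by display name and assigns it at a resource group scope. The display-name filter, assignment name, and scope are illustrative; select the actual built-in definition you need and supply any parameters it requires.

```powershell
# Sketch: assign a built-in GitOps policy definition at a resource group scope.
# The display-name match is illustrative; pick the definition you need from the built-in list.
$definition = Get-AzPolicyDefinition -Builtin |
    Where-Object { $_.Properties.DisplayName -like '*GitOps*' } |
    Select-Object -First 1

# Definitions that require parameters can be given values with -PolicyParameterObject.
New-AzPolicyAssignment `
    -Name 'gitops-baseline' `
    -DisplayName 'Apply baseline GitOps configuration to Arc enabled clusters' `
    -Scope '/subscriptions/<subscription-id>/resourceGroups/arc-clusters-rg' `
    -PolicyDefinition $definition
```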
Deploying and enforcing GitOps using Azure Policy
GitOps is the practice of declaring the desired state of Kubernetes configuration (deployments, namespaces, and so
on) in a Git repository. This is followed by a polling and pull-based deployment of these configurations to the
cluster using an operator. Azure Policy and Flux work together to provide GitOps on Azure Arc enabled Kubernetes
clusters.
The connection between your cluster and one or more Git repositories is tracked in Azure Resource Manager as a
sourceControlConfiguration extension resource. The sourceControlConfiguration resource properties
represent where and how Kubernetes resources should flow from Git to your cluster. The
sourceControlConfiguration data is stored encrypted at rest in an Azure Cosmos DB database to ensure data
confidentiality.
The config-agent running in your cluster is responsible for watching for new or updated
sourceControlConfiguration extension resources on the Azure Arc enabled Kubernetes resource, deploying a
flux operator to watch the Git repository, and propagating any updates made to the
sourceControlConfiguration . It's even possible to create multiple sourceControlConfiguration resources with
the namespace scope on the same Azure Arc enabled Kubernetes cluster to achieve multi-tenancy. In such a case,
each operator can only deploy configurations to its respective namespace.
The Git repository can contain any valid Kubernetes resources, including Namespaces, ConfigMaps, Deployments,
and DaemonSets. It might also contain Helm charts for deploying applications. Common Git repository scenarios
include defining a baseline configuration for your organization, which might include common role-based access
control (RBAC) roles and bindings, monitoring or logging agents, or cluster-wide services.
You can also manage a larger collection of clusters, which might be deployed across heterogeneous environments.
For example, you might have one repository that defines the baseline configuration for your organization, and then
apply that to multiple Kubernetes clusters simultaneously. Azure Policy can automate creation of a
sourceControlConfiguration with a specific set of parameters on all Azure Arc enabled Kubernetes resources
under a scope (subscription or resource group).
For more information, refer to Deploy configurations using GitOps on Arc enabled Kubernetes cluster.
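For reference, you can also create a sourceControlConfiguration directly (outside of Azure Policy) with the k8s-configuration Azure CLI extension. The cluster, resource group, and repository values in the following sketch are examples.

```powershell
# Sketch: create a cluster-scoped sourceControlConfiguration that points Flux at a Git repository.
# Cluster, resource group, and repository values are examples.
az extension add --name k8s-configuration

az k8s-configuration create `
    --name 'cluster-config' `
    --cluster-name 'onprem-cluster-01' `
    --resource-group 'arc-clusters-rg' `
    --cluster-type connectedClusters `
    --scope cluster `
    --operator-instance-name 'cluster-config' `
    --operator-namespace 'cluster-config' `
    --repository-url 'https://github.com/Azure/arc-k8s-demo'
```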
Topology, network, and routing
Azure Arc agents require the following protocols/ports/outbound URLs to function:

https://management.azure.com:443. Required for the agent to connect to Azure and register the cluster.
https://[region].dp.kubernetesconfiguration.azure.com:443. Data plane endpoint for the agent to push status and fetch configuration information, where [region] represents the Azure region that hosts the AKS instance.
https://docker.io:443. Required to pull container images.
https://github.com:443 and git://github.com:9418. Example GitOps repos are hosted on GitHub. The configuration agent requires connectivity to whichever Git endpoint you specify.
https://login.microsoftonline.com:443. Required to fetch and update Azure Resource Manager tokens.
https://azurearcfork8s.azurecr.io:443. Required to pull container images for Azure Arc agents.
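You can spot-check outbound reachability to these endpoints from a node or jump host before onboarding, for example:

```powershell
# Sketch: verify outbound HTTPS reachability to the Azure Arc endpoints listed above.
$endpoints = @(
    'management.azure.com',
    'login.microsoftonline.com',
    'azurearcfork8s.azurecr.io',
    'docker.io',
    'github.com'
)

foreach ($endpoint in $endpoints) {
    Test-NetConnection -ComputerName $endpoint -Port 443 |
        Select-Object ComputerName, RemotePort, TcpTestSucceeded
}
```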

Availability considerations
In most cases, the location you select when you create the installation script should be the Azure region
geographically closest to your on-premises resources. The rest of the data will be stored within the Azure
geography containing the region you specify. This might also affect your choice of region if you have data
residency requirements. If an outage affects the Azure region to which your machine is connected, the outage
will not affect the connected machine, but management operations using Azure might not be able to complete.
For resilience in the event of a regional outage, if you have multiple locations that provide a geographically-
redundant service, it's best to connect the machines in each location to a different Azure region. For available
regions, consult Supported regions for Azure Arc enabled Kubernetes.
You should ensure that services referenced in the Architecture section are supported in the region to which
Azure Arc is deployed.

Manageability considerations
Before configuring your Azure Arc-enabled Kubernetes clusters, review the Azure Resource Manager
Subscription limits and Resource group limits to plan for the number of clusters.

DevOps Considerations
Use Helm, the open-source packaging tool, to install and manage the Kubernetes application life cycles. Similar
to Linux package managers such as APT and Yum, Helm is used to manage Kubernetes charts, which are
packages of preconfigured Kubernetes resources. For more information, refer to Deploy Helm Charts using
GitOps on Arc enabled Kubernetes cluster.

Security considerations
You can use RBAC to manage access to Azure Arc enabled Kubernetes across Azure and on-premises
environments using Azure Active Directory (Azure AD) identities. For more information, refer to Use Azure RBAC
for Kubernetes Authorization.
We recommend using a service principal with limited privileges for onboarding Kubernetes clusters to Azure
Arc. This is useful in CI/CD pipelines such as Azure Pipelines and GitHub Actions. For more information, refer to
Create an Azure Arc-enabled onboarding Service Principal.
To simplify service principal management, you can use managed identities in AKS. However, clusters must be
created using the managed identity and existing clusters (including Azure and on-premises clusters) can't be
migrated to managed identities. For more information, refer to Use managed identities in Azure Kubernetes
Service.

Cost considerations
General cost considerations are described in the Principles of cost optimization section in the Microsoft Azure
Well-Architected Framework.
Connect standalone servers by using Azure Network
Adapter

This reference architecture shows how to connect an on-premises standalone server to Microsoft Azure virtual
networks by using the Azure Network Adapter that you deploy through Windows Admin Center (WAC). Azure
Network Adapter creates a secured virtual connection over the internet, thereby extending your on-premises
network into Azure.
Download a Visio file of these architectures.

Architecture
The architecture consists of the following components:
On-premises network . This is an organization's private local area network (LAN).
Branch office . This is a private LAN in a remote branch office that connects through a corporate wide area
network (WAN).
Other cloud provider . This is a private virtual network that a cloud provider offers, which connects through a
virtual private network (VPN).
Windows Server with Windows Admin Center installed . The server that you use to deploy the Azure
Network Adapter.
Windows Server (standalone) . The server on which the Azure Network Adapter is installed. This server can
be on a branch-office network or in a different cloud provider's network.
Azure Virtual Network (VNet) . The virtual servers, and other services and components, for the Azure VPN
Gateway that reside in the same virtual network inside Azure.
Azure VPN Gateway . The VPN Gateway service that enables you to connect the virtual network to the on-
premises network or standalone servers through a VPN appliance or Azure Network Adapters. For more
information, refer to Connect an on-premises network to a Microsoft Azure virtual network. There are several
pricing tiers, or stock keeping units (SKUs), available for VPN gateways, and each SKU supports different
requirements based on the types of workloads, throughputs, features, and service-level agreements (SLAs). The
VPN gateway includes the following components:
Virtual network gateway (active) . This Azure resource provides a virtual VPN appliance for the virtual
network, and it's responsible for routing traffic back and forth between the on-premises network and the
virtual network.
Virtual network gateway (passive) . This Azure resource provides a virtual VPN appliance for the
virtual network, and it's the standby instance of the active Azure VPN Gateway. For more information,
refer to About Azure VPN gateway redundancy.
Gateway subnet . The virtual network gateway is held in its own subnet, which is subject to various
requirements that the following Recommendations section details.
Connection . The connection has properties that specify the connection type, such as Internet Protocol
security (IPsec), and the key shared with the on-premises VPN appliance to encrypt traffic.
Cloud application . This is the application that is hosted in Azure, and it can include multiple tiers with multiple
subnets that connect through Azure load balancers. For more information about the application infrastructure,
refer to Running Windows VM workloads and Running Linux VM workloads.
Internal load balancer . Network traffic from the VPN gateway is routed to the cloud application through an
internal load balancer, which is in the application's production subnet.
Azure Bastion . Azure Bastion allows you to log into VMs in the Azure virtual network by using Secure Shell
(SSH) or Remote Desktop Protocol (RDP), without exposing the VMs directly to the internet. If you lose VPN
connectivity, you can still use Azure Bastion to manage your VMs in the Azure virtual network. However, the
management of on-premises servers through Azure Bastion is not supported.

Recommendations
The following recommendations apply for most scenarios. Follow these recommendations unless you have a
specific requirement that overrides them.
Connect a standalone server
To connect a standalone server through the WAC, you must add the server to the managed-servers list in the
dedicated server's WAC installation. After you add the server to that list, you can select the server for which you
want to install the Azure Network Adapter, and then select Network from the tools followed by the "+ Add Azure
Network Adapter (Preview)" option in the Network pane.

TIP
If you do not see the "+ Add Azure Network Adapter (Preview)" option in your browser window, you might need to enlarge
your window, or you might observe an Action button with a drop-down caret. Select the drop-down caret to access the
option and add the Azure Network Adapter.

When you select the + Add Azure Network Adapter (Preview) option, the Add Azure Network Adapter
configuration blade opens in a browser window. There are several options you can configure within this blade.

NOTE
If you haven't previously authenticated from the WAC against the Azure tenant that you want to use, an authentication
dialog will appear. Provide your tenant's authentication information to proceed. The user credentials that you use to
authenticate must have sufficient permissions to create the Azure resources you'll configure during subsequent steps.

The following information is necessary:

Subscription. Select from the drop-down list. This field lists only subscriptions assigned to your tenant.
Location. Select from the drop-down list. Select an Azure region for your deployment.
Virtual Network. Select from the drop-down list, or use the provided hyperlink to create a new virtual network in the Azure portal. Depending on your selection, the field's contents will vary. If the virtual network exists, you'll observe a hyperlink that you can follow to review the virtual network in the Azure portal. If the selected VNet already contains a VNet gateway, an additional hyperlink to that Azure resource will be provided.
Gateway Subnet. Subnet prefix, such as 10.0.1.0/24. Depending on the selected virtual network, this field will vary. If the selected VNet contains no subnet labeled GatewaySubnet, the field will be prefilled with a subnet prefix that includes the address range and subnet mask. If the selected VNet already contains a VNet gateway, an additional hyperlink to that Azure resource will be provided.
Gateway SKU. Select from the drop-down list. For more information, refer to the Gateway SKUs.
Client Address Space. Subnet prefix, such as 192.168.1.0/24. The field will be prefilled with a subnet prefix that includes the address range and subnet mask. This is the network that will be used between the server to which you add the Azure Network Adapter and the Azure VPN Gateway. It must have an address range that doesn't overlap with any of the address ranges used on-premises or in any of the connected Azure virtual networks.
Authentication Certificate. Select one of the options. The "Auto-generated Self-signed root and client Certificate" option is preselected and works best in most scenarios. When you select the "Use own root and client certificate" option, you must provide two files: a root certificate (.cer) and a client certificate (.pfx), and then the password for the client certificate.
Once you complete all necessary fields, the Create button becomes active, and you should select it to begin your
Azure Network Adapter's deployment to the selected server.
The deployment process has two major parts, the first of which is the deployment and selection of the Azure VPN
Gateway. If you need to deploy your Azure VPN Gateway first, allow for 25 to 45 minutes for the deployment to
complete, as some configurations can take that long to deploy. The WAC will provide information about the
deployment's progress. The second part is the actual installation of the Azure Network Adapter, which can take 10
minutes, and the WAC will notify you about the installation progress, as well.
Once the deployment begins, you can change the focus of the WAC by selecting other tools or servers. The
deployment process continues in the background.
If you select the Auto-generated Self-signed root and client Cer tificate option, Azure creates the two
required certificates for you automatically and stores them in the selected server's certificate store. You can use the
Certificates tool in the WAC to find them, and then you can locate a root certificate in the Local Machine/Root
container. The certificate's name begins with Windows Admin Center-Created-vpngw and contains the
additional P2SRoot string. The string's tail includes a timestamp encoded with the certificate's creation date. This
certificate will also be stored in the Local Machine/CA container. The second certificate is stored in the Local
Machine/My container. The name of this certificate starts with Windows Admin Center-Created-vpngw and
contains the additional P2SClient string. The string's tail includes a timestamp encoded with the certificate's
creation date.
After the deployment finishes, the selected server's Networks tool is updated with the new Azure Network Adapter,
which automatically starts after the deployment ends and indicates an active status. You can select the adapter to
activate the More drop-down list, which you can select to disconnect or delete the adapter. On the actual server, the
Azure Network Adapter is installed as a VPN connection. The adapter's name begins with Windows Admin
CenterVPN- followed by a random three-digit number.
When the Azure Network Adapter is installed and connected, you can use this new network connection to connect
directly to the Azure VNets and their systems. This type of connection is typically used to establish a remote-
desktop session via an Azure VM's internal IP address rather than by using the VM's public IP address.
Using a dedicated WAC server
For centralized administration, we recommend that you use a dedicated Windows Admin Center installation, from
which you can add other servers. This approach means that none of the administered servers requires additional
software. For more information about WAC, refer to Microsoft docs.
Prepare a dedicated VNet
In some scenarios, the Azure Network Adapter's installation interface might not meet your needs with respect to
naming conventions or the pricing tier. To avoid this conflict, you can create the requisite Azure resources before
you deploy the adapter. During the deployment, you select the already existing resources instead of creating them
through the installation interface.

NOTE
Make sure you select the correct VPN Gateway SKU, as not all of them support the VPN connection that comes with the
Azure Network Adapter. The installation dialog offers you VpnGw1, VpnGw2, and VpnGw3. Currently, the adapter doesn't
support the zone-redundant versions of the VPN Gateway.
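The following is a minimal sketch of pre-creating the virtual network, gateway subnet, and VPN gateway with the Az PowerShell module so that you can select the existing resources during the adapter's deployment. All names, address ranges, and the region are placeholder assumptions; adjust them to match your naming conventions and pricing requirements.

```powershell
# Hypothetical names and address ranges - adjust to your environment.
$rg       = 'rg-network-adapter'
$location = 'westeurope'

# Virtual network with the mandatory GatewaySubnet.
$gwSubnet = New-AzVirtualNetworkSubnetConfig -Name 'GatewaySubnet' -AddressPrefix '10.10.255.0/27'
$vnet     = New-AzVirtualNetwork -Name 'vnet-hybrid' -ResourceGroupName $rg -Location $location `
                -AddressPrefix '10.10.0.0/16' -Subnet $gwSubnet

# Public IP address and IP configuration for the gateway.
$pip      = New-AzPublicIpAddress -Name 'pip-vpngw' -ResourceGroupName $rg -Location $location -AllocationMethod Dynamic
$subnetId = (Get-AzVirtualNetworkSubnetConfig -Name 'GatewaySubnet' -VirtualNetwork $vnet).Id
$ipConfig = New-AzVirtualNetworkGatewayIpConfig -Name 'gwipconfig' -SubnetId $subnetId -PublicIpAddressId $pip.Id

# Route-based gateway on a SKU the adapter supports (VpnGw1-VpnGw3, not zone-redundant).
New-AzVirtualNetworkGateway -Name 'vpngw-hybrid' -ResourceGroupName $rg -Location $location `
    -IpConfigurations $ipConfig -GatewayType Vpn -VpnType RouteBased -GatewaySku VpnGw1
```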

Scalability considerations
VPN Gateway SKU:
The VPN Gateway SKU you select determines how many concurrent connections the gateway supports, as well as the bandwidth that is available to all of those connections. The number of concurrent connections varies from 250 to 1,000 when you're using the P2S IKEv2/OpenVPN option. IKE refers to Internet Key Exchange. It's advisable to begin with VpnGw1 and scale up later if you need more connections. If you need to switch the VPN Gateway generation, you need to install a new gateway and deploy a new Azure Network Adapter to connect to it.
Connect multiple standalone servers:
You can use the WAC to deploy the Azure Network Adapter to as many servers as you need, and you can also add multiple Azure Network Adapters to a single server to connect to different Azure VNets.
Once the initial deployment of the VPN Gateway is complete, you can configure subsequent servers to
use the same gateway by selecting the existing gateway in the installation interface.
Standalone servers can be located on the same network, on a branch-office network, or on a different
cloud-based network. You can use the network connection that you've established, such as your corporate
WAN or a dedicated VPN to a different cloud provider, if the required network ports are available through
these connections. For more information, refer to the "Security considerations" section in this article.
Azure Site-to-Site connection:
The Azure Network Adapter is a single installation on a single server, but if you want to connect several
servers, you could face a significant administrative effort. However, you can avoid this effort by
connecting your on-premises systems using the Azure Site-to-Site (S2S) connection method, which
connects an existing on-premises network to an Azure VNet and its subnets. At this connection's core is
an Azure VPN Gateway through which you can connect a local, on-premises VPN gateway with the
remote Azure VPN Gateway. This secure connection allows the two network segments to communicate
transparently with each other.

Availability considerations
The Azure Network Adapter only supports an active-passive configuration of the Azure VPN Gateway. During
the adapter's configuration, you can point to an existing active-active Azure VPN gateway. The setup will
reconfigure the gateway to the active-passive configuration. A manual gateway reconfiguration to the active-
active state is possible, but the Azure Network Adapter won't connect to this gateway.

WARNING
Configuring an Azure Network Adapter against an existing Azure VPN Gateway with an active-active configuration will
reconfigure the gateway to active-passive. This will impact all existing VPN connections to this gateway. Changing
from active-active configuration to active-standby configuration will cause a drop of one of the two IPsec VPN tunnels
for each connection. Do not proceed without evaluating your overall connection requirements and consulting with
your network administrators.

Manageability considerations
Administrative account:
The WAC is the core tool that you use to deploy the Azure Network Adapter and configure account
handling. For more information about the available options, refer to User access options with
Windows Admin Center. You can configure an individual account per server connection.

NOTE
The dialog box in which you configure the administrative account per server will validate your credentials when
you select Continue . To open the dialog box, in the WAC, select the row with the applicable server name, and
then select Manage as . Do not select the hyperlink that represents the server, as it'll connect you to that
server immediately.

Additionally, you must configure a user account for the Azure connection by opening the Settings
dialog box in the WAC and modifying the account section. You also can switch users or log out of a
user's session in the Settings dialog box.
Azure Recovery Vault integration:
When you install the Azure Network Adapter on a standalone server, you can consider that server part of your business continuity strategy. You can integrate that server into your backup and disaster-recovery
procedures by using Azure Recovery Vault services that you configure by selecting Azure Backup in the
Tools section of the WAC. Azure Backup helps protect your Windows server from corruptions, attacks, or
disasters by backing up your server directly to Microsoft Azure.

Security considerations
Required network ports:
Network ports for PowerShell remoting must be open if you want to use the WAC to deploy the Azure
Network Adapter.
PowerShell Remoting uses Windows Remote Management (WinRM). For more information, refer to
PowerShell Remoting Security Considerations and PowerShell Remoting default settings.
In some scenarios, you're required to use additional authentication methods. WAC can use PowerShell in
conjunction with the Credential Security Support Provider protocol (CredSSP) to connect to remote
servers. For more information, refer to PowerShell Remoting and CredSSP and how Windows Admin
Center uses CredSSP.
PowerShell Remoting (and WinRM) use the following ports:

| Protocol | Port |
| :-- | :-- |
| HTTP | 5985 |
| HTTPS | 5986 |
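A quick way to confirm that a target server is reachable on these ports before you add it to the WAC is shown in the following sketch; the server name is a placeholder.

```powershell
# Check that PowerShell remoting (WinRM) is reachable on a target server before adding it to the WAC.
# 'server01.contoso.local' is a placeholder name.
Test-NetConnection -ComputerName 'server01.contoso.local' -Port 5985   # HTTP listener
Test-NetConnection -ComputerName 'server01.contoso.local' -Port 5986   # HTTPS listener, if configured

# Alternatively, validate the WinRM endpoint itself.
Test-WSMan -ComputerName 'server01.contoso.local'
```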
How you connect to the server on which the Windows Admin Center (WAC) is installed depends on your
WAC's installation type. The default port varies and can be port 6516 when installed on Windows 10 or
port 443 when installed on Windows Server. For more information, refer to Install Windows Admin
Center.
Azure Security Center integration:
To help protect the server on which the Azure Network Adapter is installed, you can integrate the server
to Azure Security Center by selecting Azure Security Center from the Tools section in WAC. During the
integration, you must select an existing Azure Log Analytics workspace or create a new one. You'll be
billed separately for each server that you integrate with Azure Security Center. For more information,
refer to Azure Security Center pricing.

DevOps considerations
Azure Automation:
The WAC gives you access to the PowerShell code that creates the Azure Network Adapter, and you can
review it by selecting the Network tool, and then selecting the View PowerShell scripts icon at the top
of the WAC page. The script's name is Complete-P2SVPNConfiguration , and it's implemented as a
PowerShell function. The code is digitally signed and ready to be reused, and you can integrate it into
Azure Automation by configuring additional services inside the Azure portal.
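As one possible approach, you could export a reviewed copy of the script and publish it as an Azure Automation runbook. The following sketch assumes hypothetical automation account, resource group, and file path names.

```powershell
# Hypothetical example: publish a locally exported copy of the script as a runbook.
# Account, resource group, and file path are placeholders.
Import-AzAutomationRunbook -ResourceGroupName 'rg-automation' `
    -AutomationAccountName 'aa-hybrid' `
    -Name 'Complete-P2SVPNConfiguration' `
    -Path 'C:\Scripts\Complete-P2SVPNConfiguration.ps1' `
    -Type PowerShell

Publish-AzAutomationRunbook -ResourceGroupName 'rg-automation' `
    -AutomationAccountName 'aa-hybrid' `
    -Name 'Complete-P2SVPNConfiguration'
```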

Cost considerations
Azure Pricing Calculator:
The Azure Network Adapter itself doesn't incur any charges, because it's a component that you deploy to an on-premises system. The Azure VPN Gateway, as part of the solution, does generate additional costs,
as does the use of additional services, such as Azure Recovery Vault or Azure Security Center. For more
information about actual costs, refer to the Azure Pricing Calculator. It's important to note that actual
costs vary by Azure region and your individual contract. Contact a Microsoft sales representative for
additional information about pricing.
Egress costs:
There are additional costs associated with outbound Inter-VNet data transfers, and they're dependent on
your VPN Gateway's SKU and the actual amount of data you're using. For more information, refer to the
Azure Pricing Calculator. It's important to note that actual costs vary by Azure region and your individual
contract. Contact a Microsoft sales representative for additional information about pricing.
Disaster Recovery for Azure Stack Hub virtual
machines
11/2/2020 • 36 minutes to read

This document describes the architecture and design considerations of a solution that delivers an optimized
approach to disaster recovery of virtual machine (VM)-based user workloads hosted on Azure Stack Hub.

Download a Visio file of this architecture.


Azure Stack Hub includes self-healing functionality, providing auto-remediation in a range of scenarios involving
localized failures of its components. However, large-scale failures, including outages affecting server racks or site-
level disasters, require additional considerations. These considerations should be part of the business continuity
and disaster recovery strategy for VM-based user workloads. This strategy must also account for recovery of the
Azure Stack infrastructure, which is separate from workload recovery.
Traditional, on-premises workload recovery solutions are complex to configure, expensive and labor-intensive to
maintain, and challenging to automate, especially when using another on-premises location as the failover site.
Microsoft recommends an alternative solution that combines cloud and on-premises components to deliver a resilient, highly automated disaster recovery strategy that is straightforward to manage, secure, and cost-efficient. The core element of this solution is the Microsoft Azure Recovery Services offering, with the failover site residing in Azure.
Azure Site Recovery with Azure as the failover site eliminates all of these drawbacks. You can use its capabilities to
protect both physical and virtual servers, including those running on either Microsoft Hyper-V or VMware ESXi
virtualization platforms. You also have the option to leverage the same capabilities to facilitate recovery of
workloads running on Azure Stack Hub VMs.

Architecture of the proposed solution


The cloud components of the proposed solution include the following services:
An Azure subscription hosting all cloud resources that are part of this solution.
An Azure Active Directory (Azure AD) tenant associated with the Azure subscription that provides
authentication of Azure AD security principals to authorize access to Azure resources.
An Azure Recovery Services vault in the Azure region closest to an on-premises datacenter that will host the
Azure Stack Hub deployment.

NOTE
The choice of the Azure region which is closest to the on-premises datacenter is specific to the sample scenario
included in this reference architecture document. From a disaster recovery standpoint, it would be preferable to select
an Azure region further away from the location hosting the production environment. The decision, however, might
depend on additional factors, such as the need to minimize latency of regional data feeds or to satisfy data residency
requirements.

An Azure ExpressRoute circuit connecting the on-premises datacenters to the Azure region hosting the Azure
Recovery Services vault, configured with private peering and Microsoft peering. The former ensures that the
latency requirements following a failover during disaster recovery scenarios are satisfied. The purpose of the
latter is to minimize the amount of time it takes to replicate changes between the on-premises workloads
and the failover site in Azure.
An Azure Storage account that hosts blobs containing VHD files created by replication of the operating
system and data volumes of protected Azure Stack Hub VMs. These VHD files serve as the source for
managed disks of Azure VMs which are automatically provisioned following a failover.
An Azure virtual network that will host the disaster recovery environment, configured in a manner that
mirrors the virtual network environment in Azure Stack Hub hosting the production workloads, including
components such as load balancers and network security groups. This virtual network is typically connected
to the Azure Stack Hub virtual network via an ExpressRoute connection to facilitate workload-level recovery.

NOTE
A site-to-site VPN connection might be sufficient in scenarios where Recovery Point Objectives (RPOs) requirements
are less stringent.

An isolated Azure virtual network intended for test failovers, configured in a manner that mirrors the virtual
network environment in Azure Stack Hub hosting the production workloads, including components such as
load balancers and network security groups.
The on-premises components of the proposed solution include the following services:
An Azure Stack Hub integrated system in the connected deployment model, running the current update
(2002 as of 9/20), and located within the customer's on-premises datacenter.
An Azure Stack Hub subscription and a virtual network or multiple peered virtual networks hosting all on-
premises VMs that are part of this solution.
Azure Site Recovery configuration and process servers, running on Windows Server 2016 or 2012 R2 Azure Stack Hub VMs, managing communications with the Azure Recovery Services vault and the routing,
optimization, and encryption of replication traffic.

NOTE
By default, a configuration server hosts a single process server. You have the option to deploy dedicated process
servers to accommodate a larger volume of replication traffic.

Azure Stack Hub VMs to be protected, running supported versions of Windows Server, CentOS, or Ubuntu
operating systems.
Azure Site Recovery Mobility service (also referred to as mobility agent) installed and running on protected
VMs, which tracks changes to local disks, records them into replication logs, and replicates the logs to the
process server, which, in turn, routes them to the target Azure storage account. The logs are used to create
recovery points for managed disks implemented by using blobs stored in the Azure storage account you
designated.

Core functionality
Azure Site Recovery is a disaster recovery solution that facilitates protection of physical and virtual computers by
providing two sets of features:
Replication of changes to computer disks between the production and disaster recovery locations
Orchestration of failover and failback between these two locations
Azure Site Recovery offers three types of failovers:
Test failover. This failover gives you the opportunity to validate your Azure Site Recovery configuration in an
isolated environment, without any data loss or impact to the production environment.
Planned failover. This failover gives you the option to initiate disaster recovery without data loss, typically as part
of planned downtime.
Unplanned failover. This failover serves as the last resort in case of an unplanned outage affecting availability of
the primary site and potentially resulting in data loss.
Azure Site Recovery supports several scenarios, such as failover and failback between two on-premises sites,
failover and failback between two Azure regions, and migration from third party provider's clouds. However, in the
context of this reference architecture document, the focus is on replication of local disks of Azure Stack Hub VMs to
Azure Storage, and on VM failover and failback between an Azure Stack Hub stamp and an Azure region.

NOTE
The Site Recovery scenario which involves replicating between on-premises VMware-based or physical datacenters reaches its
end of service on December 31, 2020.

NOTE
There is no support for Azure Site Recovery between two deployments of Azure Stack Hub.

Details of Azure Site Recovery architecture and its components depend on a number of criteria, including:
The types of computers to be protected (physical versus virtual).
For virtualized environments, the type of hypervisor hosting the virtual machines to be protected (Hyper-V
versus VMware ESXi).
For Hyper-V environments, the use of System Center Virtual Machine Manager (SCVMM) for management of
Hyper-V hosts.
With Azure Stack Hub, the architecture matches the one applicable to physical computers. This isn't particularly
surprising, because in both cases, Azure Site Recovery can't benefit from direct access to a hypervisor. Instead, the
mechanism that tracks and replicates changes to local disks is implemented within the protected operating system.
NOTE
Incidentally, this is also the reason that you need to select Physical machines as the Machine type when configuring
replication of Azure Stack Hub VMs in the Azure Site Recovery interface within the Azure portal. Another implication is a
unique approach to failback, which doesn't offer the same degree of automation as the one available in Hyper-V or ESXi-
based scenarios.

To accomplish this, Azure Site Recovery relies on the Site Recovery Mobility service (also referred to as mobility
agent), which is automatically deployed to individual VMs as you enroll them into the scope of Azure Site Recovery-
based protection. On each protected VM, the locally installed instance of the mobility agent continuously monitors
and forwards changes to the operating system and data disks to the process server. However, to optimize and
manage the flow of replication traffic originating from protected VMs, Azure Site Recovery implements an
additional set of components running on a separate VM, referred to as the configuration server.
The configuration server coordinates communications with the Azure Site Recovery vault and manages data
replication. In addition, the configuration server hosts a component referred to as the process server, which acts as
a gateway, receiving replication data, optimizing it through caching and compression, encrypting it, and finally
forwarding it to Azure Storage. Effectively, the configuration server functions as the integration point between
Azure Site Recovery and protected VMs running on Azure Stack Hub. You implement that integration by deploying
the configuration server and registering it with the Azure Recovery Services vault.
As part of Azure Site Recovery configuration, you define the intended disaster recovery environment, including
such infrastructure components as virtual networks, load balancers, or network security groups in the manner that
mirrors the production environment. The configuration also includes a replication policy, which determines
recovery capabilities and consists of the following parameters:
RPO threshold. This setting represents the desired recovery point objective that you want to implement and
determines the frequency in which Azure Site Recovery generates crash-consistent recovery point snapshots.
Its value doesn't affect the frequency of replication because that replication is continuous. Azure Site
Recovery will generate an alert, and optionally, an email notification, if the current effective RPO provided by
Azure Site Recovery exceeds the threshold that you specify. Azure Site Recovery generates crash-consistent
recovery point snapshots every five minutes.

NOTE
A crash-consistent snapshot captures the data that was on the disk when the snapshot was taken. It doesn't include
memory content. Effectively, a crash-consistent snapshot doesn't guarantee data consistency for the operating system
or locally installed apps.

Recovery point retention. This setting represents the duration (in hours) of the retention window for each
recovery point snapshot. Protected VMs can be recovered to any recovery point within a retention window.
Azure Site Recovery supports up to 24 hours of retention for VMs replicated to Azure Storage accounts with
the premium performance tier. There is a 72-hour retention limit when using Azure Storage accounts with
the standard performance tier.
App-consistent snapshot frequency. This setting determines the frequency (in hours) in which Azure Site
Recovery generates application-consistent snapshots. An app-consistent snapshot represents a point-in-time
snapshot of applications running in a protected VM. There is a limit of 12 app-consistent snapshots. For VMs
running Windows Server, Azure Site Recovery leverages Volume Shadow Copy Service (VSS). Azure Site
Recovery also supports app-consistent snapshots for Linux, but that requires implementing custom scripts.
The scripts are used by the mobility agent when applying an app-consistent snapshot.
NOTE
For details regarding implementing app-consistent snapshots on Azure Stack Hub VMs running Linux, refer to General
questions about Azure Site Recovery.

For each disk of a protected Azure Stack Hub VM that you designate, data are replicated to a corresponding
managed disk in Azure Storage. The disk stores the copy of the source disk and all the recovery point crash-
consistent and app-consistent snapshots. As part of a failover, you choose a recovery point crash-consistent or app-
consistent snapshot that should be used when attaching the managed disk to the Azure VM, which serves as a
replica of the protected Azure Stack Hub VM.
During regular business operations, protected workloads run on Azure Stack Hub VMs, with changes to their disks
being continuously replicated through interactions among the mobility agent, process server, and configuration
server to the designated Azure Storage account. When you initiate a test, planned, or unplanned failover, Azure Site
Recovery automatically provisions Azure VMs using the replicas of disks of the corresponding Azure Stack Hub
VMs.

NOTE
The process of provisioning Azure VMs by using Azure Site Recovery-replicated disks is referred to as hydration.

You have the option to orchestrate a failover by creating recovery plans that contain manual and automated steps.
To implement the latter, you can leverage Azure Automation runbooks, which consist of custom PowerShell scripts,
PowerShell workflows, or Python 2 scripts.
After the primary site becomes available again, Azure Site Recovery supports reversing the direction of replication,
allowing you to perform a failback with minimized downtime and without data loss. However, with Azure Stack Hub,
this approach isn't available. Instead, to fail back, it's necessary to download Azure VM disk files, upload them into
Azure Stack Hub, and attach them to existing or new VMs.

Prerequisites
Implementing the recommended solution is contingent on satisfying the following prerequisites:
Access to an Azure subscription, with permissions sufficient to provision and manage all cloud components
of the Azure Site Recovery components, including:
Azure Recovery Services vault in the Azure region designated as the disaster recovery site for the Azure
Stack Hub production environment.
An Azure Storage account hosting content of replicated disks of Azure Stack Hub VMs.
An Azure virtual network representing the disaster recovery environment to which hydrated Azure VMs
will be connected following a planned or unplanned failover.
An isolated Azure virtual network representing the test environment to which hydrated Azure VMs will be
connected following a test failover.
An Azure ExpressRoute-based connectivity between the on-premises environment, Azure virtual
networks, and the Azure storage account used for hosting copies of VHD files with content replicated
from disks of protected Azure Stack Hub VMs.
An Azure Stack Hub user subscription. All Azure Stack Hub VMs protected by an individual Azure Site
Recovery configuration server must belong to the same Azure Stack Hub user subscription.
An Azure Stack Hub virtual network. All protected VMs must have direct connectivity to the VMs hosting the
process server component (by default this is the configuration server VM).
An Azure Stack Hub Windows Server VM that will host the configuration server and a process server. The VM
must belong to the same subscription and be attached to the same virtual network as the Azure Stack Hub
VMs that need to be protected. In addition, the VM needs to:
comply with Site Recovery configuration server software and hardware requirements
satisfy external connectivity network requirements documentation.

NOTE
Additional storage and performance considerations for the configuration and process servers are described in more
detail later in this architecture.

satisfy internal connectivity requirements. In particular, Azure Stack Hub VMs that you want to protect
need to be able to communicate with:
the configuration server via TCP port 443 (HTTPS) inbound for replication management
the process server via TCP port 9443 to deliver replication data.

NOTE
You can change the port used by the process server for both external and internal connectivity as part of its
configuration when running Azure Site Recovery Unified Setup.
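A simple way to validate this internal connectivity from a VM that you plan to protect is shown below; the configuration server name is a placeholder, and port 9443 is the default replication data port.

```powershell
# Run from an Azure Stack Hub VM that you plan to protect.
# 'cs-asr01' is a placeholder for the configuration/process server name.
Test-NetConnection -ComputerName 'cs-asr01' -Port 443    # replication management (HTTPS)
Test-NetConnection -ComputerName 'cs-asr01' -Port 9443   # replication data (default, configurable in Unified Setup)
```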

Azure Stack Hub VMs to be protected, running any of the supported operating systems. To protect Azure Stack Hub VMs that are running Windows Server operating systems, you must complete the following steps (a PowerShell sketch of the Windows-specific settings follows this list):
Create a Windows account with administrative rights. You can specify this account when you enable Azure
Site Recovery on these VMs. The process server uses this account to install the Azure Site Recovery
Mobility service. In a workgroup environment, make sure to disable remote User Account Control (UAC) restrictions on target
Windows Server operating systems by setting the value of the LocalAccountTokenFilterPolicy
DWORD registry entry in the
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System key
to 1.
Enable File and Printer Sharing and Windows Management Instrumentation rules in Windows Defender
firewall.
To protect Azure Stack Hub VMs that are running Linux operating systems, you must:
Create a root user account. You can specify this account when you enable Azure Site Recovery on these
VMs. The process server uses this account to install the Azure Site Recovery Mobility service.
Install the latest openssh, openssh-server, and openssl packages.
Enable and run the Secure Shell (SSH) service on port 22.
Enable the Secure FTP (SFTP) subsystem and password authentication in the SSH configuration.
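The following is a minimal PowerShell sketch of the Windows-specific preparation steps listed above, intended to be run in an elevated session on a workgroup VM that you want to protect.

```powershell
# Prepare a workgroup Windows Server VM for push installation of the Azure Site Recovery Mobility service.

# Disable remote UAC token filtering so the process server can use a local administrative account.
New-ItemProperty -Path 'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System' `
    -Name 'LocalAccountTokenFilterPolicy' -Value 1 -PropertyType DWord -Force

# Enable the required Windows Defender Firewall rule groups.
Enable-NetFirewallRule -DisplayGroup 'File and Printer Sharing'
Enable-NetFirewallRule -DisplayGroup 'Windows Management Instrumentation (WMI)'
```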

High level implementation steps


At a high level, the implementation of Azure Site Recovery-based disaster recovery on Azure Stack Hub consists of
the following stages:
1. Prepare Azure Stack Hub VMs to be protected by Azure Site Recovery. Ensure that the VMs satisfy Azure Site
Recovery prerequisites listed in the previous section.
2. Create and configure an Azure Recovery Services vault. Set up an Azure Recovery Services vault and specify
what you want to replicate. Azure Site Recovery components and activities are configured and managed by
using the vault.
3. Set up the source replication environment. Provision an Azure Site Recovery configuration server and
process server by installing Azure Site Recovery Unified Setup binaries and register it with the vault.

NOTE
You can rerun the Azure Site Recovery Unified Setup to implement additional process servers on Azure Stack Hub
VMs.

4. Set up the target replication environment. Create or select an existing Azure storage account and an Azure
virtual network in the Azure region that will host the disaster recovery site. During replication, the content of
the disks for the protected Azure Stack Hub VMs is copied to the Azure Storage account. During failover,
Azure Site Recovery automatically provisions Azure VMs serving as replicas of protected Azure Stack Hub
VMs and connects them to the Azure virtual network.
5. Enable replication. Configure replication setting and enable replication for Azure Stack Hub VMs. The
mobility service is installed automatically on each Azure Stack Hub VM for which replication is enabled.
Azure Site Recovery initiates replication of each Azure Stack Hub VM, according to the policy settings you
defined.
6. Perform a test failover. After replication is established, verify that failover will work as expected by
performing a test failover.
7. Perform a planned or unplanned failover. Following a successful test failover, you are ready to conduct either
a planned or unplanned failover to Azure. You have the option to designate which Azure Stack Hub VMs to
include in the failover.
8. Perform a failback. When you are ready to fail back, stop the Azure VMs corresponding to the Azure Stack Hub VMs you failed over, download their disk files to on-premises storage, upload them into Azure Stack Hub,
and attach them to an existing or new VM.

Availability considerations
Azure Stack Hub helps increase workload availability through resiliency inherent to its infrastructure. This resiliency
provides high availability for Azure Stack Hub VMs protected by Azure Site Recovery and to essential components
of the on-premises Azure Site Recovery infrastructure, including the configuration and process servers.
Similarly, you have the option to leverage resiliency of cloud-based components of Azure Site Recovery
infrastructure. By default, Azure Recovery Services is geo-redundant, which means that its configuration is
automatically replicated to an Azure region that is part of a pre-defined region pair. You have the option to change
the replication settings to locally redundant if that is sufficient for your resiliency needs. Note that you can't change
this option if the vault contains any protected items. The same resiliency option is available for any Azure Storage
accounts with the standard performance tier, although it's possible to change it at any point.
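If locally redundant storage is sufficient for your resiliency needs, the following sketch shows how you might review and change the vault's storage redundancy with the Az PowerShell module. The vault and resource group names are placeholders, and the change is only possible while the vault contains no protected items.

```powershell
# Review and, if acceptable for your resiliency needs, switch the vault's storage redundancy.
# Names are placeholders; this only works while the vault contains no protected items.
$vault = Get-AzRecoveryServicesVault -ResourceGroupName 'rg-asr' -Name 'rsv-stackhub-dr'

Get-AzRecoveryServicesBackupProperty -Vault $vault

Set-AzRecoveryServicesBackupProperty -Vault $vault -BackupStorageRedundancy LocallyRedundant
```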

NOTE
For the listing of Azure region pairs, refer to Business continuity and disaster recovery (BCDR): Azure Paired Regions.

You can further enhance the degree of this resiliency by designing and implementing solutions whose purpose is to
extend the scope of workload protection. This is the added value provided by Azure Site Recovery. In the context of
Azure Site Recovery running on Azure Stack Hub, there are two main aspects of workload availability that need to
be explored in more detail:
Failover to Azure
Failback to Azure Stack Hub
You need to consider both when developing a disaster recovery strategy driven by recovery point objectives (RPOs)
and recovery time objectives (RTOs). RTO and RPO represent continuity requirements stipulated by individual
business functions within an organization. RPO designates a time period representing maximum acceptable data
loss following an incident that affected availability of that data. RTO designates the maximum acceptable duration of
time it can take to reinstate business functions following an incident that affected the availability of these functions.
Failover to Azure
It's self-evident that failover to Azure is at the core of availability considerations in the context of Azure Site
Recovery-based protection of Azure Stack Hub VMs. To maximize workload availability, the failover strategy should
address both the need to minimize potential data loss (RPO) and minimize failover time (RTO).
To minimize potential data loss, you might consider:
Maximizing throughput and minimizing latency of the replication traffic by following scalability and
performance considerations. For more information, refer to the next section of this reference architecture
document.
Increasing the frequency of app-consistent recovery points for database workloads (up to the maximum of one
recovery point per hour). App-consistent recovery points are created from app-consistent snapshots. App-
consistent snapshots capture app data on disk and in memory. While this approach minimizes potential data
loss, it has one major drawback. App-consistent snapshots require the use of Volume Shadow Copy Service on
Windows or custom scripts on Linux, to quiesce locally installed apps. The capture process can have negative
impacts on performance, especially if resource utilization is high. We don't recommend configuring frequent
app-consistent snapshots for non-database workloads.
The primary method of minimizing failover time involves the use of Azure Site Recovery recovery plans. A recovery
plan orchestrates a failover between the primary and secondary sites, defining the sequence in which protected
servers fail over. You can customize a plan by adding manual instructions and automated tasks. Its purpose is to
make the process consistent, accurate, repeatable, and automated.
When creating a recovery plan, you assign protected servers to recovery groups for the purpose of failover. Servers
in each group fail over together. This helps you to divide the failover process into smaller, easier to manage units,
representing sets of servers which can fail over without relying on external dependencies.
To minimize failover time, as part of creating a recovery plan, you should:
Define groups of Azure Stack Hub VMs that should fail over together.
Define dependencies between groups of Azure Stack Hub VMs to determine the optimal sequence of a failover.
Automate failover tasks, if possible.
Include custom manual actions, if required.

NOTE
A single recovery plan can contain up to 100 protected servers.

NOTE
In general, recovery plans can be used for both failover to and failback from Azure. This doesn't apply to Azure Stack Hub,
which doesn't support Azure Site Recovery-based failback.

You define a recovery plan and create recovery groups to capture app-specific properties. As an example, let's
consider a traditional three-tier app with a SQL Server-based backend, a middleware component, and a web
frontend. When creating a recovery plan, you can control the startup order of servers in each tier, with the servers
running SQL Server instances coming online first, followed by those in the middleware tier, and joined afterwards
by servers hosting the web frontend. This sequence ensures that the app is working by the time the last server
starts. To implement it, you can simply create a recovery plan with three recovery groups, containing servers in the
respective tiers.
In addition to controlling failover and startup order, you also have the option to add actions to a recovery plan. In
general, there are two types of actions:
Automated. This action is based on Azure Automation runbooks, which involves one of two types of tasks:
Provisioning and configuring Azure resources, including, for example, creating a public IP address and
associating it with the network interface attached to an Azure VM.
Modifying the configuration of the operating system and applications running within an Azure VM that
was provisioned following a failover.
Manual. This action doesn't support automation and is included in a recovery plan primarily for documentation
purposes.
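As an illustration of an automated action, the following is a minimal runbook sketch that attaches a public IP address to the network interface of a failed-over Azure VM. The resource names are placeholder parameters; a production runbook would typically read them from the recovery plan context object that Site Recovery passes to the runbook.

```powershell
# Minimal Azure Automation runbook sketch: attach a public IP to the NIC of a failed-over VM.
# Resource names are placeholders.
param (
    [string]$ResourceGroupName = 'rg-asr-failover',
    [string]$VmName            = 'vm-web01'
)

$vm  = Get-AzVM -ResourceGroupName $ResourceGroupName -Name $VmName
$nic = Get-AzNetworkInterface -ResourceId $vm.NetworkProfile.NetworkInterfaces[0].Id

$pip = New-AzPublicIpAddress -Name "$VmName-pip" -ResourceGroupName $ResourceGroupName `
          -Location $vm.Location -AllocationMethod Static

# Associate the public IP with the first IP configuration and apply the change.
$nic.IpConfigurations[0].PublicIpAddress = $pip
Set-AzNetworkInterface -NetworkInterface $nic
```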

NOTE
To determine the failover time of a recovery plan, perform a test failover and then examine the details of the corresponding
Site Recovery job.

NOTE
To address the RTO requirements for Azure Stack Hub workloads, you should account for recovery of the Azure Stack
infrastructure, user VMs, applications, and user data. In the context of this reference architecture document, we are interested
only in the last two of these components, although we also present considerations regarding the availability of the Modern
Backup Storage functionality.

Failback to Azure Stack Hub


In Azure Site Recovery-based scenarios, failback, if properly implemented, doesn't involve data loss. This means that
the focus of the failover strategy is to minimize failback time (RTO). However, as previously mentioned, when failing
back to Azure Stack Hub, you can't rely on your recovery plans. Instead, the failback involves the following sequence
of steps:
1. Stop and deallocate Azure VMs in the disaster recovery environment.
2. Identify the URI parameter of each of the managed disks attached to the VMs you intend to download.
3. Download the virtual hard disk (VHD) files identified by the URI parameters you identified in the previous step to
your on-premises environment.
4. Upload the VHD files to Azure Stack Hub.
5. Attach the uploaded VHDs to new or existing Azure Stack Hub VMs.
6. Start the Azure Stack Hub VMs.
The optimal approach to minimizing the failback time is to automate it.
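The following sketch illustrates the first part of such an automated failback: stopping the failed-over Azure VM and generating short-lived, read-only SAS URIs for its managed disks so that the VHD files can be downloaded on-premises. All names are placeholders. The downloaded VHDs can then be uploaded to Azure Stack Hub (for example, with Add-AzVhd against the Azure Stack Hub environment) and attached to new or existing VMs.

```powershell
# Sketch of the first failback steps: stop the Azure VM, then generate a read-only SAS URI
# for each managed disk so the VHD can be downloaded on-premises. Names are placeholders.
$rg = 'rg-asr-failover'
$vm = Get-AzVM -ResourceGroupName $rg -Name 'vm-web01'

Stop-AzVM -ResourceGroupName $rg -Name $vm.Name -Force

$diskNames = @($vm.StorageProfile.OsDisk.Name) + ($vm.StorageProfile.DataDisks | ForEach-Object { $_.Name })
foreach ($diskName in $diskNames) {
    $access = Grant-AzDiskAccess -ResourceGroupName $rg -DiskName $diskName -Access Read -DurationInSecond 14400
    # Download with AzCopy from a machine with access to on-premises storage, for example:
    #   azcopy copy "$($access.AccessSAS)" "D:\Failback\$diskName.vhd"
}
```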

NOTE
For more information regarding automating the failback procedure described in this section, refer to Create VM disk storage
in Azure Stack Hub.
NOTE
For more information regarding identifying the URI parameter of managed disks, refer to Download a Windows VHD from
Azure.

Workload-specific considerations
Azure Site Recovery integrates with Windows Server-based apps and roles, including SharePoint, Exchange, SQL
Server, and Active Directory Domain Services (AD DS). This allows you to leverage the following capabilities to
implement app-level protection and recovery:
Integration with app-level replication technologies, such as SQL Server AlwaysOn Availability Groups, Exchange
Database Availability Groups (DAGs), and AD DS replication
App-consistent snapshots, for single or multiple tier applications
A rich automation library that provides production-ready, application-specific scripts that can be downloaded
and integrated with recovery plans
Alternatively, you have the option to use workload-specific replication mechanisms to provide site-level resiliency.
This is a commonly used option when implementing disaster recovery for AD DS domain controllers, SQL Server, or
Exchange, all of which natively support replication. Though this requires provisioning Azure VMs hosting these
workloads in the disaster recovery environment, which increases the cost, it offers the following benefits:
Reduces time required for failover and failback
Simplifies workload-level failover, accommodating scenarios in which site-level failover isn't required

NOTE
For more information regarding Azure Site Recovery workload-specific considerations, refer to About disaster recovery for on-
premises apps.

Scalability and performance considerations


When planning to deploy Azure Site Recovery on Azure Stack Hub, you need to consider the amount of processing,
storage, and network resources allocated to the configuration and process servers. You might need to adjust the
estimated sizing of the Azure Stack Hub VM hosting the Azure Site Recovery components post deployment to
accommodate changes in processing or storage requirements. You have three basic options to adjust the sizing:
Implement vertical scaling. This involves modifying the amount and type of processor, memory, and disk
resources of the Azure Stack Hub VM hosting the configuration server including the process server. To
estimate resource requirements, you can use the information in the following table:
Table 1: Configuration and process server sizing requirements

| CPU | Memory | Cache disk | Data change rate | Protected machines |
| :-- | :-- | :-- | :-- | :-- |
| 8 vCPUs (2 sockets * 4 cores @ 2.5 GHz) | 16 GB | 300 GB | 500 GB or less | Fewer than 100 machines |
| 12 vCPUs (2 sockets * 6 cores @ 2.5 GHz) | 18 GB | 600 GB | 500 GB to 1 TB | 100 to 150 machines |
| 16 vCPUs (2 sockets * 8 cores @ 2.5 GHz) | 32 GB | 1 TB | 1 TB to 2 TB | 150 to 200 machines |
Implement horizontal scaling. This involves provisioning or deprovisioning Azure Stack Hub VMs with the
process server installed to match processing demands of protected Azure Stack Hub VMs. In general, if you
have to scale your deployment to more than 200 source machines, or you have a total daily churn rate of
more than two terabytes (TB), you need additional process servers to handle replication traffic. To estimate
the number and configuration of additional process servers, refer to Size recommendations for the process
server.
Modify replication policy. This involves changing parameters of the replication policy, with focus on app-
consistent snapshots.
From the networking standpoint, there are several different methods to adjust bandwidth available for replication
traffic:
Modify VM size. The size of Azure Stack Hub VMs determines the maximum network bandwidth. However,
it's important to note that there are no bandwidth guarantees. Instead, VMs can utilize the amount of
available bandwidth up to the limit determined by their size.
Replace uplink switches. Azure Stack Hub systems support a range of hardware switches, offering several
choices of uplink speeds. Each Azure Stack Hub cluster node has two uplinks to the top of rack switches for
fault tolerance. The system allocates half of the uplink capacity for critical infrastructure. The remainder is
shared capacity for Azure Stack Hub services and all user traffic. Systems deployed with faster speeds have
more bandwidth available for replication traffic.

NOTE
While it's possible to segregate network traffic by attaching a second network adapter to a server, with Azure Stack
Hub VMs, all VM traffic to the internet shares the same uplink. A second, virtual network adapter won't segregate
traffic at the physical transport level.

Modify throughput of the network connection to Azure. To accommodate larger volumes of replication
traffic, you might consider leveraging Azure ExpressRoute with Microsoft peering for connections between
Azure Stack Hub virtual networks and Azure Storage. Azure ExpressRoute extends on-premises networks into
the Microsoft cloud over a private connection supplied by a connectivity provider. You can buy ExpressRoute
circuits for a wide range of bandwidths, from 50 megabits per second (Mbps) to 100 gigabits per second.

NOTE
For details regarding implementing Azure ExpressRoute in Azure Stack Hub scenarios, refer to Connect Azure Stack
Hub to Azure using Azure ExpressRoute.

Modify throttling of replication traffic on the process server. You can control how much bandwidth is used by
the replication traffic on the VMs that are hosting process servers from the graphical interface of the
Microsoft Azure Recovery Services agent. The supported capabilities include setting the limits for work and
non-work hours, with the bandwidth values ranging from 512 kilobits per second to 1,023 Mbps.
Alternatively, you can apply the same configuration by using the Set-OBMachineSetting PowerShell
cmdlet.
Modify network bandwidth allocated per protected VM on the process server. To accomplish this, modify the
value of UploadThreadsPerVM entry within the
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows Azure Backup\Replication key. By default,
the value is set to 4, but you can increase it to 32 if there's enough network bandwidth available.
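The following sketch combines the last two adjustments: throttling replication bandwidth through the Microsoft Azure Recovery Services agent cmdlet and raising the per-VM upload thread count in the registry. The schedule and bandwidth values are placeholders; run the commands on the VM that hosts the process server.

```powershell
# Throttle replication bandwidth via the Microsoft Azure Recovery Services agent.
# Schedule and bandwidth values (in bytes per second) are placeholders.
Set-OBMachineSetting -WorkDay "Mo", "Tu", "We", "Th", "Fr" `
    -StartWorkHour "08:00:00" -EndWorkHour "18:00:00" `
    -WorkHourBandwidth (4096 * 1024) -NonWorkHourBandwidth (102400 * 1024)

# Increase the per-VM upload thread count (default 4) if spare bandwidth is available.
Set-ItemProperty -Path 'HKLM:\SOFTWARE\Microsoft\Windows Azure Backup\Replication' `
    -Name 'UploadThreadsPerVM' -Value 8
```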

Manageability considerations
The primary considerations regarding manageability of Azure Site Recovery-based disaster recovery of Azure Stack
Hub VMs include:
Implementation of Azure Site Recovery on Azure Stack Hub
Failover and failback procedures
Delegation of roles and responsibilities
Implementation of Azure Site Recovery on Azure Stack Hub
To implement Azure Site Recovery on Azure Stack Hub in a small to medium sized single-tenant environment, you
can follow the manual provisioning process driven by the graphical interface of Recovery Services Vault in the
Azure portal. For multi-tenant implementations, you might want to consider automating parts of the
implementation process, because you will typically need to set up a separate configuration server VM and a
separate Recovery Services vault for each tenant. You also have the option to automate deployment of the mobility
agent by following the procedure described in Prepare source machine for push installation of mobility agent.
Failover and failback procedures
To simplify the management of failover, consider implementing recovery plans for all protected workloads. For
more information, refer to the Availability considerations section earlier in this reference architecture document.
You will also find recommendations for optimizing the management of the failback procedure.
Delegation of roles and responsibilities
Planning for and implementing disaster recovery of Azure Stack Hub VM-based workloads by using Azure Site
Recovery typically involves interaction of stakeholders:
Azure Stack Hub operators manage Azure Stack Hub infrastructure, ensuring that there are sufficient compute,
storage, and network resources necessary for implementing a comprehensive disaster recovery solution and
making these resources available to tenants. They also collaborate with application and data owners to help
determine the optimal approach to deploying their workloads to Azure Stack Hub.
Azure administrators manage Azure resources necessary to implement hybrid disaster recovery solutions.
Azure AD administrators manage Azure AD resources, including user and group objects that are used to
provision, configure, and manage Azure resources.
Azure Stack Hub tenant IT staff designs, implements, and manages Azure Site Recovery, including failover and
failback.
Azure Stack Hub users need to provide RPO and RTO requirements and submit requests to implement disaster
recovery for their workloads.
Make sure there's a clear understanding of the roles and responsibilities attributed to application owners and
operators in the context of protection and recovery. Users are responsible for protecting VMs. Operators are
responsible for the operational status of the Azure Stack Hub infrastructure.

NOTE
For guidance regarding fine-grained delegation of permissions in Azure Site Recovery scenarios, refer to Manage Site
Recovery access with role-based access control (RBAC).

Security considerations
Managing disaster recovery of user VM-based workloads in hybrid scenarios warrants additional security
considerations. These considerations can be grouped into the following categories:
Encryption in transit. This includes communication between protected Azure Stack Hub VMs, on-premises
Azure Site Recovery components, and cloud-based Azure Site Recovery components:
Mobility agent installed on the protected VMs always communicates with the process server via Transport
Layer Security (TLS) 1.2.
It's possible for communication from the configuration server to Azure and from the process server to
Azure to use TLS 1.1 or 1.0. To increase the level of security for hybrid connectivity, you should consider
enforcing the use of TLS 1.2.

NOTE
For details regarding configuring TLS 1.2-based encryption, refer to Transport Layer Security (TLS) registry settings
and Update to enable TLS 1.1 and TLS 1.2 as default secure protocols in WinHTTP in Windows
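As a sketch of the commonly documented registry settings referenced in the preceding note, the following commands prefer TLS 1.2 for .NET Framework and WinHTTP clients on the configuration/process server VM. Treat the values as assumptions to validate against the linked articles; a restart is required for them to take effect.

```powershell
# Prefer TLS 1.2 for .NET Framework clients (64-bit and 32-bit).
New-ItemProperty -Path 'HKLM:\SOFTWARE\Microsoft\.NETFramework\v4.0.30319' `
    -Name 'SchUseStrongCrypto' -Value 1 -PropertyType DWord -Force
New-ItemProperty -Path 'HKLM:\SOFTWARE\WOW6432Node\Microsoft\.NETFramework\v4.0.30319' `
    -Name 'SchUseStrongCrypto' -Value 1 -PropertyType DWord -Force

# Enable TLS 1.2 for WinHTTP (0x0800 = TLS 1.2; combine flags if older protocols are still required).
New-ItemProperty -Path 'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Internet Settings\WinHttp' `
    -Name 'DefaultSecureProtocols' -Value 0x0800 -PropertyType DWord -Force
```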

Encryption at rest. This includes Azure Storage and Azure VMs in the disaster recovery site.
Azure Storage is encrypted at rest for all storage accounts using 256-bit Advanced Encryption Standard
encryption and is Federal Information Processing Standard 140-2 compliant. Encryption is enabled
automatically and can't be disabled. By default, encryption uses Microsoft-managed keys, but customers
have the option to use their own keys stored in an Azure Key vault.
Managed disks of Azure VMs are automatically encrypted by using server-side encryption of Azure
managed disks, which also applies to their snapshots, by relying on using platform-managed encryption
keys.
In addition, you can enforce restricted access to the Azure Storage accounts hosting content of Azure Site Recovery-
replicated disks. To do this, enable the managed identity for the Recovery Services vault and assign to that managed
identity the following Azure role-based access control (RBAC) roles at the Azure Storage account level:
Resource Manager-based storage accounts (standard performance tier):
Contributor
Storage Blob Data Contributor
Resource Manager-based storage accounts (premium performance tier):
Contributor
Storage Blob Data Owner
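The following sketch shows how you might grant the vault's system-assigned managed identity the roles listed above on a standard performance tier storage account. The resource names are placeholders, and the Identity.PrincipalId property is an assumption; if it isn't populated, retrieve the principal ID from the vault's Identity blade in the Azure portal.

```powershell
# Grant the Recovery Services vault's managed identity access to the replica storage account.
# Names are placeholders; verify the principal ID if the Identity property isn't populated.
$vault     = Get-AzRecoveryServicesVault -ResourceGroupName 'rg-asr' -Name 'rsv-stackhub-dr'
$principal = $vault.Identity.PrincipalId
$scope     = (Get-AzStorageAccount -ResourceGroupName 'rg-asr' -Name 'stasrreplica').Id

New-AzRoleAssignment -ObjectId $principal -RoleDefinitionName 'Contributor' -Scope $scope
New-AzRoleAssignment -ObjectId $principal -RoleDefinitionName 'Storage Blob Data Contributor' -Scope $scope
```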
The Azure Recovery Services vault offers mechanisms that further protect its content, including the following
protections:
RBAC. This allows for delegation and segregation of responsibilities according to the principle of least privilege.
There are three Azure Site Recovery-related built-in roles that restrict access to Azure Site Recovery operations:
Site Recovery Contributor. This role has all the permissions required to manage Azure Site Recovery
operations in an Azure Recovery Services vault. A user with this role, however, can't create or delete the
vault or assign access rights to other users. This role is best suited for disaster recovery administrators
who can enable and manage disaster recovery for an Azure Stack Hub tenant.
Site Recovery Operator. This role has permissions to execute and manage failover and failback operations.
A user with this role can't enable or disable replication, create or delete vaults, register new infrastructure,
or assign access rights to other users. This role is best suited for a disaster recovery operator who can fail
over Azure Stack Hub VMs when instructed by application owners and IT administrators in an actual or
simulated disaster scenario.
Site Recovery Reader. This role has permissions to track all Azure Site Recovery management operations.
This role is best suited for IT staff responsible for monitoring the status of protected Azure Stack Hub VMs
and raising support tickets if required.
Azure Resource Locks. You have the option to create and assign read-only and delete locks to an Azure Site Recovery vault to mitigate the risk of the vault being accidentally or maliciously changed or deleted (see the sketch after this list).
Soft delete. The purpose of soft delete is to help protect the vault and its data from accidental or malicious
deletions. With soft delete, any deleted content is retained for 14 additional days, allowing for its retrieval during
that period. The additional 14-day retention of vault content doesn't incur any cost. Soft delete is enabled by
default.
Protection of security-sensitive operations. Azure Recovery Services vault allows you to enable an additional
layer of authentication whenever a security-sensitive operation, such as disabling protection, is attempted. This
extra validation helps ensure that only authorized users perform such operations.
Monitoring and alerts of suspicious activity. Azure Recovery Services provides built-in monitoring and alerting
of security-sensitive events related to the vault operations.
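For example, the following sketch applies a delete lock to a Recovery Services vault by using the Az PowerShell module; the names are placeholders.

```powershell
# Apply a delete lock to the Recovery Services vault (names are placeholders).
New-AzResourceLock -LockName 'lock-rsv-no-delete' `
    -LockLevel CanNotDelete `
    -ResourceName 'rsv-stackhub-dr' `
    -ResourceType 'Microsoft.RecoveryServices/vaults' `
    -ResourceGroupName 'rg-asr' `
    -LockNotes 'Protects the Azure Site Recovery vault from accidental deletion.'
```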

DevOps considerations
While configuring VM-level recovery by using Azure Site Recovery is primarily a responsibility of IT operations,
there are some DevOps-specific considerations that should be incorporated into a comprehensive disaster recovery
strategy. Azure Stack Hub facilitates implementing Infrastructure-as-Code (IaC), which incorporates the automated
deployment of a variety of workloads, including VM-based applications and services. You can leverage this
capability to streamline the provisioning of Azure Site Recovery-based disaster recovery scenarios, which simplifies
the initial setup in multiple tenant scenarios.
For example, you can use the same Azure Resource Manager templates to provision all of the network resources
necessary to accommodate VM-based workloads in an Azure Stack Hub stamp for your application in a single,
coordinated operation. You can use the same template to provision a matching set of resources in Azure to
provision a disaster recovery site. To account for any differences between the two environments, you can simply
specify different values of template parameters in each case.
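For example, the following sketch deploys the same hypothetical network template twice, once against Azure Stack Hub and once against Azure, with environment-specific parameter files.

```powershell
# Same ARM template deployed twice with environment-specific parameter files (paths are placeholders).

# Azure Stack Hub (production) - assumes you are connected to the Azure Stack Hub user environment.
New-AzResourceGroupDeployment -ResourceGroupName 'rg-workload-prod' `
    -TemplateFile '.\network.json' `
    -TemplateParameterFile '.\network.stackhub.parameters.json'

# Azure (disaster recovery site) - run against the public Azure environment.
New-AzResourceGroupDeployment -ResourceGroupName 'rg-workload-dr' `
    -TemplateFile '.\network.json' `
    -TemplateParameterFile '.\network.azure.parameters.json'
```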

Cost considerations
When considering the cost of the Azure Site Recovery-based disaster recovery solution described in this reference
architecture document, you need to account for both on-premises and cloud-based components. The Azure Stack
Hub pricing model determines the pricing of on-premises components. As with Azure, Azure Stack Hub offers a
pay-as-you-use arrangement, available through enterprise agreements and the Cloud Solution Provider program.
This arrangement includes a monthly price for each Windows Server VM. If you have the option to leverage existing
Windows Server licenses, you can significantly reduce the cost to the base VM pricing. However, with Azure Site
Recovery, you will usually need only a single Azure Stack Hub VM per tenant, which is required to implement the
tenant-specific configuration server.
Azure-related charges are associated with the use of the following resources:
Azure Recovery Services. The pricing is determined by the number of protected instances. It's worth noting
that every protected instance incurs no Azure Site Recovery charges for the first 31 days.
Azure Storage. The pricing reflects a combination of the following factors:
Performance tier
Volume of data stored
Volume of outbound data transfer
Quantity and types of operations performed (for standard performance tier only)
Data redundancy (for standard performance tier only)
Azure ExpressRoute. The pricing is based on one of two models:
Unlimited data. This model includes a monthly fee with all inbound and outbound data transfers included.
Metered data. This model includes a monthly fee with all inbound data transfers free of charge and
outbound data transfers charged per GB.

NOTE
This assessment doesn't include cost of physical connections delivered by third party connectivity providers.
Azure VMs. The pricing of Azure VMs reflects a combination of two components:
Compute cost. The VM size, its uptime, and the licensing model of its operating system determine the
cost.
Managed disk cost. The disk size and performance tier determine the cost.

NOTE
It's worth noting that hydration eliminates the need to run Azure VMs during regular business operations, with
workloads running on Azure Stack Hub, which considerably reduces the compute costs of Azure Site Recovery-based
implementations, especially in comparison to traditional disaster recovery solutions.

NOTE
The prices of resources vary between Azure regions.

NOTE
For details regarding pricing, refer to Azure Pricing.

Alternative solutions
The recommended solution described in this reference architecture document isn't the only way to provide disaster
recovery functionality for Azure Stack Hub VM-based workloads. Customers have other options, including:
A failover to another Azure Stack Hub stamp. Users that need to protect against a datacenter or site outage
might be able to use another Azure Stack Hub deployment to implement disaster recovery provisions. With
primary and secondary locations, users can deploy applications in an active/passive configuration across two
environments. For less critical workloads, it might be acceptable to leverage unused capacity in the
secondary location to perform on-demand restoration of applications from backup. You also have the option
to implement a recovery site in another datacenter, which, in turn, leverages Azure Site Recovery to provision
a replica of the recovery site in Azure. Several factors determine whether the use of Azure Site Recovery with
Azure serving as the failover site is a viable solution. These factors include government regulations,
corporate policies, and latency requirements.

NOTE
As of July 2020, Azure Site Recovery doesn't support this scenario, which means that the implementation would need
to rely on a third party or in-house solution.

Backup and restore. Backing up your applications and datasets enables you to recover quickly from
downtime because of data corruption, accidental deletions, or localized outages. For Azure Stack Hub VM-
based applications, you can use an in-guest agent to protect application data, operating system
configuration, and data stored on volumes. Backing up a VM using a guest OS agent typically includes
capturing operating system configuration, files, folders, volumes, application binaries, and application data.
Recovering an application from an agent requires recreating the VM, followed by installing the operating
system and the guest agent. At that point, you can restore data into the guest OS.
Backup of disk snapshots. It's possible to use snapshots to capture an Azure Stack Hub VM configuration and
the disks attached to a stopped VM. This requires backup products that integrate with Azure Stack Hub APIs
to capture VM configuration and create disk snapshots.
NOTE
As of July 2020, using disk snapshots for VM in a running state isn't supported. Creating a snapshot of a disk
attached to a running VM might degrade the performance or impact the availability of the operating system or
application in the VM.

Backup and restore VMs using an external backup solution in the same datacenter followed by the
replication of backups to another location. This allows you to restore Azure Stack Hub VMs to the same or a
different Azure Stack Hub instance, or to Azure.

Summary
In conclusion, Azure Stack Hub is a unique offering, which differs in many aspects from other virtualization
platforms. As such, it warrants special considerations in regard to business continuity strategy for its workloads. By
leveraging Azure services, you can simplify designing and implementing this strategy. In this architecture reference
document, we explored the use of Microsoft Azure Site Recovery for protecting Azure Stack Hub VM-based
workloads in the connected deployment model. This approach allows customers to benefit from resiliency and
manageability of Azure Stack Hub and from the hyperscale and global presence of the Azure cloud.
The disaster recovery solution described here focuses exclusively on VM-based workloads of Azure Stack Hub. It is
only one part of an overall business continuity strategy, which should also account for other Azure Stack Hub
workload types and scenarios that affect their availability.
On-premises data gateway for Azure Logic Apps

This reference architecture illustrates a logic app that runs in Microsoft Azure and is triggered by Azure
Spring Cloud. The logic app then connects to on-premises resources such as Microsoft SQL Server and Microsoft
SharePoint Server.

Download a Visio file of this architecture.


Typical uses for this architecture include:
Cloud-based Azure Logic Apps workflows that require data from on-premises software as part of their run.
Extending the capabilities of existing on-premises software by triggering Logic Apps workflows in the cloud.

Architecture
The architecture consists of the following components:
Azure Spring Cloud . Spring Cloud provides a managed service that's designed and optimized specifically for
Spring microservices that are written in Java.
Azure Logic Apps . Logic apps are automated workflows that are provided as a scalable cloud service for
common enterprise orchestration tasks. Logic apps include connectors for many popular cloud services, on-
premises products, and other software as a service (SaaS) applications. The Logic Apps workflow includes the following
features:
Trigger that fires whenever a new Azure Queue storage message is received.
Action to parse the JavaScript Object Notation (JSON) body of the queue message.
Action to query SQL Server for relevant data.
Action to compose a response from the collected data.
Action to persist the composed response to a list on the SharePoint Server.
On-premises data gateway . An on-premises data gateway is bridge software that connects on-premises data
to cloud services. The gateway typically installs on a dedicated on-premises virtual machine.
SQL Server . This is an installation of SQL Server.
SharePoint Server . This is an installation of SharePoint Server.

Recommendations
The following recommendations apply for most scenarios. Follow these recommendations unless you have specific
requirements that override them.
On-premises data gateway
While it's possible to expose your on-premises servers to the public internet, it's preferable to use an on-premises
data gateway. This gateway creates a secure read/write connection between your on-premises data sources and
Azure. The on-premises data gateway is used with Logic Apps connectors to connect local data sources to Azure.
Before installing an on-premises data gateway, consider the following:
You can install an on-premises data gateway on any on-premises virtual machine. You can choose to install it on
the same virtual machine as your server applications or on a co-located virtual machine with the appropriate
network connectivity.
The architecture of the on-premises data gateway is designed to make outbound connections to Azure Service
Bus.
Review the additional considerations for installing an on-premises data gateway.
An on-premises data gateway can be used with more than just Logic Apps. It's possible to use the same data
gateway installation with:
Power BI
Power Automate
Power Apps
Azure Analysis Services

Scalability considerations
As the number of Logic Apps workflows increases, consider the capacity constraints of the on-premises servers
and determine whether they can handle the increased workload.

Availability considerations
Avoid single point of failure issues for an on-premises data gateway by installing the software on multiple on-
premises virtual machines.
Consider implementing high availability topologies for your on-premises servers by using techniques such as:
High availability architecture for SharePoint Server
SQL Server AlwaysOn

Manageability considerations
Service Bus is used for outbound data gateway communication. This might require configuring your firewall to
allow outbound connections to Azure.
Consider Azure ExpressRoute if you want consistent throughput from your on-premises data sources to Azure.
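To confirm that the required outbound paths are open before relying on the gateway, you can run a quick connectivity check from the gateway virtual machine. The following is a minimal sketch; the Service Bus namespace shown is a placeholder, and the actual host names appear in the gateway's network settings.

# Run on the on-premises data gateway VM to verify outbound reachability to
# Azure Service Bus over the ports the gateway relies on.
# The relay host name below is a placeholder.
$relayHost = "contoso-relay.servicebus.windows.net"
foreach ($port in 443, 5671, 5672, 9350, 9351, 9352, 9353, 9354) {
    $result = Test-NetConnection -ComputerName $relayHost -Port $port
    "{0}:{1} reachable: {2}" -f $relayHost, $port, $result.TcpTestSucceeded
}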

DevOps considerations
The Azure resource for an on-premises data gateway should only be created after the gateway
software is installed on an on-premises virtual machine.
Consider storing workflow configuration as a JSON template within an Azure Resource Manager template to
automate deployment.
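As a hedged example, the following sketch deploys such a template with Azure PowerShell; the resource group and file names are placeholders.

# Deploy an ARM template that carries the Logic Apps workflow definition so the
# workflow can be recreated repeatably. Resource group and file names are placeholders.
New-AzResourceGroupDeployment `
    -ResourceGroupName "rg-logicapp-gateway" `
    -TemplateFile ".\logicapp-workflow.json" `
    -TemplateParameterFile ".\logicapp-workflow.parameters.json"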

Security considerations
While it's possible to expose your on-premises servers to the public internet, it's preferable to use an on-
premises data gateway. This gateway creates a secure read/write connection between your on-premises data
sources and Azure.

Cost considerations
Use the Azure pricing calculator to estimate costs.
This reference assumes that the consumption plan is used to create a global Logic Apps resource.
This reference only uses standard connectors, which are metered at a lower price than enterprise connectors.
Run containers in a hybrid environment

This reference architecture illustrates how developers can create, manage, and monitor deployed containers in the
public cloud, across multiple clouds, and on-premises.

Download a Visio file of this architecture.


Typical uses for this architecture include:
Web applications with internal and external components that deploy both to the public cloud and on-premises
by using shared container images.
Modern deployment testing cycles with quality analysis, testing, development, or staging that's hosted on-
premises and in the public cloud.

Architecture
Microsoft Azure Container Registry (ACR) . ACR is a service that creates a managed registry. ACR builds,
stores, and manages container images and can store containerized machine learning models.
Azure Kubernetes Service (AKS) . AKS is a managed service that offers a managed Kubernetes cluster with
elastic scale-out functionality.
Azure Container Instances . Azure Container Instances runs containers on-demand in a serverless Microsoft
Azure environment. Azure Container Instances is a low-friction method of running containers that doesn't
require a full Docker host or Kubernetes installation.
Azure Cosmos DB . Azure Cosmos DB is a multi-model database that can serve data elastically at a massive
scale. Azure Cosmos DB was designed for applications that are globally distributed in a multi-write model.
Azure Key Vault . Azure Key Vault is a hardware-backed credential management service that has tight
integration with Microsoft identity services and compute resources.
Azure Policy . Azure Policy enforces standards and assesses compliance for targeted resources deployed to
Azure.
Azure Private Link . Azure Private Link creates a private endpoint in your virtual network that you can use to
communicate with Azure platform as a service (PaaS) without exposing your service to the public internet.
Azure Monitor . Azure Monitor is an all-encompassing suite of monitoring services for applications that
deploy both in Azure and on-premises.
Azure Security Center . Azure Security Center is a unified security management and threat protection system
for workloads across on-premises, multiple clouds, and Azure.
On-premises Kubernetes cluster . In this architecture, a local Kubernetes cluster is used to run multiple
containers on-premises.

Recommendations
Azure Container Registry
ACR is an enterprise container registry that can implement common best practices by protecting images from
unauthorized access, replicating images across multiple geographies, preventing unnecessary ingress/egress, and
optimizing costs. Additionally, ACR supports turnkey geo-replication across multiple Azure regions, which helps
you minimize latency between ACR, your container hosts, and your development team.
ACR includes a suite of tasks, referred to as ACR Tasks, that can manage cloud-based container image building and
maintenance across a variety of operating systems. ACR Tasks can be triggered manually, by a change to source
control, by a change to the base container image, or on a fixed schedule. The following are scenarios in which you
could use ACR Tasks:
An Internet of Things (IoT) developer is building container images to run on ARM-based IoT devices. The
developer might be using a Linux or macOS operating system to develop the software, but they will need to
perform the build on an ARM platform.
A software as a service (SaaS) development team is building software on Windows computers that will run
their container images on Linux hosts. The team would like its build to be performed on a Linux host.
An open source project maintainer is building a container image that augments a well-known operating system
base image. The maintainer will likely want their container image to update every time the base image updates.

NOTE
ACR Tasks can standardize the build environment and perform continuous integration of your container images.

Azure Container Instances


Azure Container Instances is a low-friction, serverless compute environment for containerized applications. Azure
Container Instances is an excellent choice for container deployment because of its low management overhead and
quick startup times. Container images that are stored in ACR can deploy directly to Azure Container Instances
container groups.
In this reference, Azure Container Instances container groups are utilized as virtual nodes for an Azure Kubernetes
Service cluster. AKS uses virtual nodes to register a node with effectively unlimited capacity and to dispatch
pods by using Azure Container Instances container groups. This is ideal for scenarios where you want very fast
provisioning of individual pods and only want to pay for execution time per second.
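For reference, the following sketch runs a container image as an on-demand container group with the Az.ContainerInstance module. The parameter set shown follows the 1.x version of the module that was current when this article was written and may differ in later versions; the resource names, image, and DNS label are placeholders.

# Run a public container image as an on-demand container group.
# Names are placeholders; parameters follow the Az.ContainerInstance 1.x syntax.
New-AzContainerGroup `
    -ResourceGroupName "rg-containers" `
    -Name "aci-demo" `
    -Image "mcr.microsoft.com/azuredocs/aci-helloworld" `
    -OsType Linux `
    -Cpu 1 -MemoryInGB 1.5 `
    -DnsNameLabel "aci-demo-contoso"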

Scalability considerations
Customer-facing containerized web applications benefit from variable scales. You can use services such as
Azure Container Instances and AKS to dynamically scale out to meet anticipated or measured demand.
Additionally, you can use services such as Azure Functions and Azure App Service to run container images at
scale.
Internal application usage is more predictable and can run on an existing Kubernetes cluster. If you're interested
in deploying Azure-managed services on-premises, consider:
Connecting a Kubernetes cluster to Azure Arc.
Deploying Functions or App Service in Azure Stack Hub.
Azure Cosmos DB automatically scales service resources to meet the storage needs of your application in an
elastic manner. For throughput, you can choose to pre-provision throughput or operate Azure Cosmos DB as a
serverless service. If your workload has variable or unpredictable demands, you can also choose to provision
your throughput using autoscale.

Availability considerations
Modern applications typically include a website, one or more HTTP APIs, and some connection to a data store.
Applications within a container image should be designed to be stateless for maximum horizontal scale and
availability. Any data should be stored in a separate service that has similar availability. For guidance on
designing an application that can scale to thousands of nodes, refer to the performance efficiency section of the
Azure Well-Architected Framework.
AKS has a reference architecture baseline that defines each of the Well-Architected Framework categories and
recommends an implementation in line with the category.
To reduce the impact of large pulls of container images, deploy ACR in a region that's closest to the
development team and the production compute services. Consider a geo-replicated ACR deployment for
distributed teams and distributed production containers.
Azure Cosmos DB is a database service that supports turnkey global distribution and supports automatic
failover across multiple regions. Azure Cosmos DB also has the ability to enable multiple region writes and
dynamically add or remove regions.

Manageability considerations
Consider using Azure Resource Manager templates to deploy Azure Container Instance container groups in a
repeatable fashion for multiple region deployments and large-scale orchestration. You can similarly use Azure
Resource Manager templates to deploy Azure Kubernetes Service, Azure Key Vault, and Azure Cosmos DB.
Consider utilizing role-based access control (RBAC) to prevent users from accidentally creating or deleting
container instances without permission.
Use Azure Monitor to monitor metrics and logs for both on-premises and remote containers, analyze the data
using queries, and create alerts for abnormal situations.
Use Azure Policy to implement enforcement of a set of rules for clusters and pods deployed to Kubernetes
Service or an Azure Arc-enabled Kubernetes cluster.
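As an illustration of the last point, the sketch below assigns a built-in Kubernetes policy definition at the scope of an AKS cluster. The cluster name, resource group, and policy display name are placeholders; substitute the definition you actually want to enforce.

# Assign a built-in Kubernetes policy definition at the scope of an AKS cluster.
# Cluster, resource group, and policy display name are placeholders.
$cluster = Get-AzAksCluster -ResourceGroupName "rg-containers" -Name "aks-hybrid"
$definition = Get-AzPolicyDefinition | Where-Object {
    $_.Properties.DisplayName -like "*should not allow privileged containers*"
} | Select-Object -First 1

New-AzPolicyAssignment `
    -Name "no-privileged-containers" `
    -Scope $cluster.Id `
    -PolicyDefinition $definition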

DevOps considerations
Use ACR Tasks to automate the build of container images on a schedule or when changes are made to the
source code. Additionally, consider using ACR Tasks to automatically update container images as base images
are patched and updated.
The AKS team has developed GitHub actions that can assist with implementing GitOps and facilitate
deployments from ACR to AKS clusters.
If your Kubernetes cluster is attached to Azure Arc, you can manage your Kubernetes cluster using GitOps. To
review best practices for connecting a hybrid Kubernetes cluster to Azure Arc, refer to the Azure Arc hybrid
management and deployment for Kubernetes clusters reference architecture.

Security considerations
Use Azure Private Link to communicate with Azure services from within your virtual network. Traffic is routed
through specific subnets to reach the individual Azure services directly, which protects your data from inadvertent
exposure to the public internet.

Cost considerations
Use the Azure pricing calculator to estimate costs.
If your development team and production instances are in a single region, consider placing the Container
Registry resource in the same region. This will allow you to minimize container push and pull latency and avoid
the additional costs associated with the Premium Azure Container Registry service tier.
Configuring ACR to use an Azure Virtual Network through an Azure Private Link service endpoint requires the
ACR instance to be deployed in the Premium tier.
AKS offers free cluster management. Billing is isolated to the compute, storage, and networking resources used
by AKS to host nodes. Refer to Azure Virtual Machine or Azure Container Instances pricing to review specific
pricing details for each compute service.
If you require a specific uptime service-level agreement (SLA), you can enable the uptime SLA optional feature
of AKS.
Azure Container Instances resources are billed per second, based on an allocation of virtual CPU and memory
resources, to the container group. Allocating unnecessary compute resources can exponentially increase the
costs required to run this architecture solution. Cost monitoring and optimization is a continuous process that
should be conducted at regular intervals throughout the lifetime of your deployment. For further guidance on
minimizing Azure Container Instances operational costs, refer to the cost optimization section of the Azure
Well-Architected Framework.
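To make the per-second billing model concrete, the following sketch estimates a monthly cost for a single container group. The unit prices are placeholders rather than published rates; always confirm current rates with the Azure pricing calculator.

# Rough monthly estimate for one continuously running container group.
# The unit prices below are placeholders, not published rates.
$pricePerVcpuSecond = 0.0000135
$pricePerGbSecond   = 0.0000015
$vcpu               = 2
$memoryGb           = 4
$secondsPerMonth    = 30 * 24 * 3600

$estimate = ($vcpu * $pricePerVcpuSecond + $memoryGb * $pricePerGbSecond) * $secondsPerMonth
"Estimated monthly cost: {0:N2} USD" -f $estimate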
Choose a solution for connecting an on-premises
network to Azure

This article compares options for connecting an on-premises network to an Azure Virtual Network (VNet). For
each option, a more detailed reference architecture is available.

VPN connection
A VPN gateway is a type of virtual network gateway that sends encrypted traffic between an Azure virtual
network and an on-premises location. The encrypted traffic goes over the public Internet.
This architecture is suitable for hybrid applications where the traffic between on-premises hardware and the
cloud is likely to be light, or you are willing to trade slightly extended latency for the flexibility and processing
power of the cloud.
Benefits
Simple to configure.
Challenges
Requires an on-premises VPN device.
Bandwidth is limited by the VPN Gateway SKU and is typically lower than a dedicated circuit, because traffic
traverses the public internet.
Although Microsoft guarantees 99.9% availability for each VPN Gateway, this SLA only covers the VPN
gateway, and not your network connection to the gateway.
Reference architecture
Hybrid network with VPN gateway

Azure ExpressRoute connection


ExpressRoute connections use a private, dedicated connection through a third-party connectivity provider. The
private connection extends your on-premises network into Azure.
This architecture is suitable for hybrid applications running large-scale, mission-critical workloads that require a
high degree of scalability.
Benefits
Much higher bandwidth available; up to 10 Gbps depending on the connectivity provider.
Supports dynamic scaling of bandwidth to help reduce costs during periods of lower demand. However, not
all connectivity providers have this option.
May allow your organization direct access to national clouds, depending on the connectivity provider.
99.9% availability SLA across the entire connection.
Challenges
Can be complex to set up. Creating an ExpressRoute connection requires working with a third-party
connectivity provider. The provider is responsible for provisioning the network connection.
Requires high-bandwidth routers on-premises.
Reference architecture
Hybrid network with ExpressRoute
ExpressRoute with VPN failover
This option combines the previous two, using ExpressRoute in normal conditions, but failing over to a VPN
connection if there is a loss of connectivity in the ExpressRoute circuit.
This architecture is suitable for hybrid applications that need the higher bandwidth of ExpressRoute, and also
require highly available network connectivity.
Benefits
High availability if the ExpressRoute circuit fails, although the fallback connection is on a lower bandwidth
network.
Challenges
Complex to configure. You need to set up both a VPN connection and an ExpressRoute circuit.
Requires redundant hardware (VPN appliances), and a redundant Azure VPN Gateway connection for which
you pay charges.
Reference architecture
Hybrid network with ExpressRoute and VPN failover

Hub-spoke network topology


A hub-spoke network topology is a way to isolate workloads while sharing services such as identity and security.
The hub is a virtual network (VNet) in Azure that acts as a central point of connectivity to your on-premises
network. The spokes are VNets that peer with the hub. Shared services are deployed in the hub, while individual
workloads are deployed as spokes.
Reference architectures
Hub-spoke topology
Connect an on-premises network to Azure using
ExpressRoute with VPN failover

This reference architecture shows how to connect an on-premises network to an Azure virtual network (VNet)
using ExpressRoute, with a site-to-site virtual private network (VPN) as a failover connection. Traffic flows between
the on-premises network and the Azure VNet through an ExpressRoute connection. If there is a loss of
connectivity in the ExpressRoute circuit, traffic is routed through an IPSec VPN tunnel. Deploy this solution.
Note that if the ExpressRoute circuit is unavailable, the VPN route will only handle private peering connections.
Public peering and Microsoft peering connections will pass over the Internet.

Download a Visio file of this architecture.

Architecture
The architecture consists of the following components.
On-premises network . A private local-area network running within an organization.
VPN appliance . A device or service that provides external connectivity to the on-premises network. The
VPN appliance may be a hardware device, or it can be a software solution such as the Routing and Remote
Access Service (RRAS) in Windows Server 2012. For a list of supported VPN appliances and information on
configuring selected VPN appliances for connecting to Azure, see About VPN devices for Site-to-Site VPN
Gateway connections.
ExpressRoute circuit . A layer 2 or layer 3 circuit supplied by the connectivity provider that joins the on-
premises network with Azure through the edge routers. The circuit uses the hardware infrastructure
managed by the connectivity provider.
ExpressRoute virtual network gateway . The ExpressRoute virtual network gateway enables the VNet to
connect to the ExpressRoute circuit used for connectivity with your on-premises network.
VPN virtual network gateway . The VPN virtual network gateway enables the VNet to connect to the
VPN appliance in the on-premises network. The VPN virtual network gateway is configured to accept
requests from the on-premises network only through the VPN appliance. For more information, see
Connect an on-premises network to a Microsoft Azure virtual network.
VPN connection . The connection has properties that specify the connection type (IPSec) and the key
shared with the on-premises VPN appliance to encrypt traffic.
Azure Virtual Network (VNet) . Each VNet resides in a single Azure region, and can host multiple
application tiers. Application tiers can be segmented using subnets in each VNet.
Gateway subnet . The virtual network gateways are held in the same subnet.
Cloud application . The application hosted in Azure. It might include multiple tiers, with multiple subnets
connected through Azure load balancers. For more information about the application infrastructure, see
Running Windows VM workloads and Running Linux VM workloads.

Recommendations
The following recommendations apply for most scenarios. Follow these recommendations unless you have a
specific requirement that overrides them.
VNet and GatewaySubnet
Create the ExpressRoute virtual network gateway connection and the VPN virtual network gateway connection in
the same VNet with a Gateway object already in place. They will both share the same subnet named
GatewaySubnet.
If the VNet already includes a subnet named GatewaySubnet, ensure that it has a /27 or larger address space. If
the existing subnet is too small, use the following PowerShell command to remove the subnet:

$vnet = Get-AzVirtualNetwork -Name <your-vnet-name> -ResourceGroupName <your-resource-group>


Remove-AzVirtualNetworkSubnetConfig -Name GatewaySubnet -VirtualNetwork $vnet

If the VNet does not contain a subnet named GatewaySubnet , create a new one using the following PowerShell
command:

$vnet = Get-AzVirtualNetwork -Name <your-vnet-name> -ResourceGroupName <your-resource-group>
Add-AzVirtualNetworkSubnetConfig -Name "GatewaySubnet" -VirtualNetwork $vnet -AddressPrefix "10.200.255.224/27"
$vnet = Set-AzVirtualNetwork -VirtualNetwork $vnet

VPN and ExpressRoute gateways


Verify that your organization meets the ExpressRoute prerequisite requirements for connecting to Azure.
If you already have a VPN virtual network gateway in your Azure VNet, use the following PowerShell command to
remove it:

Remove-AzVirtualNetworkGateway -Name <your-gateway-name> -ResourceGroupName <your-resource-group>

Follow the instructions in Implementing a hybrid network architecture with Azure ExpressRoute to establish your
ExpressRoute connection.
Follow the instructions in Implementing a hybrid network architecture with Azure and On-premises VPN to
establish your VPN virtual network gateway connection.
After you have established the virtual network gateway connections, test the environment as follows:
1. Make sure you can connect from your on-premises network to your Azure VNet.
2. Contact your provider to stop ExpressRoute connectivity for testing.
3. Verify that you can still connect from your on-premises network to your Azure VNet using the VPN virtual
network gateway connection.
4. Contact your provider to reestablish ExpressRoute connectivity.
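While ExpressRoute connectivity is suspended, you can confirm which path is active by querying the status of both gateway connections. The connection names below are placeholders; the resource group matches the one used in the deployment steps later in this article.

# Check the status of the ExpressRoute and VPN connections during the failover test.
# Connection names are placeholders.
Get-AzVirtualNetworkGatewayConnection -Name "er-connection" -ResourceGroupName "ra-hybrid-vpn-er-rg" |
    Select-Object Name, ConnectionType, ConnectionStatus
Get-AzVirtualNetworkGatewayConnection -Name "vpn-connection" -ResourceGroupName "ra-hybrid-vpn-er-rg" |
    Select-Object Name, ConnectionType, ConnectionStatus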
DevOps considerations
For ExpressRoute DevOps considerations, see the Implementing a Hybrid Network Architecture with Azure
ExpressRoute guidance.
For site-to-site VPN DevOps considerations, see the Implementing a Hybrid Network Architecture with Azure and
On-premises VPN guidance.

Security considerations
For general Azure security considerations, see Microsoft cloud services and network security.

Cost considerations
For ExpressRoute cost considerations, see these articles:
Cost considerations in Implementing a Hybrid Network Architecture with Azure ExpressRoute.
Cost considerations in Implementing a Hybrid Network Architecture with Azure and On-premises VPN.

Deploy the solution


Prerequisites . You must have an existing on-premises infrastructure already configured with a suitable network
appliance.
To deploy the solution, perform the following steps.
1. Click the link below.

2. Wait for the link to open in the Azure portal, then follow these steps:
The Resource group name is already defined in the parameter file, so select Create New and enter
ra-hybrid-vpn-er-rg in the text box.
Select the region from the Location drop down box.
Do not edit the Template Root Uri or the Parameter Root Uri text boxes.
Review the terms and conditions, then click the I agree to the terms and conditions stated above
checkbox.
Click the Purchase button.
3. Wait for the deployment to complete.
4. Click the link below.

5. Wait for the link to open in the Azure portal, then follow these steps:
Select Use existing in the Resource group section and enter ra-hybrid-vpn-er-rg in the text box.
Select the region from the Location drop down box.
Do not edit the Template Root Uri or the Parameter Root Uri text boxes.
Review the terms and conditions, then click the I agree to the terms and conditions stated above
checkbox.
Click the Purchase button.
Cross Cloud Scaling Architecture

Solution Idea

Modern software is increasingly connected and distributed. The consistency of Azure Stack with Azure
infrastructure and platform services enables you to scale resources across clouds to meet increased load as needed,
and to decrease resources as demand drops. Optimize cost and maximize resource efficiency while remaining
compliant with a cross-cloud architecture.

Architecture

Download an SVG of this architecture.

Data Flow
1. A large number of users attempt to access a web app.
2. Traffic manager returns the Azure Stack DNS name.
3. Users access the Azure Stack web app.
4. Once a threshold is reached, a function starts the Azure Web App and enables the Azure Traffic Manager route.
5. Traffic is routed to Azure, which can automatically scale App Service.
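Steps 4 and 5 can be implemented with a function that starts the Azure web app and re-enables its Traffic Manager endpoint. The sketch below shows the equivalent Azure PowerShell calls; all resource names are placeholders.

# Start the Azure web app and re-enable its Traffic Manager endpoint when the
# Azure Stack app reaches its scale threshold. All names are placeholders.
Start-AzWebApp -ResourceGroupName "rg-crosscloud" -Name "contoso-web-azure"

Enable-AzTrafficManagerEndpoint `
    -Name "azure-endpoint" `
    -ProfileName "contoso-traffic" `
    -ResourceGroupName "rg-crosscloud" `
    -Type AzureEndpoints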

Components
Traffic Manager: Route incoming traffic for high performance and availability
Azure Functions: Process events with serverless code
Azure Stack: Build and run innovative hybrid applications across cloud boundaries

Next steps
Traffic Manager documentation
Azure Functions documentation
Azure Stack documentation
Cross-platform Chat

Solution Idea

Accelerate development of reliable, high-performing chat applications.

Architecture

Download an SVG of this architecture.

Data Flow
1. Web chat app connects to SignalR Service and receives token
2. User logs into app with multi-factor authentication; if passed, SignalR endpoint and bearer token returned
3. User connects to the SignalR Service with endpoint and token
Connect an on-premises network to Azure using
ExpressRoute

This reference architecture shows how to connect an on-premises network to virtual networks on Azure, using
Azure ExpressRoute. ExpressRoute connections use a private, dedicated connection through a third-party
connectivity provider. The private connection extends your on-premises network into Azure. Deploy this
solution.

Download a Visio file of this architecture.

Architecture
The architecture consists of the following components.
On-premises corporate network . A private local-area network running within an organization.
ExpressRoute circuit . A layer 2 or layer 3 circuit supplied by the connectivity provider that joins the on-
premises network with Azure through the edge routers. The circuit uses the hardware infrastructure
managed by the connectivity provider.
Local edge routers . Routers that connect the on-premises network to the circuit managed by the
provider. Depending on how your connection is provisioned, you may need to provide the public IP
addresses used by the routers.
Microsoft edge routers . Two routers in an active-active highly available configuration. These routers
enable a connectivity provider to connect their circuits directly to their datacenter. Depending on how your
connection is provisioned, you may need to provide the public IP addresses used by the routers.
Azure virtual networks (VNets) . Each VNet resides in a single Azure region, and can host multiple
application tiers. Application tiers can be segmented using subnets in each VNet.
Azure public services . Azure services that can be used within a hybrid application. These services are
also available over the Internet, but accessing them using an ExpressRoute circuit provides low latency and
more predictable performance, because traffic does not go through the Internet.
Microsoft 365 services . The publicly available Microsoft 365 applications and services provided by
Microsoft. Connections are performed using Microsoft peering, with addresses that are either owned by
your organization or supplied by your connectivity provider. You can also connect directly to Microsoft
CRM Online through Microsoft peering.
Connectivity providers (not shown). Companies that provide a connection either using layer 2 or layer 3
connectivity between your datacenter and an Azure datacenter.

Recommendations
The following recommendations apply for most scenarios. Follow these recommendations unless you have a
specific requirement that overrides them.
Connectivity providers
Select a suitable ExpressRoute connectivity provider for your location. To get a list of connectivity providers
available at your location, use the following Azure PowerShell command:

Get-AzureRmExpressRouteServiceProvider

ExpressRoute connectivity providers connect your datacenter to Microsoft in the following ways:
Co-located at a cloud exchange . If you're co-located in a facility with a cloud exchange, you can order
virtual cross-connections to Azure through the co-location provider's Ethernet exchange. Co-location
providers can offer either layer 2 cross-connections, or managed layer 3 cross-connections between your
infrastructure in the co-location facility and Azure.
Point-to-point Ethernet connections . You can connect your on-premises datacenters/offices to Azure
through point-to-point Ethernet links. Point-to-point Ethernet providers can offer layer 2 connections, or
managed layer 3 connections between your site and Azure.
Any-to-any (IPVPN) networks . You can integrate your wide area network (WAN) with Azure. Internet
protocol virtual private network (IPVPN) providers (typically a multiprotocol label switching VPN) offer any-
to-any connectivity between your branch offices and datacenters. Azure can be interconnected to your WAN to
make it look just like any other branch office. WAN providers typically offer managed layer 3 connectivity.
For more information about connectivity providers, see the ExpressRoute introduction.
ExpressRoute circuit
Ensure that your organization has met the ExpressRoute prerequisite requirements for connecting to Azure.
If you haven't already done so, add a subnet named GatewaySubnet to your Azure VNet and create an
ExpressRoute virtual network gateway using the Azure VPN gateway service. For more information about this
process, see ExpressRoute workflows for circuit provisioning and circuit states.
Create an ExpressRoute circuit as follows:
1. Run the following PowerShell command:

New-AzureRmExpressRouteCircuit -Name <<circuit-name>> -ResourceGroupName <<resource-group>> -Location <<location>> -SkuTier <<sku-tier>> -SkuFamily <<sku-family>> -ServiceProviderName <<service-provider-name>> -PeeringLocation <<peering-location>> -BandwidthInMbps <<bandwidth-in-mbps>>

2. Send the ServiceKey for the new circuit to the service provider.
3. Wait for the provider to provision the circuit. To verify the provisioning state of a circuit, run the following
PowerShell command:

Get-AzureRmExpressRouteCircuit -Name <<circuit-name>> -ResourceGroupName <<resource-group>>

The Provisioning state field in the Service Provider section of the output will change from
NotProvisioned to Provisioned when the circuit is ready.
NOTE
If you're using a layer 3 connection, the provider should configure and manage routing for you. You provide the
information necessary to enable the provider to implement the appropriate routes.

4. If you're using a layer 2 connection:


a. Reserve two /30 subnets composed of valid public IP addresses for each type of peering you want
to implement. These /30 subnets will be used to provide IP addresses for the routers used for the
circuit. If you are implementing private and Microsoft peering, you'll need 4 /30 subnets with valid
public IP addresses.
b. Configure routing for the ExpressRoute circuit. Run the following PowerShell commands for each
type of peering you want to configure (private and Microsoft). For more information, see Create and
modify routing for an ExpressRoute circuit.

Set-AzureRmExpressRouteCircuitPeeringConfig -Name <<peering-name>> -Circuit <<circuit-name>> -PeeringType <<peering-type>> -PeerASN <<peer-asn>> -PrimaryPeerAddressPrefix <<primary-peer-address-prefix>> -SecondaryPeerAddressPrefix <<secondary-peer-address-prefix>> -VlanId <<vlan-id>>

Set-AzureRmExpressRouteCircuit -ExpressRouteCircuit <<circuit-name>>

c. Reserve another pool of valid public IP addresses to use for network address translation (NAT) for
Microsoft peering. It is recommended to have a different pool for each peering. Specify the pool to
your connectivity provider, so they can configure border gateway protocol (BGP) advertisements for
those ranges.
5. Run the following PowerShell commands to link your private VNet(s) to the ExpressRoute circuit. For more
information, see Link a virtual network to an ExpressRoute circuit.

$circuit = Get-AzureRmExpressRouteCircuit -Name <<circuit-name>> -ResourceGroupName <<resource-group>>
$gw = Get-AzureRmVirtualNetworkGateway -Name <<gateway-name>> -ResourceGroupName <<resource-group>>
New-AzureRmVirtualNetworkGatewayConnection -Name <<connection-name>> -ResourceGroupName <<resource-group>> -Location <<location>> -VirtualNetworkGateway1 $gw -PeerId $circuit.Id -ConnectionType ExpressRoute

You can connect multiple VNets located in different regions to the same ExpressRoute circuit, as long as all VNets
and the ExpressRoute circuit are located within the same geopolitical region.
Troubleshooting
If a previously functioning ExpressRoute circuit now fails to connect, in the absence of any configuration changes
on-premises or within your private VNet, you may need to contact the connectivity provider and work with them
to correct the issue. Use the following PowerShell commands to verify that the ExpressRoute circuit has been
provisioned:

Get-AzureRmExpressRouteCircuit -Name <<circuit-name>> -ResourceGroupName <<resource-group>>

The output of this command shows several properties for your circuit, including ProvisioningState ,
CircuitProvisioningState , and ServiceProviderProvisioningState as shown below.
ProvisioningState : Succeeded
Sku : {
"Name": "Standard_MeteredData",
"Tier": "Standard",
"Family": "MeteredData"
}
CircuitProvisioningState : Enabled
ServiceProviderProvisioningState : NotProvisioned

If the ProvisioningState is not set to Succeeded after you tried to create a new circuit, remove the circuit by
using the command below and try to create it again.

Remove-AzureRmExpressRouteCircuit -Name <<circuit-name>> -ResourceGroupName <<resource-group>>

If your provider had already provisioned the circuit, and the ProvisioningState is set to Failed , or the
CircuitProvisioningState is not Enabled , contact your provider for further assistance.

Scalability considerations
ExpressRoute circuits provide a high bandwidth path between networks. Generally, the higher the bandwidth the
greater the cost.
ExpressRoute offers two pricing plans to customers, a metered plan and an unlimited data plan. Charges vary
according to circuit bandwidth. Available bandwidth will likely vary from provider to provider. Use the
Get-AzureRmExpressRouteServiceProvider cmdlet to see the providers available in your region and the bandwidths
that they offer.
A single ExpressRoute circuit can support a certain number of peerings and VNet links. See ExpressRoute limits
for more information.
For an extra charge, the ExpressRoute Premium add-on provides some additional capability:
Increased route limits for private peering.
Increased number of VNet links per ExpressRoute circuit.
Global connectivity for services.
See ExpressRoute pricing for details.
ExpressRoute circuits are designed to allow temporary network bursts up to two times the bandwidth limit that
you procured for no additional cost. This is achieved by using redundant links. However, not all connectivity
providers support this feature. Verify that your connectivity provider enables this feature before depending on it.
Although some providers allow you to change your bandwidth, make sure you pick an initial bandwidth that
surpasses your needs and provides room for growth. If you need to increase bandwidth in the future, you are left
with two options:
Increase the bandwidth. You should avoid this option as much as possible, and not all providers allow you
to increase bandwidth dynamically. But if a bandwidth increase is needed, check with your provider to
verify they support changing ExpressRoute bandwidth properties via PowerShell commands. If they do,
run the commands below.

$ckt = Get-AzureRmExpressRouteCircuit -Name <<circuit-name>> -ResourceGroupName <<resource-group>>


$ckt.ServiceProviderProperties.BandwidthInMbps = <<bandwidth-in-mbps>>
Set-AzureRmExpressRouteCircuit -ExpressRouteCircuit $ckt

You can increase the bandwidth without loss of connectivity. Downgrading the bandwidth will result in
disruption in connectivity, because you must delete the circuit and recreate it with the new configuration.
Change your pricing plan and/or upgrade to Premium. To do so, run the following commands. The
Sku.Tier property can be Standard or Premium ; the Sku.Name property can be MeteredData or
UnlimitedData .

$ckt = Get-AzureRmExpressRouteCircuit -Name <<circuit-name>> -ResourceGroupName <<resource-group>>

$ckt.Sku.Tier = "Premium"
$ckt.Sku.Family = "MeteredData"
$ckt.Sku.Name = "Premium_MeteredData"

Set-AzureRmExpressRouteCircuit -ExpressRouteCircuit $ckt

IMPORTANT
Make sure the Sku.Name property matches the Sku.Tier and Sku.Family . If you change the family and tier,
but not the name, your connection will be disabled.

You can upgrade the SKU without disruption, but you cannot switch from the unlimited pricing plan to
metered. When downgrading the SKU, your bandwidth consumption must remain within the default limit
of the standard SKU.

Availability considerations
ExpressRoute does not support router redundancy protocols such as hot standby routing protocol (HSRP) and
virtual router redundancy protocol (VRRP) to implement high availability. Instead, it uses a redundant pair of BGP
sessions per peering. To facilitate highly-available connections to your network, Azure provisions you with two
redundant ports on two routers (part of the Microsoft edge) in an active-active configuration.
By default, BGP sessions use an idle timeout value of 60 seconds. If a session times out three times (180 seconds
total), the router is marked as unavailable, and all traffic is redirected to the remaining router. This 180-second
timeout might be too long for critical applications. If so, you can change your BGP time-out settings on the on-
premises router to a smaller value. ExpressRoute also supports Bidirectional Forwarding Detection (BFD) over
private peering. By enabling BFD over ExpressRoute, you can expedite link failure detection between Microsoft
Enterprise edge (MSEE) devices and the routers on which you terminate the ExpressRoute circuit (PE). You can
terminate ExpressRoute over Customer Edge routing devices or Partner Edge routing devices (if you went with
managed Layer 3 connection service).
You can configure high availability for your Azure connection in different ways, depending on the type of provider
you use, and the number of ExpressRoute circuits and virtual network gateway connections you're willing to
configure. The following summarizes your availability options:
If you're using a layer 2 connection, deploy redundant routers in your on-premises network in an active-
active configuration. Connect the primary circuit to one router, and the secondary circuit to the other. This
will give you a highly available connection at both ends of the connection. This is necessary if you require
the ExpressRoute service level agreement (SLA). See SLA for Azure ExpressRoute for details.
The following diagram shows a configuration with redundant on-premises routers connected to the
primary and secondary circuits. Each circuit handles the traffic for private peering (each peering is
designated a pair of /30 address spaces, as described in the previous section).
If you're using a layer 3 connection, verify that it provides redundant BGP sessions that handle availability
for you.
Connect the VNet to multiple ExpressRoute circuits, supplied by different service providers. This strategy
provides additional high-availability and disaster recovery capabilities.
Configure a site-to-site VPN as a failover path for ExpressRoute. For more about this option, see Connect
an on-premises network to Azure using ExpressRoute with VPN failover. This option only applies to private
peering. For Azure and Microsoft 365 services, the Internet is the only failover path.

Security considerations
You can configure security options for your Azure connection in different ways, depending on your security
concerns and compliance needs.
ExpressRoute operates in layer 3. Threats in the application layer can be prevented by using a network security
appliance that restricts traffic to legitimate resources.
To maximize security, add network security appliances between the on-premises network and the provider edge
routers. This will help to restrict the inflow of unauthorized traffic from the VNet:

For auditing or compliance purposes, it may be necessary to prohibit direct access from components running in
the VNet to the Internet and implement forced tunneling. In this situation, Internet traffic should be redirected
back through a proxy running on-premises where it can be audited. The proxy can be configured to block
unauthorized traffic flowing out, and filter potentially malicious inbound traffic.
To maximize security, do not enable a public IP address for your VMs, and use NSGs to ensure that these VMs
aren't publicly accessible. VMs should only be available using the internal IP address. These addresses can be
made accessible through the ExpressRoute network, enabling on-premises DevOps staff to perform configuration
or maintenance.
If you must expose management endpoints for VMs to an external network, use NSGs or access control lists to
restrict the visibility of these ports to an allowed list of IP addresses or networks.

NOTE
Azure VMs deployed through the Azure portal can include a public IP address that provides login access. However, it is a
best practice not to permit this.

DevOps considerations
Use the Infrastructure as Code (IaC) process for deploying the infrastructure. In this architecture, we've used a set
of Azure Building Blocks custom templates deployed using the Azure portal. To automate infrastructure
deployment, you can use Azure DevOps Services or other CI/CD solutions. The deployment process is also
idempotent.
For a given resource, there can be other resources that must exist before the resource is deployed. Azure Building
Blocks templates are also good for dependency tracking because they allow you to define dependencies for
resources that are deployed in the same template.
Network monitoring
Use Network Watcher to monitor and troubleshoot the network components. Tools such as Traffic Analytics show
you which systems in your virtual networks generate the most traffic, so that you can visually identify
bottlenecks before they degenerate into problems. Network Performance Monitor can monitor
information about Microsoft ExpressRoute circuits.
You also can use the Azure Connectivity Toolkit (AzureCT) to monitor connectivity between your on-premises
datacenter and Azure.
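As a hedged example, the sketch below uses Network Watcher's connection troubleshoot capability to verify reachability from an Azure VM to an on-premises host across the circuit. The resource names and destination address are placeholders.

# Verify reachability from an Azure VM to an on-premises host over ExpressRoute.
# Resource names and the destination IP address are placeholders.
$networkWatcher = Get-AzNetworkWatcher -ResourceGroupName "NetworkWatcherRG" -Name "NetworkWatcher_eastus"
$sourceVm = Get-AzVM -ResourceGroupName "rg-hybrid" -Name "app-vm-1"

Test-AzNetworkWatcherConnectivity `
    -NetworkWatcher $networkWatcher `
    -SourceId $sourceVm.Id `
    -DestinationAddress "10.10.1.20" `
    -DestinationPort 443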
For more information, see the DevOps section in Microsoft Azure Well-Architected Framework. For information
specific to monitoring, see Monitoring For DevOps.

Cost considerations
Use the Azure pricing calculator to estimate costs. For general considerations, see the Cost section in Microsoft
Azure Well-Architected Framework.
The services used in this architecture are charged as follows:
Azure ExpressRoute
In this architecture, an ExpressRoute circuit is used to join the on-premises network with Azure through the edge
routers.
There are two main plans. In the Metered Data plan, all inbound data transfer is free. All outbound data transfer
is charged based on a pre-determined rate.
You can also opt for the Unlimited Data plan in which all inbound and outbound data transfer is free. Users are
charged a fixed monthly port fee based on high availability dual ports.
Calculate your utilization and choose a billing plan accordingly. The Unlimited Data plan is recommended if you
exceed about 68% of utilization.
For more information, see Azure ExpressRoute pricing.
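The following sketch illustrates how you might compare the two plans for a given monthly egress estimate. The rates are placeholders, not published prices; use the ExpressRoute pricing page for real numbers.

# Compare an estimated metered bill against the unlimited plan's fixed port fee.
# All rates below are placeholders.
$egressGbPerMonth = 40000      # estimated outbound transfer in GB
$meteredRatePerGb = 0.025      # placeholder outbound rate for the metered plan
$meteredPortFee   = 500        # placeholder monthly port fee, metered plan
$unlimitedPortFee = 5700       # placeholder monthly port fee, unlimited plan

$meteredTotal = $meteredPortFee + ($egressGbPerMonth * $meteredRatePerGb)
if ($meteredTotal -gt $unlimitedPortFee) {
    "Unlimited plan is cheaper: metered estimate {0:N0} vs unlimited {1:N0}" -f $meteredTotal, $unlimitedPortFee
}
else {
    "Metered plan is cheaper: metered estimate {0:N0} vs unlimited {1:N0}" -f $meteredTotal, $unlimitedPortFee
}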
Azure Virtual Network
All application tiers are hosted in a single virtual network and are segmented using subnets.
Azure Virtual Network is free. Every subscription is allowed to create up to 50 virtual networks across all regions.
All traffic that occurs within the boundaries of a virtual network is free. So, communication between two VMs in
the same virtual network is free.
Virtual machine and internal load balancers
In this architecture, internal load balancers are used to load balance traffic inside a virtual network. Basic load
balancing between virtual machines that reside in the same virtual network is free.
Virtual machine scale sets are available on all Linux and Windows VM sizes. You are only charged for the Azure
VMs you deploy and underlying infrastructure resources consumed such as storage and networking. There are no
incremental charges for the virtual machine scale sets service.
For more information, see Azure VM pricing.

Deploy the solution


Prerequisites . You must have an existing on-premises infrastructure already configured with a suitable network
appliance.
To deploy the solution, perform the following steps.
1. Click the link below.

2. Wait for the link to open in the Azure portal, then follow these steps:
The Resource group name is already defined in the parameter file, so select Create New and enter
ra-hybrid-er-rg in the text box.
Select the region from the Location drop down box.
Do not edit the Template Root Uri or the Parameter Root Uri text boxes.
Review the terms and conditions, then click the I agree to the terms and conditions stated above
checkbox.
Click the Purchase button.
3. Wait for the deployment to complete.
4. Click the link below.

5. Wait for the link to open in the Azure portal, then follow these steps:
Select Use existing in the Resource group section and enter ra-hybrid-er-rg in the text box.
Select the region from the Location drop down box.
Do not edit the Template Root Uri or the Parameter Root Uri text boxes.
Review the terms and conditions, then click the I agree to the terms and conditions stated above
checkbox.
Click the Purchase button.
6. Wait for the deployment to complete.
On-premises network connected to Azure using a
VPN gateway

This reference architecture shows how to extend a network from on premises or from Azure Stack into an Azure
virtual network, using a site-to-site virtual private network (VPN). Traffic flows between the on-premises network
and Azure through an IPSec VPN tunnel or through the Azure Stack multitenant VPN gateway. Deploy this
solution.

A diagram of the VPN gateway architecture. An on-premises network connects to an Azure virtual network
through a VPN gateway. A virtual network in Azure Stack also connects to the VPN gateway through public VIPs.
Download a Visio file of this architecture.

Architecture
The architecture consists of the following components.
On-premises network . A private local-area network running within an organization.
Azure Stack . A network environment on an Azure Stack tenant subscription, running within an
organization. The Azure Stack VPN gateway sends encrypted traffic across a public connection to virtual IP
(VIP) addresses and includes the following components:
Gateway subnet. A special subnet required to deploy the VPN Gateway on Azure Stack.
Local network gateway. Indicates the target IP of the VPN gateway in Azure, as well as the address
space of the Azure virtual network.
Site-to-site VPN tunnel. The connection type (IPSec) and the key shared with the Azure VPN Gateway to
encrypt traffic.
VPN appliance . A device or service that provides external connectivity to the on-premises network. The
VPN appliance may be a hardware device, or it can be a software solution such as the Routing and Remote
Access Service (RRAS) in Windows Server 2012. For a list of supported VPN appliances and information
on configuring them to connect to an Azure VPN gateway, see the instructions for the selected device in
the article About VPN devices for Site-to-Site VPN Gateway connections.
Virtual network . The cloud application and the components for the Azure VPN gateway reside in the
same virtual network.
Azure VPN gateway . The VPN gateway service enables you to connect the virtual network to the on-
premises network through a VPN appliance or to connect to Azure Stack through a site-to-site VPN tunnel.
For more information, see Connect an on-premises network to a Microsoft Azure virtual network. The
VPN gateway includes the following elements:
Virtual network gateway . A resource that provides a virtual VPN appliance for the virtual network. It
is responsible for routing traffic from the on-premises network to the virtual network.
Local network gateway . An abstraction of the on-premises VPN appliance. Network traffic from the
cloud application to the on-premises network is routed through this gateway.
Connection . The connection has properties that specify the connection type (IPSec) and the key
shared with the on-premises VPN appliance to encrypt traffic.
Gateway subnet . The virtual network gateway is held in its own subnet, which is subject to various
requirements, described in the Recommendations section below.
Cloud application . The application hosted in Azure. It might include multiple tiers, with multiple subnets
connected through Azure load balancers. For more information about the application infrastructure, see
Running Windows VM workloads and Running Linux VM workloads.
Internal load balancer . Network traffic from the VPN gateway is routed to the cloud application
through an internal load balancer. The load balancer is located in the front-end subnet of the application.
Bastion . Azure Bastion allows you to log into VMs in the virtual network through SSH or remote desktop
protocol (RDP) without exposing the VMs directly to the internet. If you lose connectivity through the VPN,
you can still use Bastion to manage the VMs in the virtual network.

Recommendations
The following recommendations apply for most scenarios. Follow these recommendations unless you have a
specific requirement that overrides them.
Virtual network and gateway subnet
Create an Azure virtual network with an address space large enough for all of your required resources. Ensure
that the virtual network address space has sufficient room for growth if additional VMs are likely to be needed in
the future. The address space of the virtual network must not overlap with the on-premises network. For
example, the diagram above uses the address space 10.20.0.0/16 for the virtual network.
Create a subnet named GatewaySubnet, with an address range of /27. This subnet is required by the virtual
network gateway. Allocating 32 addresses to this subnet will help to prevent reaching gateway size limitations in
the future. Also, avoid placing this subnet in the middle of the address space. A good practice is to set the address
space for the gateway subnet at the upper end of the virtual network address space. The example shown in the
diagram uses 10.20.255.224/27. Here is a quick procedure to calculate the CIDR:
1. Set the variable bits in the address space of the virtual network to 1, up to the bits being used by the gateway
subnet, then set the remaining bits to 0.
2. Convert the resulting bits to decimal and express it as an address space with the prefix length set to the size of
the gateway subnet.
For example, for a virtual network with an IP address range of 10.20.0.0/16, applying step #1 above becomes
10.20.0b11111111.0b11100000. Converting that to decimal and expressing it as an address space yields
10.20.255.224/27.
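The example ranges above can be applied directly when creating the virtual network. The sketch below uses the address spaces from the diagram; the resource names and region are placeholders.

# Create the example virtual network and add a /27 GatewaySubnet at the top of
# its address space. Resource names and the region are placeholders.
$vnet = New-AzVirtualNetwork `
    -Name "hybrid-vnet" `
    -ResourceGroupName "rg-hybrid-vpn" `
    -Location "eastus" `
    -AddressPrefix "10.20.0.0/16"

Add-AzVirtualNetworkSubnetConfig -Name "GatewaySubnet" -VirtualNetwork $vnet -AddressPrefix "10.20.255.224/27"
$vnet = Set-AzVirtualNetwork -VirtualNetwork $vnet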
WARNING
Do not deploy any VMs to the gateway subnet. Also, do not assign an NSG to this subnet, as it will cause the gateway to
stop functioning.

Virtual network gateway


Allocate a public IP address for the virtual network gateway.
Create the virtual network gateway in the gateway subnet and assign it the newly allocated public IP address. Use
the gateway type that most closely matches your requirements and that is enabled by your VPN appliance:
Create a policy-based gateway if you need to closely control how requests are routed based on policy
criteria such as address prefixes. Policy-based gateways use static routing, and only work with site-to-site
connections.
Create a route-based gateway if any of the following apply:
You connect to the on-premises network using RRAS,
You support multi-site or cross-region connections, or
You have connections between virtual networks, including routes that traverse multiple virtual
networks.
Route-based gateways use dynamic routing to direct traffic between networks. They can tolerate failures
in the network path better than static routes because they can try alternative routes. Route-based
gateways can also reduce the management overhead because routes might not need to be updated
manually when network addresses change.
For a list of supported VPN appliances, see About VPN devices for Site-to-Site VPN Gateway connections.

NOTE
After the gateway has been created, you cannot change between gateway types without deleting and re-creating the
gateway.

Select the Azure VPN gateway SKU that most closely matches your throughput requirements. For more
information, see Gateway SKUs

NOTE
The Basic SKU is not compatible with Azure ExpressRoute. You can change the SKU after the gateway has been created.
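Bringing these recommendations together, the following sketch allocates a public IP address and creates a route-based gateway in the GatewaySubnet. Resource names, region, and SKU are placeholders, and gateway creation can take up to about 45 minutes to complete.

# Allocate a public IP address and create a route-based VPN gateway in the
# GatewaySubnet. Names, region, and SKU are placeholders.
$vnet = Get-AzVirtualNetwork -Name "hybrid-vnet" -ResourceGroupName "rg-hybrid-vpn"
$gatewaySubnet = Get-AzVirtualNetworkSubnetConfig -Name "GatewaySubnet" -VirtualNetwork $vnet
$publicIp = New-AzPublicIpAddress -Name "vpngw-pip" -ResourceGroupName "rg-hybrid-vpn" `
    -Location "eastus" -AllocationMethod Dynamic

$ipConfig = New-AzVirtualNetworkGatewayIpConfig -Name "vpngw-ipconfig" `
    -SubnetId $gatewaySubnet.Id -PublicIpAddressId $publicIp.Id

New-AzVirtualNetworkGateway -Name "hybrid-vpn-gateway" -ResourceGroupName "rg-hybrid-vpn" `
    -Location "eastus" -IpConfigurations $ipConfig `
    -GatewayType Vpn -VpnType RouteBased -GatewaySku VpnGw1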

Create routing rules for the gateway subnet that direct incoming application traffic from the gateway to the
internal load balancer, rather than allowing requests to pass directly to the application VMs.
On-premises network connection
Create a local network gateway. Specify the public IP address of the on-premises VPN appliance, and the address
space of the on-premises network. Note that the on-premises VPN appliance must have a public IP address that
can be accessed by the local network gateway in Azure VPN Gateway. The VPN device cannot be located behind a
network address translation (NAT) device.
Create a site-to-site connection for the virtual network gateway and the local network gateway. Select the site-to-
site (IPSec) connection type, and specify the shared key. Site-to-site encryption with the Azure VPN gateway is
based on the IPSec protocol, using preshared keys for authentication. You specify the key when you create the
Azure VPN gateway. You must configure the VPN appliance running on-premises with the same key. Other
authentication mechanisms are not currently supported.
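As a sketch of the two steps above, the following Azure PowerShell commands create the local network gateway and the
site-to-site connection. The appliance IP address, on-premises address space, and resource names are placeholders, and
the shared key must match the value configured on the on-premises appliance.

# Represents the on-premises VPN appliance and address space (placeholder values).
$lng = New-AzureRmLocalNetworkGateway -Name "onprem-lng" -ResourceGroupName "hybrid-rg" -Location "eastus" `
    -GatewayIpAddress "203.0.113.10" -AddressPrefix "192.168.0.0/16"

# Create the IPsec site-to-site connection between the virtual network gateway and the local network gateway.
$vpngw = Get-AzureRmVirtualNetworkGateway -Name "hybrid-vpn-gw" -ResourceGroupName "hybrid-rg"
New-AzureRmVirtualNetworkGatewayConnection -Name "hybrid-s2s" -ResourceGroupName "hybrid-rg" -Location "eastus" `
    -VirtualNetworkGateway1 $vpngw -LocalNetworkGateway2 $lng -ConnectionType IPsec -SharedKey "<shared-key>"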
Ensure that the on-premises routing infrastructure is configured to forward requests intended for addresses in
the Azure virtual network to the VPN device.
Open any ports required by the cloud application in the on-premises network.
Test the connection to verify that:
The on-premises VPN appliance correctly routes traffic to the cloud application through the Azure VPN
gateway.
The virtual network correctly routes traffic back to the on-premises network.
Prohibited traffic in both directions is blocked correctly.
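For a quick spot check of the first two points, you can run a TCP connectivity test from an on-premises machine against
a VM in the virtual network. The address and port below are examples only.

# Succeeds only if the tunnel is up and routing works in both directions.
Test-NetConnection -ComputerName 10.20.0.5 -Port 80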
Azure Stack network connection
This reference architecture shows how to connect a virtual network in your Azure Stack deployment to a virtual
network in Azure through the Azure Stack multitenant VPN gateway. A common scenario is to isolate critical
operations and sensitive data in Azure Stack and take advantage of Azure for public transaction and transitory,
non-sensitive operations.
In this architecture, network traffic flows through a VPN tunnel using the multitenant gateway on Azure Stack.
Alternatively, traffic can flow over the Internet between Azure Stack and Azure through tenant VIPs, Azure
ExpressRoute, or a network virtual appliance that acts as the VPN endpoint.
Azure Stack virtual network gateway capacity
Both the Azure VPN Gateway and the Azure Stack VPN gateway support Border Gateway Protocol (BGP) for
exchanging routing information between Azure and Azure Stack. Azure Stack does not support static routing for
the multitenant gateway.
Create an Azure Stack virtual network with an assigned IP address space large enough for all your required
resources. The address space of the virtual network must not overlap with any other network that is going to be
connected to this virtual network.
A public IP address is assigned to the multitenant gateway during the deployment of Azure Stack. It is taken from
the public VIP pool. The Azure Stack operator has no control over what IP address is used but can determine its
assignment.
CAUTION

Workload VMs cannot be deployed on the Azure Stack gateway subnet. Also, do not assign an NSG to this
subnet, as it will cause the gateway to stop functioning.

Scalability considerations
You can achieve limited vertical scalability by moving from the Basic or Standard VPN Gateway SKUs to the High
Performance VPN SKU.
For virtual networks that expect a large volume of VPN traffic, consider distributing the different workloads into
separate smaller virtual networks and configuring a VPN gateway for each of them.
You can partition the virtual network either horizontally or vertically. To partition horizontally, move some VM
instances from each tier into subnets of the new virtual network. The result is that each virtual network has the
same structure and functionality. To partition vertically, redesign each tier to divide the functionality into different
logical areas (such as handling orders, invoicing, customer account management, and so on). Each functional
area can then be placed in its own virtual network.
Replicating an on-premises Active Directory domain controller in the virtual network, and implementing DNS in
the virtual network, can help to reduce some of the security-related and administrative traffic flowing from on-
premises to the cloud. For more information, see Extending Active Directory Domain Services (AD DS) to Azure.
Availability considerations
If you need to ensure that the on-premises network remains available to the Azure VPN gateway, implement a
failover cluster for the on-premises VPN gateway.
If your organization has multiple on-premises sites, create multi-site connections to one or more Azure virtual
networks. This approach requires dynamic (route-based) routing, so make sure that the on-premises VPN
gateway supports this feature.
For details about service level agreements, see SLA for VPN Gateway.
On Azure Stack, you can expand VPN gateways to include interfaces to multiple Azure Stack stamps and Azure
deployments.

DevOps considerations
Use the Infrastructure as Code (IaC) process for deploying the infrastructure. In this architecture, we've used a set
of Azure Building Blocks custom templates deployed using the Azure portal. To automate infrastructure
deployment, you can use Azure DevOps Services or other CI/CD solutions. The deployment process is also
idempotent.
For a given resource, there can be other resources that must exist before the resource is deployed. Azure Building
Blocks templates are also good for dependency tracking because they allow you to define dependencies for
resources that are deployed in the same template.
All the main resources (Virtual machine scale set, VPN gateway, Azure Bastion) are in the same virtual network so
they are isolated in the same basic workload. It's then easier to associate the workload's specific resources to a
team, so that the team can independently manage all aspects of those resources. This isolation enables DevOps
to perform continuous integration and continuous delivery (CI/CD).
Monitoring
Monitor diagnostic information from on-premises VPN appliances. This process depends on the features
provided by the VPN appliance. For example, if you are using the Routing and Remote Access Service (RRAS) on
Windows Server 2012, you can use RRAS logging.
Use Azure VPN gateway diagnostics to capture information about connectivity issues. These logs can be used to
track information such as the source and destinations of connection requests, which protocol was used, and how
the connection was established (or why the attempt failed).
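One way to capture these diagnostics centrally is to route the gateway's diagnostic logs to a Log Analytics workspace.
The sketch below uses AzureRM cmdlets and placeholder resource names; substitute your own gateway and workspace.

$gw = Get-AzureRmVirtualNetworkGateway -Name "hybrid-vpn-gw" -ResourceGroupName "hybrid-rg"
$ws = Get-AzureRmOperationalInsightsWorkspace -ResourceGroupName "monitoring-rg" -Name "hybrid-logs"

# Enable the gateway's diagnostic log categories and send them to the workspace.
Set-AzureRmDiagnosticSetting -ResourceId $gw.Id -WorkspaceId $ws.ResourceId -Enabled $true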
Monitor the operational logs of the Azure VPN gateway using the audit logs available in the Azure portal.
Separate logs are available for the local network gateway, the Azure network gateway, and the connection. This
information can be used to track any changes made to the gateway, and can be useful if a previously functioning
gateway stops working for some reason.
(Screenshot: audit log events in the Azure portal, filtered by date.)
Monitor connectivity, and track connectivity failure events. You can use a monitoring package such as Nagios to
capture and report this information.
To troubleshoot the connection, see Troubleshoot a hybrid VPN connection.
If gateway connectivity from your on-premises network to Azure is down, you can still reach the VMs in the
Azure virtual network through Azure Bastion.

Security considerations
Generate a different shared key for each VPN gateway. Use a strong shared key to help resist brute-force attacks.
For Azure Stack connections, generate a different shared key for each VPN tunnel. Use a strong shared key to
help resist brute-force attacks.

NOTE
Currently, you cannot use Azure Key Vault to preshare keys for the Azure VPN gateway.
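A simple approach is to generate a strong key locally and then set it on the connection. The sketch below is illustrative
only; the connection name and resource group are placeholders, and the same value must be configured on the
on-premises appliance.

# Generate a random 32-character pre-shared key from letters and digits.
$chars = [char[]]'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
$sharedKey = -join (1..32 | ForEach-Object { $chars | Get-Random })

# Apply the key to the existing site-to-site connection.
Set-AzureRmVirtualNetworkGatewayConnectionSharedKey -Name "hybrid-s2s" -ResourceGroupName "hybrid-rg" `
    -Value $sharedKey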

Ensure that the on-premises VPN appliance uses an encryption method that is compatible with the Azure VPN
gateway. For policy-based routing, the Azure VPN gateway supports the AES256, AES128, and 3DES encryption
algorithms. Route-based gateways support AES256 and 3DES.
If your on-premises VPN appliance is on a perimeter network (DMZ) that has a firewall between the perimeter
network and the Internet, you might have to configure additional firewall rules to allow the site-to-site VPN
connection.
If the application in the virtual network sends data to the Internet, consider implementing forced tunneling to
route all Internet-bound traffic through the on-premises network. This approach enables you to audit outgoing
requests made by the application from the on-premises infrastructure.

NOTE
Forced tunneling can impact connectivity to Azure services (the Storage Service, for example) and the Windows license
manager.
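A minimal sketch of forced tunneling with user-defined routes follows, reusing the placeholder names from the earlier
examples and assuming a route-based gateway. The route table still has to be associated with the application subnets,
and the default site tells the gateway which on-premises network receives the tunneled traffic.

# Route table that sends all Internet-bound traffic to the VPN gateway.
$rt = New-AzureRmRouteTable -Name "forced-tunnel-rt" -ResourceGroupName "hybrid-rg" -Location "eastus"
Add-AzureRmRouteConfig -Name "default-to-onprem" -RouteTable $rt -AddressPrefix "0.0.0.0/0" `
    -NextHopType VirtualNetworkGateway
Set-AzureRmRouteTable -RouteTable $rt

# Point the gateway at the on-premises site that should receive the tunneled traffic.
$vpngw = Get-AzureRmVirtualNetworkGateway -Name "hybrid-vpn-gw" -ResourceGroupName "hybrid-rg"
$lng = Get-AzureRmLocalNetworkGateway -Name "onprem-lng" -ResourceGroupName "hybrid-rg"
Set-AzureRmVirtualNetworkGatewayDefaultSite -VirtualNetworkGateway $vpngw -GatewayDefaultSite $lng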

Cost considerations
Use the Azure pricing calculator to estimate costs. For general considerations, see the Cost section in Microsoft
Azure Well-Architected Framework.
The services used in this architecture are charged as follows:
Azure VPN Gateway
The main component of this architecture is the VPN gateway service. You are charged based on the amount of
time that the gateway is provisioned and available.
All inbound traffic is free; all outbound traffic is charged. Internet bandwidth costs apply to VPN outbound
traffic.
For more information, see VPN Gateway Pricing.
Azure Virtual Network
Azure Virtual Network is free. Every subscription is allowed to create up to 50 virtual networks across all regions.
All traffic that occurs within the boundaries of a virtual network is free. So, communication between two virtual
machines in the same virtual network is free.
Azure Bastion
Azure Bastion securely connects to your virtual machines in the virtual network over RDP and SSH without the need
to configure a public IP address on each virtual machine. You will need Bastion in every virtual network that
contains virtual machines that you want to connect to. This solution is more economical and secure than using
jump boxes.
For more information, see Azure Bastion Pricing.
Virtual machine and internal load balancers
In this architecture, internal load balancers are used to load balance traffic inside a virtual network. Basic load
balancing between virtual machines that reside in the same virtual network is free.
Virtual machine scale sets are available on all Linux and Windows VM sizes. You are only charged for the Azure
VMs you deploy and underlying infrastructure resources consumed such as storage and networking. There are
no incremental charges for the virtual machine scale sets service.
For more information, see Azure VM pricing.

Deploy the solution


To deploy this reference architecture, see the GitHub readme.

Next steps
Although a VPN can be used to connect virtual networks within Azure, it isn't always the best choice. For more
information, see Choose between virtual network peering and VPN gateways in Azure.
Hybrid Connection
11/2/2020 • 2 minutes to read

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Azure Stack Hub enables you to deploy Azure services on-premises or in the cloud with a consistent application
logic, development paradigm, and operations methodology.
Hybrid cloud applications are a single system that has components running in both Azure and Azure Stack Hub.
This solution blueprint is relevant to establishing connectivity for any application that involves communications
between the Azure public cloud and on-premises Azure Stack Hub components. Hybrid connectivity is a
foundational blueprint that will be applicable to most Azure Stack Hub solutions.
Note: This doesn't apply to Azure Stack Hub deployments that are disconnected from the public internet.

Architecture

Download an SVG of this architecture.

Data Flow
1. Deploy a virtual network in Azure and Azure Stack Hub.
2. Deploy a virtual network gateway in Azure and Azure Stack Hub.
3. Deploy virtual machines in each virtual network.
4. Establish a VPN connection over the public internet between the network gateways.

Components
Virtual Network: Provision private networks, optionally connect to on-premises datacenters.
Virtual Network Gateway: Learn how to configure VPN Gateway, a virtual private network gateway.
Virtual Machines: Provision Windows and Linux virtual machines in seconds.
Azure Stack Hub is a hybrid cloud platform that lets you use Azure services on-premises.

Next steps
Virtual Network
VPN Gateway Documentation
Virtual Machines Overview
Azure Stack Hub User Documentation
Troubleshoot a hybrid VPN connection
11/2/2020 • 8 minutes to read

This article gives some tips for troubleshooting a VPN gateway connection between an on-premises network and
Azure. For general information on troubleshooting common VPN-related errors, see Troubleshooting common
VPN related errors.

Verify the VPN appliance is functioning correctly


The following recommendations are useful for determining if your on-premises VPN appliance is functioning
correctly.
Check any log files generated by the VPN appliance for errors or failures. This will help you determine if
the VPN appliance is functioning correctly. The location of this information will vary according to your appliance.
For example, if you are using RRAS on Windows Server 2012, you can use the following PowerShell command to
display error event information for the RRAS service:

Get-EventLog -LogName System -EntryType Error -Source RemoteAccess | Format-List -Property *

The Message property of each entry provides a description of the error. Some common examples are:
Inability to connect, possibly due to an incorrect IP address specified for the Azure VPN gateway in the RRAS
VPN network interface configuration.
EventID : 20111
MachineName : on-premises-vm
Data : {41, 3, 0, 0}
Index : 14231
Category : (0)
CategoryNumber : 0
EntryType : Error
Message : RoutingDomainID- {00000000-0000-0000-0000-000000000000}: A demand dial connection
to the remote
interface AzureGateway on port VPN2-4 was successfully initiated but failed to
complete
successfully because of the following error: The network connection between your
computer and
the VPN server could not be established because the remote server is not
responding. This could
be because one of the network devices (for example, firewalls, NAT, routers, and
so on) between your computer
and the remote server is not configured to allow VPN connections. Please contact
your
Administrator or your service provider to determine which device may be causing
the problem.
Source : RemoteAccess
ReplacementStrings : {{00000000-0000-0000-0000-000000000000}, AzureGateway, VPN2-4, The network
connection between
your computer and the VPN server could not be established because the remote
server is not
responding. This could be because one of the network devices (for example,
firewalls, NAT, routers, and so on)
between your computer and the remote server is not configured to allow VPN
connections. Please
contact your Administrator or your service provider to determine which device
may be causing the
problem.}
InstanceId : 20111
TimeGenerated : 3/18/2016 1:26:02 PM
TimeWritten : 3/18/2016 1:26:02 PM
UserName :
Site :
Container :

The wrong shared key being specified in the RRAS VPN network interface configuration.
EventID : 20111
MachineName : on-premises-vm
Data : {233, 53, 0, 0}
Index : 14245
Category : (0)
CategoryNumber : 0
EntryType : Error
Message : RoutingDomainID- {00000000-0000-0000-0000-000000000000}: A demand dial connection
to the remote
interface AzureGateway on port VPN2-4 was successfully initiated but failed to
complete
successfully because of the following error: Internet key exchange (IKE)
authentication credentials are unacceptable.

Source : RemoteAccess
ReplacementStrings : {{00000000-0000-0000-0000-000000000000}, AzureGateway, VPN2-4, IKE authentication
credentials are
unacceptable.
}
InstanceId : 20111
TimeGenerated : 3/18/2016 1:34:22 PM
TimeWritten : 3/18/2016 1:34:22 PM
UserName :
Site :
Container :

You can also obtain event log information about attempts to connect through the RRAS service using the following
PowerShell command:

Get-EventLog -LogName Application -Source RasClient | Format-List -Property *

In the event of a failure to connect, this log will contain errors that look similar to the following:

EventID : 20227
MachineName : on-premises-vm
Data : {}
Index : 4203
Category : (0)
CategoryNumber : 0
EntryType : Error
Message : CoId={B4000371-A67F-452F-AA4C-3125AA9CFC78}: The user SYSTEM dialed a connection named
AzureGateway that has failed. The error code returned on failure is 809.
Source : RasClient
ReplacementStrings : {{B4000371-A67F-452F-AA4C-3125AA9CFC78}, SYSTEM, AzureGateway, 809}
InstanceId : 20227
TimeGenerated : 3/18/2016 1:29:21 PM
TimeWritten : 3/18/2016 1:29:21 PM
UserName :
Site :
Container :

Verify connectivity
Verify connectivity and routing across the VPN gateway. The VPN appliance may not be correctly routing
traffic through the Azure VPN Gateway. Use a tool such as PsPing to verify connectivity and routing across the VPN
gateway. For example, to test connectivity from an on-premises machine to a web server located on the VNet, run
the following command (replacing <<web-server-address>> with the address of the web server):
PsPing -t <<web-server-address>>:80

If the on-premises machine can route traffic to the web server, you should see output similar to the following:

D:\PSTools> psping -t 10.20.0.5:80

PsPing v2.01 - PsPing - ping, latency, bandwidth measurement utility


Copyright (C) 2012-2014 Mark Russinovich
Sysinternals - www.sysinternals.com

TCP connect to 10.20.0.5:80:


Infinite iterations (warmup 1) connecting test:
Connecting to 10.20.0.5:80 (warmup): 6.21ms
Connecting to 10.20.0.5:80: 3.79ms
Connecting to 10.20.0.5:80: 3.44ms
Connecting to 10.20.0.5:80: 4.81ms

Sent = 3, Received = 3, Lost = 0 (0% loss),


Minimum = 3.44ms, Maximum = 4.81ms, Average = 4.01ms

If the on-premises machine cannot communicate with the specified destination, you will see messages like this:

D:\PSTools>psping -t 10.20.1.6:80

PsPing v2.01 - PsPing - ping, latency, bandwidth measurement utility


Copyright (C) 2012-2014 Mark Russinovich
Sysinternals - www.sysinternals.com

TCP connect to 10.20.1.6:80:


Infinite iterations (warmup 1) connecting test:
Connecting to 10.20.1.6:80 (warmup): This operation returned because the timeout period expired.
Connecting to 10.20.1.6:80: This operation returned because the timeout period expired.
Connecting to 10.20.1.6:80: This operation returned because the timeout period expired.
Connecting to 10.20.1.6:80: This operation returned because the timeout period expired.
Connecting to 10.20.1.6:80:
Sent = 3, Received = 0, Lost = 3 (100% loss),
Minimum = 0.00ms, Maximum = 0.00ms, Average = 0.00ms

Verify that the on-premises firewall allows VPN traffic to pass and that the correct ports are opened.
Verify that the on-premises VPN appliance uses an encryption method that is compatible with the
Azure VPN gateway. For policy-based routing, the Azure VPN gateway supports the AES256, AES128, and 3DES
encryption algorithms. Route-based gateways support AES256 and 3DES. For more information, see About VPN
devices and IPsec/IKE parameters for Site-to-Site VPN Gateway connections.

Check for problems with the Azure VPN gateway


The following recommendations are useful for determining if there is a problem with the Azure VPN gateway:
Examine Azure VPN gateway diagnostic logs for potential issues. For more information, see Step-by-Step:
Capturing Azure Resource Manager VNet Gateway Diagnostic Logs.
Verify that the Azure VPN gateway and on-premises VPN appliance are configured with the same
shared authentication key. You can view the shared key stored by the Azure VPN gateway using the following
Azure CLI command:

azure network vpn-connection shared-key show <<resource-group>> <<vpn-connection-name>>


Use the command appropriate for your on-premises VPN appliance to show the shared key configured for that
appliance.
Verify that the GatewaySubnet subnet holding the Azure VPN gateway is not associated with an NSG.
You can view the subnet details using the following Azure CLI command:

azure network vnet subnet show -g <<resource-group>> -e <<vnet-name>> -n GatewaySubnet

Ensure there is no data field named Network Security Group ID. The following example shows the results for an
instance of the GatewaySubnet that has an assigned NSG (VPN-Gateway-Group). This can prevent the gateway
from working correctly if there are any rules defined for this NSG.

C:\>azure network vnet subnet show -g profx-prod-rg -e profx-vnet -n GatewaySubnet


info: Executing command network vnet subnet show
+ Looking up virtual network "profx-vnet"
+ Looking up the subnet "GatewaySubnet"
data: Id : /subscriptions/########-####-####-####-
############/resourceGroups/profx-prod-rg/providers/Microsoft.Network/virtualNetworks/profx-
vnet/subnets/GatewaySubnet
data: Name : GatewaySubnet
data: Provisioning state : Succeeded
data: Address prefix : 10.20.3.0/27
data: Network Security Group id : /subscriptions/########-####-####-####-
############/resourceGroups/profx-prod-rg/providers/Microsoft.Network/networkSecurityGroups/VPN-Gateway-Group
info: network vnet subnet show command OK

Verify that the virtual machines in the Azure VNet are configured to permit traffic coming in from
outside the VNet. Check any NSG rules associated with subnets containing these virtual machines. You can view
all NSG rules using the following Azure CLI command:

azure network nsg show -g <<resource-group>> -n <<nsg-name>>

Verify that the Azure VPN gateway is connected. You can use the following Azure PowerShell command to
check the current status of the Azure VPN connection. The <<connection-name>> parameter is the name of the
Azure VPN connection that links the virtual network gateway and the local gateway.

Get-AzureRmVirtualNetworkGatewayConnection -Name <<connection-name>> -ResourceGroupName <<resource-group>>

The following snippets highlight the output generated if the gateway is connected (the first example), and
disconnected (the second example):
PS C:\> Get-AzureRmVirtualNetworkGatewayConnection -Name profx-gateway-connection -ResourceGroupName profx-prod-rg

AuthorizationKey :
VirtualNetworkGateway1 : Microsoft.Azure.Commands.Network.Models.PSVirtualNetworkGateway
VirtualNetworkGateway2 :
LocalNetworkGateway2 : Microsoft.Azure.Commands.Network.Models.PSLocalNetworkGateway
Peer :
ConnectionType : IPsec
RoutingWeight : 0
SharedKey : ####################################
ConnectionStatus : Connected
EgressBytesTransferred : 55254803
IngressBytesTransferred : 32227221
ProvisioningState : Succeeded
...

PS C:\> Get-AzureRmVirtualNetworkGatewayConnection -Name profx-gateway-connection2 -ResourceGroupName profx-prod-rg

AuthorizationKey :
VirtualNetworkGateway1 : Microsoft.Azure.Commands.Network.Models.PSVirtualNetworkGateway
VirtualNetworkGateway2 :
LocalNetworkGateway2 : Microsoft.Azure.Commands.Network.Models.PSLocalNetworkGateway
Peer :
ConnectionType : IPsec
RoutingWeight : 0
SharedKey : ####################################
ConnectionStatus : NotConnected
EgressBytesTransferred : 0
IngressBytesTransferred : 0
ProvisioningState : Succeeded
...

Miscellaneous issues
The following recommendations are useful for determining if there is an issue with Host VM configuration,
network bandwidth utilization, or application performance:
Verify firewall configuration. Verify that the firewall in the guest operating system running on the Azure VMs in
the subnet is configured correctly to allow permitted traffic from the on-premises IP ranges.
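One way to review the guest firewall from inside the VM is to list the enabled inbound allow rules together with the
remote addresses they accept. This is a sketch only; confirm that your on-premises ranges appear in the output.

# Run inside the guest OS of an Azure VM in the subnet.
Get-NetFirewallRule -Direction Inbound -Enabled True -Action Allow |
    ForEach-Object {
        $remote = ($_ | Get-NetFirewallAddressFilter).RemoteAddress -join ', '
        '{0}: {1}' -f $_.DisplayName, $remote
    }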
Verify that the volume of traffic is not close to the limit of the bandwidth available to the Azure VPN
gateway. How to verify this depends on the VPN appliance running on-premises. For example, if you are using
RRAS on Windows Server 2012, you can use Performance Monitor to track the volume of data being received and
transmitted over the VPN connection. Using the RAS Total object, select the Bytes Received/Sec and Bytes
Transmitted/Sec counters.
You should compare the results with the bandwidth available to the VPN gateway (from 100 Mbps for the Basic
SKU to 1.25 Gbps for the VpnGw3 SKU).
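On the RRAS server itself you can sample the same counters from PowerShell. This is a sketch that assumes the RAS
Total performance object is present on the server.

# Sample the counters every 5 seconds, 12 times (one minute in total).
Get-Counter -Counter "\RAS Total\Bytes Received/Sec", "\RAS Total\Bytes Transmitted/Sec" `
    -SampleInterval 5 -MaxSamples 12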

Verify that you have deployed the right number and size of VMs for your application load. Determine
if any of the virtual machines in the Azure VNet are running slowly. If so, they may be overloaded, there may be too
few to handle the load, or the load-balancers may not be configured correctly. To determine this, capture and
analyze diagnostic information. You can examine the results using the Azure portal, but many third-party tools are
also available that can provide detailed insights into the performance data.
Verify that the application is making efficient use of cloud resources. Instrument application code
running on each VM to determine whether applications are making the best use of resources. You can use tools
such as Application Insights.
Windows Virtual Desktop for the enterprise
11/13/2020 • 8 minutes to read

Windows Virtual Desktop is a desktop and application virtualization service that runs in the Azure cloud. This
article helps Desktop Infrastructure Architects, Cloud Architects, Desktop Administrators, or System
Administrators explore Windows Virtual Desktop and build virtualized desktop infrastructure (VDI) solutions at
enterprise scale. Enterprise-scale solutions generally cover 1,000 virtual desktops and above.

Relevant use cases


Most demand for enterprise virtual desktop solutions comes from:
Security and regulation applications like financial services, healthcare, and government.
Elastic workforce needs like remote work, mergers and acquisition, short-term employees, contractors, and
partner access.
Specific employees like bring your own device (BYOD) and mobile users, call centers, and branch workers.
Specialized workloads like design and engineering, legacy apps, and software development test.

Architecture

Download a copy of this Visio.


This diagram shows a typical architectural setup for Windows Virtual Desktop.
The application endpoints are in the customer's on-premises network. ExpressRoute extends the on-premises
network into the Azure cloud, and Azure AD Connect integrates the customer's Active Directory Domain
Services (AD DS) with Azure Active Directory (Azure AD).
The Windows Virtual Desktop control plane handles Web Access, Gateway, Broker, Diagnostics, and
extensibility components like REST APIs.
The customer manages AD DS and Azure AD, Azure subscriptions, virtual networks, Azure Files or Azure
NetApp Files, and the Windows Virtual Desktop host pools and workspaces.
To increase capacity, the customer uses two Azure subscriptions in a hub-spoke architecture, and connects
them via virtual network peering.
For more information about FSLogix Profile Container - Azure Files and Azure NetApp Files best practices, see
FSLogix for the enterprise
Components
Windows Virtual Desktop service architecture is similar to Windows Server Remote Desktop Services. Microsoft
manages the infrastructure and brokering components, while enterprise customers manage their own desktop
host virtual machines (VMs), data, and clients.
Components Microsoft manages
Microsoft manages the following Windows Virtual Desktop services as part of Azure:
Web Access: The Web Access service within Windows Virtual Desktop lets users access virtual desktops and
remote apps through an HTML5-compatible web browser as they would with a local PC, from anywhere on
any device. You can secure Web Access using multifactor authentication in Azure Active Directory.
Gateway: The Remote Connection Gateway service connects remote users to Windows Virtual Desktop apps
and desktops from any internet-connected device that can run a Windows Virtual Desktop client. The client
connects to a gateway, which then orchestrates a connection from a VM back to the same gateway.
Connection Broker: The Connection Broker service manages user connections to virtual desktops and
remote apps. The Connection Broker provides load balancing and reconnection to existing sessions.
Diagnostics: Remote Desktop Diagnostics is an event-based aggregator that marks each user or administrator
action on the Windows Virtual Desktop deployment as a success or failure. Administrators can query the event
aggregation to identify failing components.
Extensibility components: Windows Virtual Desktop includes several extensibility components. You can
manage Windows Virtual Desktop using Windows PowerShell or with the provided REST APIs, which also
enable support from third-party tools.
Components you manage
Customers manage these components of Windows Virtual Desktop solutions:
Azure Virtual Network: Azure Virtual Network lets Azure resources like VMs communicate privately
with each other and with the internet. By connecting Windows Virtual Desktop host pools to an Active
Directory domain, you can define network topology to access virtual desktops and virtual apps from the
intranet or internet, based on organizational policy. You can connect a Windows Virtual Desktop to an on-
premises network using a virtual private network (VPN), or use Azure ExpressRoute to extend the on-
premises network into the Azure cloud over a private connection.
Azure AD: Windows Virtual Desktop uses Azure AD for identity and access management. Azure AD
integration applies Azure AD security features like conditional access, multi-factor authentication, and the
Intelligent Security Graph, and helps maintain app compatibility in domain-joined VMs.
AD DS: Windows Virtual Desktop VMs must domain-join an AD DS service, and the AD DS must be in sync
with Azure AD to associate users between the two services. You can use Azure AD Connect to associate AD
DS with Azure AD.
Windows Virtual Desktop session hosts: A host pool can run the following operating systems:
Windows 7 Enterprise
Windows 10 Enterprise
Windows 10 Enterprise Multi-session
Windows Server 2012 R2 and above
Custom Windows system images with pre-loaded apps, group policies, or other customizations
You can choose VM sizes, including GPU-enabled VMs. Each session host has a Windows Virtual Desktop
host agent, which registers the VM as part of the Windows Virtual Desktop workspace or tenant. Each host
pool can have one or more app groups, which are collections of remote applications or desktop sessions
that users can access.
Windows Virtual Desktop workspace: The Windows Virtual Desktop workspace or tenant is a
management construct to manage and publish host pool resources.
Personal and pooled desktops
Personal desktop solutions, sometimes called persistent desktops, allow users to always connect to the same
specific session host. Users can typically modify their desktop experience to meet personal preferences, and save
files in the desktop environment. Personal desktop solutions:
Let users customize their desktop environment, including user-installed applications and saving files within the
desktop environment.
Allow assigning dedicated resources to a specific user, which can be helpful for some manufacturing or
development use cases.
Pooled desktop solutions, also called non-persistent desktops, assign users to whichever session host is currently
available, depending on the load-balancing algorithm. Because the users don't always return to the same session
host each time they connect, they have limited ability to customize the desktop environment and don't usually
have administrator access.
Windows servicing
There are several options for updating Windows Virtual Desktop desktops. Deploying an updated image every
month guarantees compliance and state.
Microsoft Endpoint Configuration Manager (MECM) updates server and desktop operating systems.
Windows Updates for Business updates desktop operating systems like Windows 10 multi-session.
Azure Update Management updates server operating systems.
Azure Log Analytics checks compliance.
Deploy a new (custom) image to session hosts every month for the latest Windows and applications updates.
You can use an image from the Azure Marketplace or a custom Azure managed image.

Considerations
Numbers in the following sections are approximate. The numbers are based on a variety of large customer
deployments, and they might change over time.
Also, note that:
You can't create more than 200 application groups per single Azure AD tenant.
We recommend that you don't publish more than 50 applications per application group.
Azure limitations
The Windows Virtual Desktop service is scalable to more than 10,000 session hosts per workspace. You can
address some Azure platform and Windows Virtual Desktop control plane limitations in the design phase to avoid
changes in the scaling phase.
We recommend deploying no more than 5,000 VMs per Azure subscription per region. This recommendation
applies to both personal and pooled host pools based on Windows 10 Enterprise single-session and multi-session.
Most customers use Windows 10 Enterprise multi-session, which allows multiple users to log on to each VM.
You can increase the resources of individual session host VMs to accommodate more user sessions.
For automated session host scaling tools, the limits are around 1,200 VMs per Azure subscription per region,
because VM status interaction consumes more resources.
To manage enterprise environments with more than 5,000 VMs per Azure subscription in the same region, you
can create multiple Azure subscriptions in a hub-spoke architecture and connect them via virtual network
peering, as in the preceding example architecture. You could also deploy VMs in a different region in the same
subscription to increase the number of VMs.
Azure Resource Manager (ARM) subscription API throttling limits don't allow more than 600 Azure VM reboots
per hour via the Azure portal. You can reboot all your machines at once via the operating system, which doesn't
consume any Azure Resource Manager subscription API calls. For more information about counting and
troubleshooting throttling limits based on your Azure subscription, see Troubleshoot API throttling errors.
You can currently deploy 399 VMs per Windows Virtual Desktop ARM template deployment without
Availability Sets, or 200 VMs per Availability Set. You can increase the number of VMs per deployment by
switching off Availability Sets in either the ARM template or the Azure portal host pool enrollment.
Azure VM session host name prefixes can't exceed 11 characters, due to auto-assigning of instance names and
the NetBIOS limit of 15 characters per computer account.
By default, you can deploy up to 800 instances of most resource types in a resource group. Azure Compute
doesn't have this limit.
For more information about Azure subscription limitations, see Azure subscription and service limits, quotas, and
constraints.
VM sizing
Virtual machine sizing guidelines lists the maximum suggested number of users per virtual central processing
unit (vCPU) and minimum VM configurations for different workloads. This data helps estimate the VMs you need
in your host pool.
Use simulation tools to test deployments with both stress tests and real-life usage simulations. Make sure the
system is responsive and resilient enough to meet user needs, and remember to vary the load sizes when testing.
Pricing
Architect your Windows Virtual Desktop solution to realize cost savings. Here are several options that help
enterprises manage costs:
Windows 10 multi-session: By delivering a multi-session desktop experience for users that have identical
compute requirements, you can let more users log onto a single VM at once, resulting in considerable cost
savings.
Azure Hybrid Benefit: If you have Software Assurance, you can use Azure Hybrid Benefit for Windows Server
to save on the cost of your Azure infrastructure.
Azure Reserved Instances: You can prepay for your VM usage and save money. Combine Azure Reserved
Instances with Azure Hybrid Benefit for up to 80 percent savings over list prices.
Session host load-balancing: When setting up session hosts, breadth-first is the standard default mode,
which spreads users randomly across session hosts. Depth-first mode fills up a session host server with the
maximum number of users before it moves on to the next session host. You can adjust this setting for
maximum cost benefits.

Next steps
FSLogix for the enterprise - best practices documentation
Use the new ARM templates to automate the deployment of your Windows Virtual Desktop environment.
These ARM templates support only Azure Resource Manager Windows Virtual Desktop objects. These ARM
templates don't support Windows Virtual Desktop (classic).
For multiple AD forests architecture, read Multiple AD Forests Architecture in Windows Virtual Desktop.
Windows Virtual Desktop partner integrations lists approved Windows Virtual Desktop partner providers and
independent software vendors.
Use the resources at Windows_10_VDI_Optimize to help optimize performance in a Windows 10 Enterprise
VDI environment.
Multiple AD forests architecture with Windows Virtual
Desktop
11/2/2020 • 5 minutes to read

Many organizations want to use Windows Virtual Desktop (WVD) in environments with multiple
on-premises Active Directory forests. This article expands on the architecture described in the WVD at enterprise
scale article and helps you understand how multiple domains and WVD can be integrated using Azure AD Connect to
sync users from on-premises Active Directory Domain Services (AD DS) to Azure Active Directory (Azure AD).
The following are some relevant use cases for this architecture:
Mergers and acquisitions, organization rebranding, and multiple on-premises identities.
Complex on-premises Active Directory environments (multi-forest, multi-domain, group policy (or GPO)
requirements, and legacy authentication).
Use of on-premises GPO infrastructure with Azure WVD.

NOTE
Active Directory Domain Services (AD DS) is a self-managed, on-premises component in many hybrid environments,
whereas Azure Active Directory Domain Services (Azure AD DS) provides managed domain services with a subset of fully-
compatible traditional AD DS features such as domain join, group policy, LDAP, and Kerberos/NTLM authentication. Read a
detailed comparison of these components in Compare self-managed Active Directory Domain Services, Azure Active
Directory, and managed Azure Active Directory Domain Services.
The solution idea Multiple WVD forests using Azure Active Directory Domain Services discusses this architecture using the
cloud-managed Azure AD DS.

Architecture
Download a Visio file of this architecture

Scenario
This architecture diagram shows a typical scenario that involves the following:
Azure AD tenant is available for the new company named as NewCompanyAB.onmicrosoft.com .
Azure AD Connect syncs users from on-premises AD DS to Azure Active Directory (Azure AD).
Each of the company A and company B has a separate Azure subscription. They also have a shared services
subscription referred to as the Subscription 1 in the above diagram.
An Azure hub-spoke architecture is implemented with a shared services hub virtual network (VNet).
Complex hybrid on-premises Active Directory environments are present with two or more AD forests. Domains
live in separate forests, each with a different UPN suffix. For example, companyA.local with the UPN suffix
companyA.com, companyB.local with the UPN suffix CompanyB.com, and an additional UPN suffix
newcompanyAB.com.
Domain controllers for both forests are located on-premises and in Azure.
Verified domains are present in Azure for CompanyA.com, CompanyB.com, and NewCompanyAB.com.
Group Policy (GPO) and legacy authentication such as Kerberos, NTLM, and LDAP are used.
For Azure environments that still have a dependency on on-premises infrastructure, private connectivity (site-to-site
VPN or Azure ExpressRoute) is set up between on-premises and Azure.
The WVD environment consists of a WVD workspace for each business unit, and two host pools per workspace.
The WVD session hosts are joined to domain controllers in Azure, that is, companyA session hosts join the
companyA.local domain, and CompanyB session hosts join the CompanyB.local domain.
Azure Storage accounts can leverage Azure Files for FSLogix profiles. One account is created per company
domain (that is, companyA.local and companyB.local), and joined to the corresponding domain.

Components
This architecture uses the same components as listed in WVD at enterprise scale Architecture.
Additionally, the following components are also used in this architecture:
Azure AD Connect in staging mode: A staging server for Azure AD Connect topologies provides
additional redundancy for the Azure AD Connect instance.
Azure subscriptions, WVD workspaces, and host pools: Multiple subscriptions, WVD workspaces,
and host pools can be leveraged for administration boundaries and business requirements.

Data flow
In this architecture, the identity flow works as follows.
1. Azure AD Connect syncs users from both CompanyA.com and CompanyB.com to Azure AD tenant
(NewCompanyAB.onmicrosoft.com).
2. Host pools, workspaces, and app groups are created in the respective subscriptions and spoke virtual networks.
3. Users are assigned to the app groups.
4. WVD session hosts in the host pools join the domains CompanyA.com and CompanyB.com using the domain
controllers in Azure.
5. Users sign in using either the WVD Desktop or a web client with the corresponding format:
[email protected], [email protected], or [email protected], depending on the UPN
suffix configured.
6. Users are presented with their respective virtual desktops or apps. For example, users in CompanyA will be
presented with virtual desktops or apps in Workspace A, host pool 1 or 2.
7. FSLogix user profiles are created in Azure Files shares on the corresponding storage accounts.
8. Group Policy Objects (GPO) synced from on-premises are applied to users and WVD session hosts.

Considerations
Keep in mind the following considerations while designing your workload based on this architecture.
Group Policy Objects (GPO )
To extend GPO infrastructure for WVD, the on-premises domain controllers should sync to the Azure IaaS
domain controllers.
Extending the GPO infrastructure to Azure IaaS domain controllers requires private connectivity.
Network and connectivity
The domain controllers are shared components, so they need to be deployed in a shared services hub VNet in
this hub-spoke architecture.
WVD session hosts join the domain controller in Azure over their respective hub-spoke vNet peering.
Azure Storage
The following design considerations apply to user profile containers, cloud cache containers, and MSIX packages:
Both Azure Files and Azure NetApp Files can be used in this scenario. Choose the right solution based on factors such
as expected performance, cost, and so on.
Both Azure Storage accounts and Azure NetApp Files can be joined to only a single AD DS domain at a time. In these
cases, multiple Azure Storage accounts or Azure NetApp Files instances will be required.
Azure Active Directory
In scenarios with users in multiple on-premises Active Directory forests, only one Azure AD Connect sync server is
connected to the Azure AD tenant. An exception to this is an AD Connect used in staging mode.

The following identity topologies are supported:


Multiple on-premises Active Directory forests.
One or more resource forests trust all account forests.
A full mesh topology allows users and resources to be in any forest. Commonly, there are two-way trusts
between the forests.
For more details, read the Staging server section of Azure AD Connect topologies.

Next steps
For more information, see these articles:
Azure AD Connect topology.
Windows Virtual Desktop for the enterprise article
Compare different Identity options: Self-managed Active Directory Domain Services (AD DS), Azure Active
Directory (Azure AD), and Azure Active Directory Domain Services (Azure AD DS).
Solution idea Multi forest with Azure AD DS.
Windows Virtual Desktop Documentation.
Multiple AD forests with Windows Virtual Desktop
and Azure AD DS
11/13/2020 • 2 minutes to read

This solution idea shows how you can deploy WVD rapidly in a minimum viable product (MVP) or a proof of
concept (PoC) environment with the use of Azure Active Directory Domain Services (Azure AD DS). Use this idea to
extend on-premises multi-forest AD DS identities to Azure without private connectivity and also support legacy
authentication.
This solution idea also applies to mergers and acquisitions, organization rebranding, and multiple on-premises
identities requirements.

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Architecture

Download a Visio file of this architecture

Data flow
The following steps show how the data flows in this architecture in the form of identity.
1. Complex hybrid on-premises Active Directory environments are present, with two or more AD forests.
Domains live in separate forests, with distinct UPN suffixes. For example, companyA.local with UPN suffix
companyA.com, companyB.local with UPN suffix CompanyB.com, and an additional UPN suffix
newcompanyAB.com.
2. Instead of using customer-managed domain controllers whether on-premises or on Azure (that is, Azure IaaS
domain controllers), the two cloud-managed domain controllers provided by Azure AD DS are used.
3. Azure AD Connect syncs users from both CompanyA.com and CompanyB.com to the Azure AD tenant
(NewCompanyAB.onmicrosoft.com). The user account is represented only once in Azure AD and private
connectivity is not used.
4. Users then sync from Azure AD to the managed Azure AD DS as a one-way sync.
5. A custom and routable Azure AD DS domain name is created (aadds.newcompanyAB.com). The
newcompanyAB.com is a registered domain to support LDAP certificates. It is generally recommended not to
use non-routable domain names (such as contoso.local), because they can cause issues with DNS resolution.
6. The WVD session hosts join the Azure AD DS domain controllers.
7. Host pools and app groups can be created in a separate subscription and spoke virtual network.
8. Users are assigned to the app groups.
9. Users sign in via WVD Desktop or web client with a format such as, [email protected],
[email protected], or [email protected], depending on the UPN suffix configured.
10. Users are presented with their respective virtual desktops or apps. For example, [email protected] will be
presented with virtual desktops or apps in host pool A, jane@companyB will be presented with virtual
desktops or apps in host pool B, and joe@newcompanyAB will be presented with virtual desktops or apps in
host pool AB.
11. The storage account (Azure Files used for FSLogix) is joined to the managed domain AD DS. The FSLogix user
profiles are created in Azure Files shares.

NOTE
1. For Group Policy requirements in Azure AD DS, you can install Group Policy Management tools on a Windows Server
virtual machine that is joined to Azure AD DS.
2. To extend GPO infrastructure for WVD from the on-premises domain controllers, manual export and import to Azure AD
DS is required.

Next steps
For more information, see these articles:
Multiple AD forests architecture with Windows Virtual Desktop.
Windows Virtual Desktop for enterprises
Azure AD Connect Topology.
Compare different Identity options: Self-managed Active Directory Domain Services (AD DS), Azure Active
Directory (Azure AD), and Azure Active Directory Domain Services (Azure AD DS).
Windows Virtual Desktop Documentation.
Manage identity in multitenant applications
11/2/2020 • 3 minutes to read

This series of articles describes best practices for multitenancy when using Azure AD for authentication and
identity management.

Sample code
When you're building a multitenant application, one of the first challenges is managing user identities, because
now every user belongs to a tenant. For example:
Users sign in with their organizational credentials.
Users should have access to their organization's data, but not data that belongs to other tenants.
An organization can sign up for the application, and then assign application roles to its members.
Azure Active Directory (Azure AD) has some great features that support all of these scenarios.
To accompany this series of articles, we created a complete end-to-end implementation of a multitenant
application. The articles reflect what we learned in the process of building the application. To get started with the
application, see the GitHub readme.

Introduction
Let's say you're writing an enterprise SaaS application to be hosted in the cloud. Of course, the application will
have users:

But those users belong to organizations:

Example: Tailspin sells subscriptions to its SaaS application. Contoso and Fabrikam sign up for the app. When Alice
( alice@contoso ) signs in, the application should know that Alice is part of Contoso.
Alice should have access to Contoso data.
Alice should not have access to Fabrikam data.
This guidance will show you how to manage user identities in a multitenant application, using Azure Active
Directory (Azure AD) to handle sign-in and authentication.

What is multitenancy?
A tenant is a group of users. In a SaaS application, the tenant is a subscriber or customer of the application.
Multitenancy is an architecture where multiple tenants share the same physical instance of the app. Although
tenants share physical resources (such as VMs or storage), each tenant gets its own logical instance of the app.
Typically, application data is shared among the users within a tenant, but not with other tenants.

Compare this architecture with a single-tenant architecture, where each tenant has a dedicated physical instance.
In a single-tenant architecture, you add tenants by spinning up new instances of the app.

Multitenancy and horizontal scaling


To achieve scale in the cloud, it's common to add more physical instances. This is known as horizontal scaling or
scaling out. Consider a web app. To handle more traffic, you can add more server VMs and put them behind a load
balancer. Each VM runs a separate physical instance of the web app.

Any request can be routed to any instance. Together, the system functions as a single logical instance. You can tear
down a VM or spin up a new VM, without affecting users. In this architecture, each physical instance is multitenant,
and you scale by adding more instances. If one instance goes down, it should not affect any tenant.

Identity in a multitenant app


In a multitenant app, you must consider users in the context of tenants.
Authentication
Users sign into the app with their organization credentials. They don't have to create new user profiles for the
app.
Users within the same organization are part of the same tenant.
When a user signs in, the application knows which tenant the user belongs to.
Authorization
When authorizing a user's actions (say, viewing a resource), the app must take into account the user's tenant.
Users might be assigned roles within the application, such as "Admin" or "Standard User". Role assignments
should be managed by the customer, not by the SaaS provider.
Example. Alice, an employee at Contoso, navigates to the application in her browser and clicks the "Log in" button.
She is redirected to a sign-in screen where she enters her corporate credentials (username and password). At this
point, she is logged into the app as [email protected] . The application also knows that Alice is an admin user for
this application. Because she is an admin, she can see a list of all the resources that belong to Contoso. However,
she cannot view Fabrikam's resources, because she is an admin only within her tenant.
In this guidance, we'll look specifically at using Azure AD for identity management.
We assume the customer stores their user profiles in Azure AD (including Office365 and Dynamics CRM
tenants)
Customers with on-premises Active Directory can use Azure AD Connect to sync their on-premises Active
Directory with Azure AD. If a customer with on-premises Active Directory cannot use Azure AD Connect (due to
corporate IT policy or other reasons), the SaaS provider can federate with the customer's directory through
Active Directory Federation Services (AD FS). This option is described in Federating with a customer's AD FS.
This guidance does not consider other aspects of multitenancy such as data partitioning, per-tenant configuration,
and so forth.
Next
The Tailspin scenario
11/2/2020 • 2 minutes to read

Sample code
Tailspin is a fictional company that is developing a SaaS application named Surveys. This application enables
organizations to create and publish online surveys.
An organization can sign up for the application.
After the organization is signed up, users can sign into the application with their organizational credentials.
Users can create, edit, and publish surveys.

NOTE
To get started with the application, see the GitHub readme.

Users can create, edit, and view surveys


An authenticated user can view all the surveys that he or she has created or has contributor rights to, and create
new surveys. Notice that the user is signed in with his organizational identity, [email protected] .

This screenshot shows the Edit Survey page:


Users can also view any surveys created by other users within the same tenant.

Survey owners can invite contributors


When a user creates a survey, he or she can invite other people to be contributors on the survey. Contributors can
edit the survey, but cannot delete or publish it.
A user can add contributors from other tenants, which enables cross-tenant sharing of resources. In this
screenshot, Bob ( [email protected] ) is adding Alice ( [email protected] ) as a contributor to a survey that Bob
created.
When Alice logs in, she sees the survey listed under "Surveys I can contribute to".

Note that Alice signs into her own tenant, not as a guest of the Contoso tenant. Alice has contributor permissions
only for that survey — she cannot view other surveys from the Contoso tenant.

Architecture
The Surveys application consists of a web front end and a web API backend. Both are implemented using ASP.NET
Core.
The web application uses Azure Active Directory (Azure AD) to authenticate users. The web application also calls
Azure AD to get OAuth 2 access tokens for the Web API. Access tokens are cached in Azure Cache for Redis. The
cache enables multiple instances to share the same token cache (for example, in a server farm).
(Architecture diagram: the Surveys web application authenticates with Azure AD to get access tokens for the web API, and caches the tokens in the Azure Cache for Redis token cache.)
Next
Authenticate using Azure AD and OpenID Connect
11/2/2020 • 7 minutes to read

Sample code
The Surveys application uses the OpenID Connect (OIDC) protocol to authenticate users with Azure Active
Directory (Azure AD). The Surveys application uses ASP.NET Core, which has built-in middleware for OIDC. The
following diagram shows what happens when the user signs in, at a high level.

1. The user clicks the "sign in" button in the app. This action is handled by an MVC controller.
2. The MVC controller returns a ChallengeResult action.
3. The middleware intercepts the ChallengeResult and creates a 302 response, which redirects the user to the
Azure AD sign-in page.
4. The user authenticates with Azure AD.
5. Azure AD sends an ID token to the application.
6. The middleware validates the ID token. At this point, the user is now authenticated inside the application.
7. The middleware redirects the user back to application.

Register the app with Azure AD


To enable OpenID Connect, the SaaS provider registers the application inside their own Azure AD tenant.
To register the application, follow the steps in Quickstart: Register an application with the Microsoft identity
platform.
To enable this functionality in the sample Surveys application, see the GitHub readme. Note the following:
For a multitenant application, you must configure the multitenanted option explicitly. This enables other
organizations to access the application.
The reply URL is the URL where Azure AD will send OAuth 2.0 responses. When using the ASP.NET Core, this
needs to match the path that you configure in the authentication middleware (see next section).

Configure the auth middleware


This section describes how to configure the authentication middleware in ASP.NET Core for multitenant
authentication with OpenID Connect.
In your startup class, add the OpenID Connect middleware:

services.AddAuthentication().AddOpenIdConnect(options => {
options.ClientId = configOptions.AzureAd.ClientId;
options.ClientSecret = configOptions.AzureAd.ClientSecret; // for code flow
options.Authority = Constants.AuthEndpointPrefix;
options.ResponseType = OpenIdConnectResponseType.CodeIdToken;
options.PostLogoutRedirectUri = configOptions.AzureAd.PostLogoutRedirectUri;
options.SignInScheme = CookieAuthenticationDefaults.AuthenticationScheme;
options.TokenValidationParameters = new TokenValidationParameters { ValidateIssuer = false };
options.Events = new SurveyAuthenticationEvents(configOptions.AzureAd, loggerFactory);
});

Notice that some of the settings are taken from runtime configuration options. Here's what the middleware options
mean:
ClientId . The application's client ID, which you got when you registered the application in Azure AD.
Authority . For a multitenant application, set this to https://login.microsoftonline.com/common/ . This is the URL
for the Azure AD common endpoint, which enables users from any Azure AD tenant to sign in. For more
information about the common endpoint, see this blog post.
In TokenValidationParameters , set ValidateIssuer to false. That means the app will be responsible for
validating the issuer value in the ID token. (The middleware still validates the token itself.) For more information
about validating the issuer, see Issuer validation.
PostLogoutRedirectUri . Specify a URL to redirect users to after they sign out. This should be a page that allows
anonymous requests — typically the home page.
SignInScheme . Set this to CookieAuthenticationDefaults.AuthenticationScheme . This setting means that after
the user is authenticated, the user claims are stored locally in a cookie. This cookie is how the user stays logged
in during the browser session.
Events. Event callbacks; see Authentication events.
Also add the Cookie Authentication middleware to the pipeline. This middleware is responsible for writing the user
claims to a cookie, and then reading the cookie during subsequent page loads.

services.AddAuthentication(options => {
    // The 1.x AutomaticAuthenticate / AutomaticChallenge flags are replaced by default scheme settings.
    options.DefaultScheme = CookieAuthenticationDefaults.AuthenticationScheme;
    options.DefaultChallengeScheme = OpenIdConnectDefaults.AuthenticationScheme;
}).AddCookie(options => {
options.AccessDeniedPath = "/Home/Forbidden";
options.CookieSecure = CookieSecurePolicy.Always;

// The default setting for cookie expiration is 14 days. SlidingExpiration is set to true by default
options.ExpireTimeSpan = TimeSpan.FromHours(1);
options.SlidingExpiration = true;
});

Initiate the authentication flow


To start the authentication flow in ASP.NET MVC, return a ChallengeResult from the controller:
[AllowAnonymous]
public IActionResult SignIn()
{
return new ChallengeResult(
OpenIdConnectDefaults.AuthenticationScheme,
new AuthenticationProperties
{
IsPersistent = true,
RedirectUri = Url.Action("SignInCallback", "Account")
});
}

This causes the middleware to return a 302 (Found) response that redirects to the authentication endpoint.

User login sessions


As mentioned, when the user first signs in, the Cookie Authentication middleware writes the user claims to a
cookie. After that, HTTP requests are authenticated by reading the cookie.
By default, the cookie middleware writes a session cookie, which gets deleted once the user closes the browser. The
next time the user visits the site, they will have to sign in again. However, if you set IsPersistent to true in the
ChallengeResult , the middleware writes a persistent cookie, so the user stays logged in after closing the browser.
You can configure the cookie expiration; see Controlling cookie options. Persistent cookies are more convenient for
the user, but may be inappropriate for some applications (say, a banking application) where you want the user to
sign in every time.
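If you need this control per sign-in rather than globally, you can also set the values on the AuthenticationProperties passed to the ChallengeResult . This is an illustrative sketch, not the Surveys implementation:

return new ChallengeResult(
    OpenIdConnectDefaults.AuthenticationScheme,
    new AuthenticationProperties
    {
        IsPersistent = false,                           // session cookie; sign in again after the browser closes
        ExpiresUtc = DateTimeOffset.UtcNow.AddHours(1), // explicit upper bound on the authentication ticket
        RedirectUri = Url.Action("SignInCallback", "Account")
    });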

About the OpenID Connect middleware


The OpenID Connect middleware in ASP.NET hides most of the protocol details. This section contains some notes
about the implementation that may be useful for understanding the protocol flow.
First, let's examine the authentication flow in terms of ASP.NET (ignoring the details of the OIDC protocol flow
between the app and Azure AD). The following diagram shows the process.

In this diagram, there are two MVC controllers. The Account controller handles sign-in requests, and the Home
controller serves up the home page.
Here is the authentication process:
1. The user clicks the "Sign in" button, and the browser sends a GET request. For example: GET /Account/SignIn/ .
2. The account controller returns a ChallengeResult .
3. The OIDC middleware returns an HTTP 302 response, redirecting to Azure AD.
4. The browser sends the authentication request to Azure AD.
5. The user signs in to Azure AD, and Azure AD sends back an authentication response.
6. The OIDC middleware creates a claims principal and passes it to the Cookie Authentication middleware.
7. The cookie middleware serializes the claims principal and sets a cookie.
8. The OIDC middleware redirects to the application's callback URL.
9. The browser follows the redirect, sending the cookie in the request.
10. The cookie middleware deserializes the cookie to a claims principal and sets HttpContext.User equal to the
claims principal. The request is routed to an MVC controller.
Authentication ticket
If authentication succeeds, the OIDC middleware creates an authentication ticket, which contains a claims principal
that holds the user's claims.

NOTE
Until the entire authentication flow is completed, HttpContext.User still holds an anonymous principal, not the
authenticated user. The anonymous principal has an empty claims collection. After authentication completes and the app
redirects, the cookie middleware deserializes the authentication cookie and sets HttpContext.User to a claims principal that
represents the authenticated user.

Authentication events
During the authentication process, the OpenID Connect middleware raises a series of events:
RedirectToIdentityProvider . Called right before the middleware redirects to the authentication endpoint. You
can use this event to modify the redirect URL; for example, to add request parameters. See Adding the admin
consent prompt for an example.
AuthorizationCodeReceived . Called with the authorization code.
TokenResponseReceived . Called after the middleware gets an access token from the IDP, but before it is
validated. Applies only to authorization code flow.
TokenValidated . Called after the middleware validates the ID token. At this point, the application has a set of
validated claims about the user. You can use this event to perform additional validation on the claims, or to
transform claims. See Working with claims.
UserInformationReceived . Called if the middleware gets the user profile from the user info endpoint. Applies
only to authorization code flow, and only when GetClaimsFromUserInfoEndpoint = true in the middleware
options.
TicketReceived . Called when authentication is completed. This is the last event, assuming that authentication
succeeds. After this event is handled, the user is signed into the app.
AuthenticationFailed . Called if authentication fails. Use this event to handle authentication failures — for
example, by redirecting to an error page.
To provide callbacks for these events, set the Events option on the middleware. There are two different ways to
declare the event handlers: inline with lambdas, or in a class that derives from OpenIdConnectEvents . The
second approach is recommended if your event callbacks have any substantial logic, so they don't clutter your
startup class. Our reference implementation uses this approach.
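As a rough sketch of the class-based approach (the class name and the handling logic here are illustrative, not the reference implementation):

public class SampleAuthenticationEvents : OpenIdConnectEvents
{
    public override Task TokenValidated(TokenValidatedContext context)
    {
        // Inspect or transform the validated claims here.
        return base.TokenValidated(context);
    }

    public override Task AuthenticationFailed(AuthenticationFailedContext context)
    {
        // Handle failures, for example by redirecting to an error page.
        context.Response.Redirect("/Home/Error");
        context.HandleResponse();
        return Task.CompletedTask;
    }
}

// In the OpenID Connect options: options.Events = new SampleAuthenticationEvents();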
OpenID Connect endpoints
Azure AD supports OpenID Connect Discovery, wherein the identity provider (IDP) returns a JSON metadata
document from a well-known endpoint. The metadata document contains information such as:
The URL of the authorization endpoint. This is where the app redirects to authenticate the user.
The URL of the "end session" endpoint, where the app goes to log out the user.
The URL to get the signing keys, which the client uses to validate the OIDC tokens that it gets from the IDP.
By default, the OIDC middleware knows how to fetch this metadata. Set the Authority option in the middleware,
and the middleware constructs the URL for the metadata. (You can override the metadata URL by setting the
MetadataAddress option.)
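For example, with the common endpoint, the middleware resolves the standard discovery document on its own; the commented MetadataAddress line below is only needed for non-standard metadata locations and is shown as an illustration:

services.AddAuthentication().AddOpenIdConnect(options =>
{
    // The middleware derives the discovery document URL from Authority:
    // https://login.microsoftonline.com/common/.well-known/openid-configuration
    options.Authority = "https://login.microsoftonline.com/common/";

    // Override only if the metadata lives somewhere non-standard:
    // options.MetadataAddress = "https://login.microsoftonline.com/common/.well-known/openid-configuration";
});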
OpenID Connect flows
By default, the OIDC middleware uses hybrid flow with form post response mode.
Hybrid flow means the client can get an ID token and an authorization code in the same round-trip to the
authorization server.
Form post response mode means the authorization server uses an HTTP POST request to send the ID token and
authorization code to the app. The values are form-urlencoded (content type = "application/x-www-form-
urlencoded").
When the OIDC middleware redirects to the authorization endpoint, the redirect URL includes all of the query
string parameters needed by OIDC. For hybrid flow:
client_id. This value is set in the ClientId option.
scope = "openid profile", which means it's an OIDC request and we want the user's profile.
response_type = "code id_token". This specifies hybrid flow.
response_mode = "form_post". This specifies form post response.
To specify a different flow, set the ResponseType property on the options. For example:

services.AddAuthentication().AddOpenIdConnect(options =>
{
    options.ResponseType = "code"; // Authorization code flow

    // Other options
});

Next
Work with claims-based identities

Sample code

Claims in Azure AD
When a user signs in, Azure AD sends an ID token that contains a set of claims about the user. A claim is simply a
piece of information, expressed as a key/value pair. For example, email = [email protected] . Claims have an issuer
— in this case, Azure AD — which is the entity that authenticates the user and creates the claims. You trust the
claims because you trust the issuer. (Conversely, if you don't trust the issuer, don't trust the claims!)
At a high level:
1. The user authenticates.
2. The Identity Provider (IDP) sends a set of claims.
3. The app normalizes or augments the claims (optional).
4. The app uses the claims to make authorization decisions.
In OpenID Connect, the set of claims that you get is controlled by the scope parameter of the authentication
request. However, Azure AD issues a limited set of claims through OpenID Connect; see Supported Token and
Claim Types. If you want more information about the user, you'll need to use the Azure AD Graph API.
Here are some of the claims from Azure AD that an app might typically care about:

aud. Who the token was issued for. This will be the application's client ID. Generally, you shouldn't need to worry
about this claim, because the middleware automatically validates it. Example:
"91464657-d17a-4327-91f3-2ed99386406f"

groups. A list of Azure AD groups of which the user is a member. Example:
["93e8f556-8661-4955-87b6-890bc043c30f", "fc781505-18ef-4a31-a7d5-7d931d7b857e"]

iss. The issuer of the OIDC token. Example:
https://sts.windows.net/b9bd2162-77ac-4fb2-8254-5c36e9c0a9c4/

name. The user's display name. Example: "Alice A."

oid. The object identifier for the user in Azure AD. This value is the immutable and non-reusable identifier of the
user. Use this value, not email, as a unique identifier for users; email addresses can change. If you use the Azure AD
Graph API in your app, object ID is the value used to query profile information. Example:
"59f9d2dc-995a-4ddf-915e-b3bb314a7fa4"

roles. A list of app roles for the user. Example: ["SurveyCreator"]

tid. Tenant ID. This value is a unique identifier for the tenant in Azure AD. Example:
"b9bd2162-77ac-4fb2-8254-5c36e9c0a9c4"

unique_name. A human readable display name of the user. Example: "[email protected]"

upn. User principal name. Example: "[email protected]"

These claim types are listed as they appear in the ID token. In ASP.NET Core, the OpenID Connect middleware
converts some of the claim types when it populates the Claims collection for the user principal:

oid > http://schemas.microsoft.com/identity/claims/objectidentifier
tid > http://schemas.microsoft.com/identity/claims/tenantid
unique_name > http://schemas.xmlsoap.org/ws/2005/05/identity/claims/name
upn > http://schemas.xmlsoap.org/ws/2005/05/identity/claims/upn
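For example, to read the converted values from the current principal after sign-in (an illustrative fragment):

// The middleware maps "oid", "tid", and "unique_name" to these claim types on the user principal.
string objectId = User.FindFirst("http://schemas.microsoft.com/identity/claims/objectidentifier")?.Value;
string tenantId = User.FindFirst("http://schemas.microsoft.com/identity/claims/tenantid")?.Value;
string displayName = User.FindFirst(ClaimTypes.Name)?.Value;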

Claims transformations
During the authentication flow, you might want to modify the claims that you get from the IDP. In ASP.NET Core,
you can perform claims transformation inside of the AuthenticationValidated event from the OpenID Connect
middleware. (See Authentication events.)
Any claims that you add during AuthenticationValidated are stored in the session authentication cookie. They
don't get pushed back to Azure AD.
Here are some examples of claims transformation:
Claims normalization , or making claims consistent across users. This is particularly relevant if you are
getting claims from multiple IDPs, which might use different claim types for similar information. For
example, Azure AD sends a "upn" claim that contains the user's email. Other IDPs might send an "email"
claim. The following code converts the "upn" claim into an "email" claim:

var email = principal.FindFirst(ClaimTypes.Upn)?.Value;


if (!string.IsNullOrWhiteSpace(email))
{
identity.AddClaim(new Claim(ClaimTypes.Email, email));
}

Add default claim values for claims that aren't present — for example, assigning a user to a default role.
In some cases this can simplify authorization logic.
Add custom claim types with application-specific information about the user. For example, you might
store some information about the user in a database. You could add a custom claim with this information to
the authentication ticket. The claim is stored in a cookie, so you only need to get it from the database once
per login session. On the other hand, you also want to avoid creating excessively large cookies, so you need
to consider the trade-off between cookie size versus database lookups.
After the authentication flow is complete, the claims are available in HttpContext.User . At that point, you should
treat them as a read-only collection—for example, using them to make authorization decisions.

Issuer validation
In OpenID Connect, the issuer claim ("iss") identifies the IDP that issued the ID token. Part of the OIDC
authentication flow is to verify that the issuer claim matches the actual issuer. The OIDC middleware handles this
for you.
In Azure AD, the issuer value is unique per AD tenant ( https://sts.windows.net/<tenantID> ). Therefore, an
application should do an additional check, to make sure the issuer represents a tenant that is allowed to sign in to
the app.
For a single-tenant application, you can just check that the issuer is your own tenant. In fact, the OIDC middleware
does this automatically by default. In a multitenant app, you need to allow for multiple issuers, corresponding to
the different tenants. Here is a general approach to use:
In the OIDC middleware options, set ValidateIssuer to false. This turns off the automatic check.
When a tenant signs up, store the tenant and the issuer in your user DB.
Whenever a user signs in, look up the issuer in the database. If the issuer isn't found, it means that tenant hasn't
signed up. You can redirect them to a sign up page.
You could also block certain tenants; for example, for customers that didn't pay their subscription.
For a more detailed discussion, see Sign-up and tenant onboarding in a multitenant application.

Using claims for authorization


With claims, a user's identity is no longer a monolithic entity. For example, a user might have an email address,
phone number, birthday, gender, etc. Maybe the user's IDP stores all of this information. But when you authenticate
the user, you'll typically get a subset of these as claims. In this model, the user's identity is simply a bundle of
claims. When you make authorization decisions about a user, you will look for particular sets of claims. In other
words, the question "Can user X perform action Y?" ultimately becomes "Does user X have claim Z?"
Here are some basic patterns for checking claims.
To check that the user has a particular claim with a particular value:

if (User.HasClaim(ClaimTypes.Role, "Admin")) { ... }

This code checks whether the user has a Role claim with the value "Admin". It correctly handles the case
where the user has no Role claim or multiple Role claims.
The ClaimTypes class defines constants for commonly used claim types. However, you can use any string
value for the claim type.
To get a single value for a claim type, when you expect there to be at most one value:

string email = User.FindFirst(ClaimTypes.Email)?.Value;

To get all the values for a claim type:

IEnumerable<Claim> groups = User.FindAll("groups");

For more information, see Role-based and resource-based authorization in multitenant applications.
Next
Tenant sign-up and onboarding

Sample code
This article describes how to implement a sign-up process in a multitenant application, which allows a customer to
sign up their organization for your application.
There are several reasons to implement a sign-up process:
Allow an AD admin to consent for the customer's entire organization to use the application.
Collect credit card payment or other customer information.
Perform any one-time per-tenant setup needed by your application.

Admin consent and Azure AD permissions


In order to authenticate with Azure AD, an application needs access to the user's directory. At a minimum, the
application needs permission to read the user's profile. The first time that a user signs in, Azure AD shows a
consent page that lists the permissions being requested. By clicking Accept , the user grants permission to the
application.
By default, consent is granted on a per-user basis. Every user who signs in sees the consent page. However, Azure
AD also supports admin consent, which allows an AD administrator to consent for an entire organization.
When the admin consent flow is used, the consent page states that the AD admin is granting permission on behalf
of the entire tenant:

After the admin clicks Accept , other users within the same tenant can sign in, and Azure AD will skip the consent
screen.
Only an AD administrator can give admin consent, because it grants permission on behalf of the entire
organization. If a non-administrator tries to authenticate with the admin consent flow, Azure AD displays an error:
If the application requires additional permissions at a later point, the customer will need to sign up again and
consent to the updated permissions.

Implementing tenant sign-up


For the Tailspin Surveys application, we defined several requirements for the sign-up process:
A tenant must sign up before users can sign in.
Sign-up uses the admin consent flow.
Sign-up adds the user's tenant to the application database.
After a tenant signs up, the application shows an onboarding page.
In this section, we'll walk through our implementation of the sign-up process. It's important to understand that
"sign up" versus "sign in" is an application concept. During the authentication flow, Azure AD does not inherently
know whether the user is in the process of signing up. It's up to the application to keep track of the context.
When an anonymous user visits the Surveys application, the user is shown two buttons, one to sign in, and one to
"enroll your company" (sign up).

These buttons invoke actions in the AccountController class.


The SignIn action returns a ChallengeResult , which causes the OpenID Connect middleware to redirect to the
authentication endpoint. This is the default way to trigger authentication in ASP.NET Core.
[AllowAnonymous]
public IActionResult SignIn()
{
return new ChallengeResult(
OpenIdConnectDefaults.AuthenticationScheme,
new AuthenticationProperties
{
IsPersistent = true,
RedirectUri = Url.Action("SignInCallback", "Account")
});
}

Now compare the SignUp action:

[AllowAnonymous]
public IActionResult SignUp()
{
var state = new Dictionary<string, string> { { "signup", "true" }};
return new ChallengeResult(
OpenIdConnectDefaults.AuthenticationScheme,
new AuthenticationProperties(state)
{
RedirectUri = Url.Action(nameof(SignUpCallback), "Account")
});
}

Like SignIn , the SignUp action also returns a ChallengeResult . But this time, we add a piece of state information
to the AuthenticationProperties in the ChallengeResult :
signup: A Boolean flag, indicating that the user has started the sign-up process.
The state information in AuthenticationProperties gets added to the OpenID Connect state parameter, which
round trips during the authentication flow.

After the user authenticates in Azure AD and gets redirected back to the application, the authentication ticket
contains the state. We are using this fact to make sure the "signup" value persists across the entire authentication
flow.

Adding the admin consent prompt


In Azure AD, the admin consent flow is triggered by adding a "prompt" parameter to the query string in the
authentication request:
/authorize?prompt=admin_consent&...

The Surveys application adds the prompt during the RedirectToAuthenticationEndpoint event. This event is called
right before the middleware redirects to the authentication endpoint.

public override Task RedirectToAuthenticationEndpoint(RedirectContext context)
{
if (context.IsSigningUp())
{
context.ProtocolMessage.Prompt = "admin_consent";
}

_logger.RedirectToIdentityProvider();
return Task.FromResult(0);
}

Setting ProtocolMessage.Prompt tells the middleware to add the "prompt" parameter to the authentication request.
Note that the prompt is only needed during sign-up. Regular sign-in should not include it. To distinguish between
them, we check for the signup value in the authentication state. The following extension method checks for this
condition:

internal static bool IsSigningUp(this BaseControlContext context)
{
Guard.ArgumentNotNull(context, nameof(context));

string signupValue;
// Look for the "signup" value in the authentication ticket properties
if ((context.Ticket == null) ||
(!context.Ticket.Properties.Items.TryGetValue("signup", out signupValue)))
{
return false;
}

// We have found the value, so see if it's valid


bool isSigningUp;
if (!bool.TryParse(signupValue, out isSigningUp))
{
// The value for signup is not a valid boolean, throw

throw new InvalidOperationException($"'{signupValue}' is an invalid boolean value");


}

return isSigningUp;
}

Registering a tenant
The Surveys application stores some information about each tenant and user in the application database.
In the Tenant table, IssuerValue is the value of the issuer claim for the tenant. For Azure AD, this is
https://sts.windows.net/<tenantID> and gives a unique value per tenant.
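As a rough sketch, a tenant record only needs enough to identify the issuer; the shape below is inferred from how SignUpTenantAsync populates it later in this article, and the sample's actual schema may differ:

public class Tenant
{
    public int Id { get; set; }                  // database key
    public string IssuerValue { get; set; }      // e.g. https://sts.windows.net/<tenantID>/
    public DateTimeOffset Created { get; set; }  // when the tenant signed up
}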

When a new tenant signs up, the Surveys application writes a tenant record to the database. This happens inside
the AuthenticationValidated event. (Don't do it before this event, because the ID token won't be validated yet, so
you can't trust the claim values. See Authentication.)
Here is the relevant code from the Surveys application:

public override async Task TokenValidated(TokenValidatedContext context)
{
var principal = context.AuthenticationTicket.Principal;
var userId = principal.GetObjectIdentifierValue();
var tenantManager = context.HttpContext.RequestServices.GetService<TenantManager>();
var userManager = context.HttpContext.RequestServices.GetService<UserManager>();
var issuerValue = principal.GetIssuerValue();
_logger.AuthenticationValidated(userId, issuerValue);

// Normalize the claims first.


NormalizeClaims(principal);
var tenant = await tenantManager.FindByIssuerValueAsync(issuerValue)
.ConfigureAwait(false);

if (context.IsSigningUp())
{
if (tenant == null)
{
tenant = await SignUpTenantAsync(context, tenantManager)
.ConfigureAwait(false);
}

// In this case, we need to go ahead and set up the user who is signing up.
await CreateOrUpdateUserAsync(context.Ticket, userManager, tenant)
.ConfigureAwait(false);
}
else
{
if (tenant == null)
{
_logger.UnregisteredUserSignInAttempted(userId, issuerValue);
throw new SecurityTokenValidationException($"Tenant {issuerValue} is not registered");
}

await CreateOrUpdateUserAsync(context.Ticket, userManager, tenant)


.ConfigureAwait(false);
}
}

This code does the following:


1. Check if the tenant's issuer value is already in the database. If the tenant has not signed up,
FindByIssuerValueAsync returns null.
2. If the user is signing up:
a. Add the tenant to the database ( SignUpTenantAsync ).
b. Add the authenticated user to the database ( CreateOrUpdateUserAsync ).
3. Otherwise complete the normal sign-in flow:
a. If the tenant's issuer was not found in the database, it means the tenant is not registered, and the
customer needs to sign up. In that case, throw an exception to cause the authentication to fail.
b. Otherwise, create a database record for this user, if there isn't one already ( CreateOrUpdateUserAsync ).

Here is the SignUpTenantAsync method that adds the tenant to the database.
private async Task<Tenant> SignUpTenantAsync(BaseControlContext context, TenantManager tenantManager)
{
Guard.ArgumentNotNull(context, nameof(context));
Guard.ArgumentNotNull(tenantManager, nameof(tenantManager));

var principal = context.Ticket.Principal;


var issuerValue = principal.GetIssuerValue();
var tenant = new Tenant
{
IssuerValue = issuerValue,
Created = DateTimeOffset.UtcNow
};

try
{
await tenantManager.CreateAsync(tenant)
.ConfigureAwait(false);
}
catch(Exception ex)
{
_logger.SignUpTenantFailed(principal.GetObjectIdentifierValue(), issuerValue, ex);
throw;
}

return tenant;
}

Here is a summary of the entire sign-up flow in the Surveys application:


1. The user clicks the Sign Up button.
2. The AccountController.SignUp action returns a challenge result. The authentication state includes the "signup"
value.
3. In the RedirectToAuthenticationEndpoint event, add the admin_consent prompt.
4. The OpenID Connect middleware redirects to Azure AD and the user authenticates.
5. In the AuthenticationValidated event, look for the "signup" state.
6. Add the tenant to the database.
Next
Application roles

Sample code
Application roles are used to assign permissions to users. For example, the Tailspin Surveys application defines the
following roles:
Administrator. Can perform all CRUD operations on any survey that belongs to that tenant.
Creator. Can create new surveys.
Reader. Can read any surveys that belong to that tenant.
You can see that roles ultimately get translated into permissions, during authorization. But the first question is how
to assign and manage roles. We identified three main options:
Azure AD App Roles
Azure AD security groups
Application role manager.

Roles using Azure AD App Roles


This is the approach that we used in the Tailspin Surveys app.
In this approach, the SaaS provider defines the application roles by adding them to the application manifest. After
a customer signs up, an admin for the customer's AD directory assigns users to the roles. When a user signs in, the
user's assigned roles are sent as claims.

NOTE
If the customer has Azure AD Premium, the admin can assign a security group to a role, and user members of the group will
inherit the app role. This is a convenient way to manage roles, because the group owner doesn't need to be an admin or app
owner.

Advantages of this approach:


Simple programming model.
Roles are specific to the application. The role claims for one application are not sent to another application.
If the customer removes the application from their Azure AD tenant, the roles go away.
The application doesn't need any extra Azure AD permissions, other than reading the user's profile.
Drawbacks:
Customers without Azure AD Premium cannot assign app roles to security groups. For these customers, all app
role assignments to users must be done individually, by an administrator or an owner of the app.
If you have a backend web API which is separate from the web app, the app role assignments for the web app
don't apply to the web API. For more discussion of this point, see Securing a backend web API.
Implementation
Define the roles. The SaaS provider declares the app roles in the application manifest. For example, here is the
manifest entry for the Surveys app:
"appRoles": [
{
"allowedMemberTypes": [
"User"
],
"description": "Creators can create Surveys",
"displayName": "SurveyCreator",
"id": "1b4f816e-5eaf-48b9-8613-7923830595ad",
"isEnabled": true,
"value": "SurveyCreator"
},
{
"allowedMemberTypes": [
"User"
],
"description": "Administrators can manage the Surveys in their tenant",
"displayName": "SurveyAdmin",
"id": "c20e145e-5459-4a6c-a074-b942bbd4cfe1",
"isEnabled": true,
"value": "SurveyAdmin"
}
],

The value property appears in the role claim. The id property is the unique identifier for the defined role. Always
generate a new GUID value for id .
Assign users . When a new customer signs up, the application is registered in the customer's Azure AD tenant. At
this point, an Azure AD admin for that tenant or an app owner (under Enterprise apps) can assign app roles to
users.

NOTE
As noted earlier, customers with Azure AD Premium can also assign app roles to security groups.

The following screenshot from the Azure portal shows users and groups for the Survey application. Admin and
Creator are groups, assigned the SurveyAdmin and SurveyCreator app roles, respectively. Alice is a user who was
assigned the SurveyAdmin app role directly. Bob and Charles are users that have not been directly assigned an app
role.

As shown in the following screenshot, Charles is part of the Admin group, so he inherits the SurveyAdmin role. In
the case of Bob, he has not been assigned an app role yet.
NOTE
An alternative approach is for the application to assign app roles programmatically, using the Azure AD Graph API. However,
this requires the application to obtain write permissions for the customer's Azure AD directory, which is a high privilege that
is usually unnecessary.

Get role claims . When a user signs in, the application receives the user's assigned role(s) in a claim with type
http://schemas.microsoft.com/ws/2008/06/identity/claims/role (the roles claim in a JWT token).

A user can be assigned multiple roles, or no role. In your authorization code, don't assume the user has exactly one
role claim. Instead, write code that checks whether a particular claim value is present:

if (context.User.HasClaim(ClaimTypes.Role, "Admin")) { ... }

Roles using Azure AD security groups


In this approach, roles are represented as Azure AD security groups. The application assigns permissions to users
based on their security group memberships.
Advantages:
For customers who do not have Azure AD Premium, this approach enables the customer to use security groups
to manage role assignments.
Disadvantages:
Complexity. Because every tenant sends different group claims, the app must keep track of which security
groups correspond to which application roles, for each tenant.
As users belong to more groups, access tokens grow to include more claims. After a certain limit, Azure AD
includes an "overage" claim to limit the token size; see Microsoft identity platform access tokens. Application
roles avoid this issue because they are scoped to the specific application.
Implementation
In the application manifest, set the groupMembershipClaims property to "SecurityGroup". This is needed to get group
membership claims from Azure AD.

{
// ...
"groupMembershipClaims": "SecurityGroup",
}

When a new customer signs up, the application instructs the customer to create security groups for the roles
needed by the application. The customer then needs to enter the group object IDs into the application. The
application stores these in a table that maps group IDs to application roles, per tenant.

NOTE
Alternatively, the application could create the groups programmatically, using the Microsoft Graph API. This could be less
error prone, but requires the application to obtain privileged read/write permissions for the customer's directory. Many
customers might be unwilling to grant this level of access.

When a user signs in:


1. The application receives the user's groups as claims. The value of each claim is the object ID of a group.
2. Azure AD limits the number of groups sent in the token. If the number of groups exceeds this limit, Azure AD
sends a special "overage" claim. If that claim is present, the application must query the Azure AD Graph API to
get all of the groups to which that user belongs. For details, see Authorization in Cloud Applications using AD
Groups, under the section titled "Groups claim overage".
3. The application looks up the object IDs in its own database, to find the corresponding application roles to assign
to the user.
4. The application adds a custom claim value to the user principal that expresses the application role. For example:
survey_role = "SurveyAdmin".

Authorization policies should use the custom role claim, not the group claim.
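To make steps 3 and 4 concrete, here is a rough sketch of translating group claims into an application role claim. The roleMappingStore service and the survey_role claim name are assumptions for illustration, not the Surveys implementation:

// Map Azure AD group object IDs (from the "groups" claims) to app roles stored per tenant.
var identity = (ClaimsIdentity)principal.Identity;
var groupIds = principal.FindAll("groups").Select(c => c.Value);

// roleMappingStore is a hypothetical service backed by the per-tenant groupId-to-role table.
foreach (string role in await roleMappingStore.GetRolesForGroupsAsync(tenantId, groupIds))
{
    identity.AddClaim(new Claim("survey_role", role)); // authorization policies check this claim
}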

Roles using an application role manager


With this approach, application roles are not stored in Azure AD at all. Instead, the application stores the role
assignments for each user in its own DB — for example, using the RoleManager class in ASP.NET Identity.
Advantages:
The app has full control over the roles and user assignments.
Drawbacks:
More complex, harder to maintain.
Cannot use Azure AD security groups to manage role assignments.
Stores user information in the application database, where it can get out of sync with the tenant's Azure AD
directory, as users are added or removed.
Next
Role-based and resource-based authorization

Sample code
Our reference implementation is an ASP.NET Core application. In this article we'll look at two general approaches
to authorization, using the authorization APIs provided in ASP.NET Core.
Role-based authorization . Authorizing an action based on the roles assigned to a user. For example, some
actions require an administrator role.
Resource-based authorization . Authorizing an action based on a particular resource. For example, every
resource has an owner. The owner can delete the resource; other users cannot.
A typical app will employ a mix of both. For example, to delete a resource, the user must be the resource owner or
an admin.

Role-based authorization
The Tailspin Surveys application defines the following roles:
Administrator. Can perform all CRUD operations on any survey that belongs to that tenant.
Creator. Can create new surveys
Reader. Can read any surveys that belong to that tenant
Roles apply to users of the application. In the Surveys application, a user is either an administrator, creator, or
reader.
For a discussion of how to define and manage roles, see Application roles.
Regardless of how you manage the roles, your authorization code will look similar. ASP.NET Core has an
abstraction called authorization policies. With this feature, you define authorization policies in code, and then apply
those policies to controller actions. The policy is decoupled from the controller.
Create policies
To define a policy, first create a class that implements IAuthorizationRequirement . It's easiest to derive from
AuthorizationHandler . In the HandleRequirementAsync method, examine the relevant claim(s).

Here is an example from the Tailspin Surveys application:

public class SurveyCreatorRequirement : AuthorizationHandler<SurveyCreatorRequirement>, IAuthorizationRequirement
{
protected override Task HandleRequirementAsync(AuthorizationHandlerContext context,
SurveyCreatorRequirement requirement)
{
if (context.User.HasClaim(ClaimTypes.Role, Roles.SurveyAdmin) ||
context.User.HasClaim(ClaimTypes.Role, Roles.SurveyCreator))
{
context.Succeed(requirement);
}
return Task.FromResult(0);
}
}
This class defines the requirement for a user to create a new survey. The user must be in the SurveyAdmin or
SurveyCreator role.
In your startup class, define a named policy that includes one or more requirements. If there are multiple
requirements, the user must meet every requirement to be authorized. The following code defines two policies:

services.AddAuthorization(options =>
{
options.AddPolicy(PolicyNames.RequireSurveyCreator,
policy =>
{
policy.AddRequirements(new SurveyCreatorRequirement());
policy.RequireAuthenticatedUser(); // Adds DenyAnonymousAuthorizationRequirement
// By adding the CookieAuthenticationDefaults.AuthenticationScheme, if an authenticated
// user is not in the appropriate role, they will be redirected to a "forbidden" page.
policy.AddAuthenticationSchemes(CookieAuthenticationDefaults.AuthenticationScheme);
});

options.AddPolicy(PolicyNames.RequireSurveyAdmin,
policy =>
{
policy.AddRequirements(new SurveyAdminRequirement());
policy.RequireAuthenticatedUser();
policy.AddAuthenticationSchemes(CookieAuthenticationDefaults.AuthenticationScheme);
});
});

This code also sets the authentication scheme, which tells ASP.NET which authentication middleware should run if
authorization fails. In this case, we specify the cookie authentication middleware, because the cookie authentication
middleware can redirect the user to a "Forbidden" page. The location of the Forbidden page is set in the
AccessDeniedPath option for the cookie middleware; see Configuring the authentication middleware.

Authorize controller actions


Finally, to authorize an action in an MVC controller, set the policy in the Authorize attribute:

[Authorize(Policy = PolicyNames.RequireSurveyCreator)]
public IActionResult Create()
{
var survey = new SurveyDTO();
return View(survey);
}

In earlier versions of ASP.NET, you would set the Roles property on the attribute:

// old way
[Authorize(Roles = "SurveyCreator")]

This is still supported in ASP.NET Core, but it has some drawbacks compared with authorization policies:
It assumes a particular claim type. Policies can check for any claim type. Roles are just a type of claim.
The role name is hard-coded into the attribute. With policies, the authorization logic is all in one place, making
it easier to update or even load from configuration settings.
Policies enable more complex authorization decisions (for example, age >= 21) that can't be expressed by
simple role membership.
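As an illustration of that last point, a policy can evaluate an arbitrary assertion over the user's claims. The claim name and the age rule below are assumptions, not part of the Surveys app:

services.AddAuthorization(options =>
{
    options.AddPolicy("Over21", policy =>
        policy.RequireAssertion(context =>
        {
            var birthdate = context.User.FindFirst("birthdate")?.Value;
            return DateTime.TryParse(birthdate, out var dob)
                && dob <= DateTime.UtcNow.AddYears(-21);
        }));
});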

Resource-based authorization
Resource-based authorization occurs whenever the authorization depends on a specific resource that will be
affected by an operation. In the Tailspin Surveys application, every survey has an owner and zero-to-many
contributors.
The owner can read, update, delete, publish, and unpublish the survey.
The owner can assign contributors to the survey.
Contributors can read and update the survey.
Note that "owner" and "contributor" are not application roles; they are stored per survey, in the application
database. To check whether a user can delete a survey, for example, the app checks whether the user is the owner
for that survey.
In ASP.NET Core, implement resource-based authorization by deriving from AuthorizationHandler and
overriding the HandleRequirementAsync method.

public class SurveyAuthorizationHandler : AuthorizationHandler<OperationAuthorizationRequirement, Survey>
{
    protected override Task HandleRequirementAsync(AuthorizationHandlerContext context,
        OperationAuthorizationRequirement operation, Survey resource)
    {
        // Examine the operation and the resource, then call context.Succeed(operation) when authorized.
        return Task.FromResult(0);
    }
}

Notice that this class is strongly typed for Survey objects. Register the class for DI on startup:

services.AddSingleton<IAuthorizationHandler>(factory =>
{
return new SurveyAuthorizationHandler();
});

To perform authorization checks, use the IAuthorizationService interface, which you can inject into your
controllers. The following code checks whether a user can read a survey:

if (await _authorizationService.AuthorizeAsync(User, survey, Operations.Read) == false)
{
return StatusCode(403);
}

Because we pass in a Survey object, this call will invoke the SurveyAuthorizationHandler .
In your authorization code, a good approach is to aggregate all of the user's role-based and resource-based
permissions, then check the aggregate set against the desired operation. Here is an example from the Surveys app.
The application defines several permission types:
Admin
Contributor
Creator
Owner
Reader
The application also defines a set of possible operations on surveys:
Create
Read
Update
Delete
Publish
Unpublish
The following code creates a list of permissions for a particular user and survey. Notice that this code looks at both
the user's app roles, and the owner/contributor fields in the survey.

public class SurveyAuthorizationHandler : AuthorizationHandler<OperationAuthorizationRequirement, Survey>
{
protected override Task HandleRequirementAsync(AuthorizationHandlerContext context,
OperationAuthorizationRequirement requirement, Survey resource)
{
var permissions = new List<UserPermissionType>();
int surveyTenantId = context.User.GetSurveyTenantIdValue();
int userId = context.User.GetSurveyUserIdValue();
string user = context.User.GetUserName();

if (resource.TenantId == surveyTenantId)
{
// Admin can do anything, as long as the resource belongs to the admin's tenant.
if (context.User.HasClaim(ClaimTypes.Role, Roles.SurveyAdmin))
{
context.Succeed(requirement);
return Task.FromResult(0);
}

if (context.User.HasClaim(ClaimTypes.Role, Roles.SurveyCreator))
{
permissions.Add(UserPermissionType.Creator);
}
else
{
permissions.Add(UserPermissionType.Reader);
}

if (resource.OwnerId == userId)
{
permissions.Add(UserPermissionType.Owner);
}
}
if (resource.Contributors != null && resource.Contributors.Any(x => x.UserId == userId))
{
permissions.Add(UserPermissionType.Contributor);
}

if (ValidateUserPermissions[requirement](permissions))
{
context.Succeed(requirement);
}
return Task.FromResult(0);
}
}

In a multitenant application, you must ensure that permissions don't "leak" to another tenant's data. In the Surveys
app, the Contributor permission is allowed across tenants—you can assign someone from another tenant as a
contributor. The other permission types are restricted to resources that belong to that user's tenant. To enforce this
requirement, the code checks the tenant ID before granting the permission. (The TenantId field is assigned when
the survey is created.)
The next step is to check the operation (such as read, update, or delete) against the permissions. The Surveys app
implements this step by using a lookup table of functions:
static readonly Dictionary<OperationAuthorizationRequirement, Func<List<UserPermissionType>, bool>>
    ValidateUserPermissions
    = new Dictionary<OperationAuthorizationRequirement, Func<List<UserPermissionType>, bool>>
{
    { Operations.Create, x => x.Contains(UserPermissionType.Creator) },

    { Operations.Read, x => x.Contains(UserPermissionType.Creator) ||
                            x.Contains(UserPermissionType.Reader) ||
                            x.Contains(UserPermissionType.Contributor) ||
                            x.Contains(UserPermissionType.Owner) },

    { Operations.Update, x => x.Contains(UserPermissionType.Contributor) ||
                              x.Contains(UserPermissionType.Owner) },

    { Operations.Delete, x => x.Contains(UserPermissionType.Owner) },

    { Operations.Publish, x => x.Contains(UserPermissionType.Owner) },

    { Operations.UnPublish, x => x.Contains(UserPermissionType.Owner) }
};

Next
Secure a backend web API for multitenant
applications

Sample code
The Tailspin Surveys application uses a backend web API to manage CRUD operations on surveys. For example,
when a user clicks "My Surveys", the web application sends an HTTP request to the web API:

GET /users/{userId}/surveys

The web API returns a JSON object:

{
"Published":[],
"Own":[
{"Id":1,"Title":"Survey 1"},
{"Id":3,"Title":"Survey 3"},
],
"Contribute": [{"Id":8,"Title":"My survey"}]
}

The web API does not allow anonymous requests, so the web app must authenticate itself using OAuth 2 bearer
tokens.

NOTE
This is a server-to-server scenario. The application does not make any AJAX calls to the API from the browser client.

There are two main approaches you can take:


Delegated user identity. The web application authenticates with the user's identity.
Application identity. The web application authenticates with its client ID, using OAuth 2 client credential flow.
The Tailspin application implements delegated user identity. Here are the main differences:
Delegated user identity:
The bearer token sent to the web API contains the user identity.
The web API makes authorization decisions based on the user identity.
The web application needs to handle 403 (Forbidden) errors from the web API, if the user is not authorized to
perform an action.
Typically, the web application still makes some authorization decisions that affect the UI, such as showing or hiding
UI elements.
The web API can potentially be used by untrusted clients, such as a JavaScript application or a native client
application.
Application identity:
The web API does not get information about the user.
The web API cannot perform any authorization based on the user identity. All authorization decisions are made
by the web application.
The web API cannot be used by an untrusted client (JavaScript or native client application).
This approach may be somewhat simpler to implement, because there is no authorization logic in the Web API.
In either approach, the web application must get an access token, which is the credential needed to call the web
API.
For delegated user identity, the token has to come from an identity provider (IDP), such as Azure Active
Directory, which can issue a token on behalf of the user.
For client credentials, an application might get the token from the IDP or host its own token server. (But don't
write a token server from scratch; use a well-tested framework like IdentityServer4.) If you authenticate with
Azure AD, it's strongly recommended to get the access token from Azure AD, even with client credential flow.
The rest of this article assumes the application is authenticating with Azure AD.
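For comparison, here is a minimal sketch of the application identity option using ADAL's client credential flow. The variable names mirror the delegated example later in this article and are assumptions:

// App-only access token via OAuth 2 client credentials (no user identity in the token).
var authContext = new AuthenticationContext("https://login.microsoftonline.com/" + tenantID);
var credential = new ClientCredential(clientId, clientSecret);

AuthenticationResult result = await authContext.AcquireTokenAsync(resourceID, credential);
string accessToken = result.AccessToken; // send as a bearer token to the web API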

A diagram that shows the web application requesting an access token from Azure AD and sending the token to the
web API.

Register the web API in Azure AD


In order for Azure AD to issue a bearer token for the web API, you need to configure some things in Azure AD.
1. Register the web API in Azure AD.
2. Add the client ID of the web app to the web API application manifest, in the knownClientApplications
property. See the GitHub readme for more information.
3. Give the web application permission to call the web API. In the Azure portal, you can set two types of
permissions: "Application Permissions" for application identity (client credential flow), or "Delegated
Permissions" for delegated user identity.

A screenshot of the Azure portal that shows the application permissions and delegated permissions.
Getting an access token
Before calling the web API, the web application gets an access token from Azure AD. In a .NET application, use the
Azure AD Authentication Library (ADAL) for .NET.
In the OAuth 2 authorization code flow, the application exchanges an authorization code for an access token. The
following code uses ADAL to get the access token. This code is called during the AuthorizationCodeReceived event.

// The OpenID Connect middleware sends this event when it gets the authorization code.
public override async Task AuthorizationCodeReceived(AuthorizationCodeReceivedContext context)
{
    string authorizationCode = context.ProtocolMessage.Code;
    string authority = "https://login.microsoftonline.com/" + tenantID;
    string resourceID = "https://tailspin.onmicrosoft.com/surveys.webapi"; // App ID URI
    ClientCredential credential = new ClientCredential(clientId, clientSecret);

    AuthenticationContext authContext = new AuthenticationContext(authority, tokenCache);
    AuthenticationResult authResult = await authContext.AcquireTokenByAuthorizationCodeAsync(
        authorizationCode, new Uri(redirectUri), credential, resourceID);

    // If successful, the token is in authResult.AccessToken
}

Here are the various parameters that are needed:


authority . Derived from the tenant ID of the signed-in user (not the tenant ID of the SaaS provider).
authorizationCode . The authorization code that you got back from the IDP.
clientId . The web application's client ID.
clientSecret . The web application's client secret.
redirectUri . The redirect URI that you set for OpenID Connect. This is where the IDP calls back with the token.
resourceID . The App ID URI of the web API, which you created when you registered the web API in Azure AD.
tokenCache . An object that caches the access tokens. See Token caching.

If AcquireTokenByAuthorizationCodeAsync succeeds, ADAL caches the token. Later, you can get the token from the
cache by calling AcquireTokenSilentAsync:

AuthenticationContext authContext = new AuthenticationContext(authority, tokenCache);
var result = await authContext.AcquireTokenSilentAsync(resourceID, credential, new UserIdentifier(userId,
    UserIdentifierType.UniqueId));

where userId is the user's object ID, which is found in the
http://schemas.microsoft.com/identity/claims/objectidentifier claim.

Using the access token to call the web API


Once you have the token, send it in the Authorization header of the HTTP requests to the web API.

Authorization: Bearer xxxxxxxxxx

The following extension method from the Surveys application sets the Authorization header on an HTTP request,
using the HttpClient class.
public static async Task<HttpResponseMessage> SendRequestWithBearerTokenAsync(this HttpClient httpClient,
HttpMethod method, string path, object requestBody, string accessToken, CancellationToken ct)
{
var request = new HttpRequestMessage(method, path);
if (requestBody != null)
{
var json = JsonConvert.SerializeObject(requestBody, Formatting.None);
var content = new StringContent(json, Encoding.UTF8, "application/json");
request.Content = content;
}

request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", accessToken);


request.Headers.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));

var response = await httpClient.SendAsync(request, ct);


return response;
}
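A possible call site for this helper, assuming an HttpClient whose BaseAddress points at the web API; the path matches the "My Surveys" request shown earlier:

HttpResponseMessage response = await httpClient.SendRequestWithBearerTokenAsync(
    HttpMethod.Get, $"/users/{userId}/surveys", requestBody: null, accessToken, CancellationToken.None);

response.EnsureSuccessStatusCode();
string json = await response.Content.ReadAsStringAsync();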

Authenticating in the web API


The web API has to authenticate the bearer token. In ASP.NET Core, you can use the
Microsoft.AspNetCore.Authentication.JwtBearer package. This package provides middleware that enables the
application to receive OpenID Connect bearer tokens.
Register the middleware in your web API Startup class.

public void Configure(IApplicationBuilder app, IHostingEnvironment env, ApplicationDbContext dbContext,
    ILoggerFactory loggerFactory)
{
// ...

app.UseJwtBearerAuthentication(new JwtBearerOptions {
Audience = configOptions.AzureAd.WebApiResourceId,
Authority = Constants.AuthEndpointPrefix,
TokenValidationParameters = new TokenValidationParameters {
ValidateIssuer = false
},
Events= new SurveysJwtBearerEvents(loggerFactory.CreateLogger<SurveysJwtBearerEvents>())
});

// ...
}

Audience . Set this to the App ID URI for the web API, which you created when you registered the web API with
Azure AD.
Authority . For a multitenant application, set this to https://login.microsoftonline.com/common/ .
TokenValidationParameters . For a multitenant application, set ValidateIssuer to false. That means the
application is responsible for validating the issuer itself.
Events is a class that derives from JwtBearerEvents .
Issuer validation
Validate the token issuer in the JwtBearerEvents.TokenValidated event. The issuer is sent in the "iss" claim.
In the Surveys application, the web API doesn't handle tenant sign-up. Therefore, it just checks if the issuer is
already in the application database. If not, it throws an exception, which causes authentication to fail.
public override async Task TokenValidated(TokenValidatedContext context)
{
var principal = context.Ticket.Principal;
var tenantManager = context.HttpContext.RequestServices.GetService<TenantManager>();
var userManager = context.HttpContext.RequestServices.GetService<UserManager>();
var issuerValue = principal.GetIssuerValue();
var tenant = await tenantManager.FindByIssuerValueAsync(issuerValue);

if (tenant == null)
{
// The caller was not from a trusted issuer. Throw to block the authentication flow.
throw new SecurityTokenValidationException();
}

var identity = principal.Identities.First();

// Add new claim for survey_userid


var registeredUser = await userManager.FindByObjectIdentifier(principal.GetObjectIdentifierValue());
identity.AddClaim(new Claim(SurveyClaimTypes.SurveyUserIdClaimType, registeredUser.Id.ToString()));
identity.AddClaim(new Claim(SurveyClaimTypes.SurveyTenantIdClaimType,
registeredUser.TenantId.ToString()));

// Add new claim for Email


var email = principal.FindFirst(ClaimTypes.Upn)?.Value;
if (!string.IsNullOrWhiteSpace(email))
{
identity.AddClaim(new Claim(ClaimTypes.Email, email));
}
}

As this example shows, you can also use the TokenValidated event to modify the claims. Remember that the
claims come directly from Azure AD. If the web application modifies the claims that it gets, those changes won't
show up in the bearer token that the web API receives. For more information, see Claims transformations.

Authorization
For a general discussion of authorization, see Role-based and resource-based authorization.
The JwtBearer middleware handles the authorization responses. For example, to restrict a controller action to
authenticated users, use the [Authorize] attribute and specify JwtBearerDefaults.AuthenticationScheme as
the authentication scheme:

[Authorize(ActiveAuthenticationSchemes = JwtBearerDefaults.AuthenticationScheme)]

This returns a 401 status code if the user is not authenticated.


To restrict a controller action by authorization policy, specify the policy name in the [Authorize] attribute:

[Authorize(Policy = PolicyNames.RequireSurveyCreator)]

This returns a 401 status code if the user is not authenticated, and 403 if the user is authenticated but not
authorized. Register the policy on startup:
public void ConfigureServices(IServiceCollection services)
{
services.AddAuthorization(options =>
{
options.AddPolicy(PolicyNames.RequireSurveyCreator,
policy =>
{
policy.AddRequirements(new SurveyCreatorRequirement());
policy.RequireAuthenticatedUser(); // Adds DenyAnonymousAuthorizationRequirement
policy.AddAuthenticationSchemes(JwtBearerDefaults.AuthenticationScheme);
});
options.AddPolicy(PolicyNames.RequireSurveyAdmin,
policy =>
{
policy.AddRequirements(new SurveyAdminRequirement());
policy.RequireAuthenticatedUser(); // Adds DenyAnonymousAuthorizationRequirement
policy.AddAuthenticationSchemes(JwtBearerDefaults.AuthenticationScheme);
});
});

// ...
}

Protecting application secrets


It's common to have application settings that are sensitive and must be protected, such as:
Database connection strings
Passwords
Cryptographic keys
As a security best practice, you should never store these secrets in source control. It's too easy for them to leak —
even if your source code repository is private. And it's not just about keeping secrets from the general public. On
larger projects, you might want to restrict which developers and operators can access the production secrets.
(Settings for test or development environments are different.)
A more secure option is to store these secrets in Azure Key Vault. Key Vault is a cloud-hosted service for managing
cryptographic keys and other secrets. This article shows how to use Key Vault to store configuration settings for
your app.
In the Tailspin Surveys application, the following settings are secret:
The database connection string.
The Redis connection string.
The client secret for the web application.
The Surveys application loads configuration settings from the following places:
The appsettings.json file
The user secrets store (development environment only; for testing)
The hosting environment (app settings in Azure web apps)
Key Vault (when enabled)
Each of these overrides the previous one, so any settings stored in Key Vault take precedence.
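The layering described above might be wired up roughly like this. This is a sketch; the useKeyVault flag and the Key Vault credentials are placeholders, and the sample's actual startup code may differ:

// Later providers override earlier ones, so Key Vault wins when it's enabled.
var builder = new ConfigurationBuilder()
    .SetBasePath(env.ContentRootPath)
    .AddJsonFile("appsettings.json", optional: true)
    .AddUserSecrets<Startup>()      // development environment only
    .AddEnvironmentVariables();     // app settings in Azure web apps

if (useKeyVault)
{
    builder.AddAzureKeyVault(vaultUrl, clientId, clientSecret);
}

IConfiguration configuration = builder.Build();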
NOTE
By default, the Key Vault configuration provider is disabled. It's not needed for running the application locally. You would
enable it in a production deployment.

At startup, the application reads settings from every registered configuration provider, and uses them to populate
a strongly typed options object. For more information, see Using Options and configuration objects.
Next
Cache access tokens

Sample code
It's relatively expensive to get an OAuth access token, because it requires an HTTP request to the token endpoint.
Therefore, it's good to cache tokens whenever possible. The Azure AD Authentication Library (ADAL) automatically
caches tokens obtained from Azure AD, including refresh tokens.
ADAL provides a default token cache implementation. However, this token cache is intended for native client apps,
and is not suitable for web apps:
It is a static instance, and not thread safe.
It doesn't scale to large numbers of users, because tokens from all users go into the same dictionary.
It can't be shared across web servers in a farm.
Instead, you should implement a custom token cache that derives from the ADAL TokenCache class but is suitable
for a server environment and provides the desirable level of isolation between tokens for different users.
The TokenCache class stores a dictionary of tokens, indexed by issuer, resource, client ID, and user. A custom token
cache should write this dictionary to a backing store, such as a Redis cache.
In the Tailspin Surveys application, the DistributedTokenCache class implements the token cache. This
implementation uses the IDistributedCache abstraction from ASP.NET Core. That way, any IDistributedCache
implementation can be used as a backing store.
By default, the Surveys app uses a Redis cache.
For a single-instance web server, you could use the ASP.NET Core in-memory cache. (This is also a good option
for running the app locally during development.)
DistributedTokenCache stores the cache data as key/value pairs in the backing store. The key is the user ID plus
client ID, so the backing store holds separate cache data for each unique combination of user/client.
The backing store is partitioned by user. For each HTTP request, the tokens for that user are read from the backing
store and loaded into the TokenCache dictionary. If Redis is used as the backing store, every server instance in a
server farm reads/writes to the same cache, and this approach scales to many users.

Encrypting cached tokens


Tokens are sensitive data, because they grant access to a user's resources. (Moreover, unlike a user's password, you
can't just store a hash of the token.) Therefore, it's critical to protect tokens from being compromised. The Redis-
backed cache is protected by a password, but if someone obtains the password, they could get all of the cached
access tokens. For that reason, the DistributedTokenCache encrypts everything that it writes to the backing store.
Encryption is done using the ASP.NET Core data protection APIs.

NOTE
If you deploy to Azure Web Sites, the encryption keys are backed up to network storage and synchronized across all
machines (see Key management and lifetime). By default, keys are not encrypted when running in Azure Web Sites, but you
can enable encryption using an X.509 certificate.
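
How the data protection keys themselves are protected at rest is configurable at startup. For example (a minimal sketch, not taken from the sample, assuming a certificate is available to the app), you can encrypt the key ring with an X.509 certificate:

using System.Security.Cryptography.X509Certificates;
using Microsoft.AspNetCore.DataProtection;
using Microsoft.Extensions.DependencyInjection;

public static class DataProtectionSetup
{
    public static void Configure(IServiceCollection services, X509Certificate2 certificate)
    {
        // Data protection keys are encrypted with the certificate before being persisted.
        services.AddDataProtection()
            .ProtectKeysWithCertificate(certificate);
    }
}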

DistributedTokenCache implementation
The DistributedTokenCache class derives from the ADAL TokenCache class.
In the constructor, the DistributedTokenCache class creates a key for the current user and loads the cache from the
backing store:

public DistributedTokenCache(
    ClaimsPrincipal claimsPrincipal,
    IDistributedCache distributedCache,
    ILoggerFactory loggerFactory,
    IDataProtectionProvider dataProtectionProvider)
    : base()
{
    _claimsPrincipal = claimsPrincipal;
    _cacheKey = BuildCacheKey(_claimsPrincipal);
    _distributedCache = distributedCache;
    _logger = loggerFactory.CreateLogger<DistributedTokenCache>();
    _protector = dataProtectionProvider.CreateProtector(typeof(DistributedTokenCache).FullName);
    AfterAccess = AfterAccessNotification;
    LoadFromCache();
}

The key is created by concatenating the user ID and client ID. Both of these are taken from claims found in the
user's ClaimsPrincipal :

private static string BuildCacheKey(ClaimsPrincipal claimsPrincipal)
{
    string clientId = claimsPrincipal.FindFirstValue("aud", true);
    return string.Format(
        "UserId:{0}::ClientId:{1}",
        claimsPrincipal.GetObjectIdentifierValue(),
        clientId);
}

To load the cache data, read the serialized blob from the backing store, and call TokenCache.Deserialize to convert
the blob into cache data.
private void LoadFromCache()
{
    byte[] cacheData = _distributedCache.Get(_cacheKey);
    if (cacheData != null)
    {
        this.Deserialize(_protector.Unprotect(cacheData));
    }
}

Whenever ADAL accesses the cache, it fires an AfterAccess event. If the cache data has changed, the
HasStateChanged property is true. In that case, update the backing store to reflect the change, and then set
HasStateChanged to false.

public void AfterAccessNotification(TokenCacheNotificationArgs args)
{
    if (this.HasStateChanged)
    {
        try
        {
            if (this.Count > 0)
            {
                _distributedCache.Set(_cacheKey, _protector.Protect(this.Serialize()));
            }
            else
            {
                // There are no tokens for this user/client, so remove the item from the cache.
                _distributedCache.Remove(_cacheKey);
            }
            this.HasStateChanged = false;
        }
        catch (Exception exp)
        {
            _logger.WriteToCacheFailed(exp);
            throw;
        }
    }
}

TokenCache sends two other events:
BeforeWrite. Called immediately before ADAL writes to the cache. You can use this to implement a concurrency
strategy.
BeforeAccess. Called immediately before ADAL reads from the cache. Here you can reload the cache to get the
latest version.
In our case, we decided not to handle these two events.
For concurrency, last write wins. That's OK, because tokens are stored independently for each user + client, so a
conflict would only happen if the same user had two concurrent login sessions.
For reading, we load the cache on every request. Requests are short lived. If the cache gets modified in that
time, the next request will pick up the new value.
Next
Use client assertion to get access tokens from Azure
AD
11/2/2020 • 2 minutes to read

Sample code
This article describes how to add client assertion to the Tailspin Surveys sample application.

Understanding client assertion in OpenID Connect


When using authorization code flow or hybrid flow in OpenID Connect, the client exchanges an authorization code
for an access token. During this step, the client has to authenticate itself to the server.

One way to authenticate the client is by using a client secret. That's how the Tailspin Surveys application is
configured by default.
Here is an example request from the client to the IDP, requesting an access token. Note the client_secret
parameter.

POST https://login.microsoftonline.com/b9bd2162xxx/oauth2/token HTTP/1.1
Content-Type: application/x-www-form-urlencoded

resource=https://tailspin.onmicrosoft.com/surveys.webapi
&client_id=87df91dc-63de-4765-8701-b59cc8bd9e11
&client_secret=i3Bf12Dn...
&grant_type=authorization_code
&code=PG8wJG6Y...

The secret is just a string, so you have to make sure not to leak the value. The best practice is to keep the client
secret out of source control. When you deploy to Azure, store the secret in an app setting.
However, anyone with access to the Azure subscription can view the app settings. Furthermore, there is always a
temptation to check secrets into source control (for example, in deployment scripts), share them by email, and so
on.
For additional security, you can use client assertion instead of a client secret. With client assertion, the client uses
an X.509 certificate to prove the token request came from the client. The client certificate is installed on the web
server. Generally, it is easier to restrict access to the certificate than to ensure that nobody inadvertently
reveals a client secret. For more information about configuring certificates in a web app, see Using Certificates in
Azure Websites Applications.
Here is a token request using client assertion:

POST https://login.microsoftonline.com/b9bd2162xxx/oauth2/token HTTP/1.1
Content-Type: application/x-www-form-urlencoded

resource=https://tailspin.onmicrosoft.com/surveys.webapi
&client_id=87df91dc-63de-4765-8701-b59cc8bd9e11
&client_assertion_type=urn:ietf:params:oauth:client-assertion-type:jwt-bearer
&client_assertion=eyJhbGci...
&grant_type=authorization_code
&code=PG8wJG6Y...

Notice that the client_secret parameter is no longer used. Instead, the client_assertion parameter contains a
JWT token that was signed using the client certificate. The client_assertion_type parameter specifies the type of
assertion — in this case, JWT token. The server validates the JWT token. If the JWT token is invalid, the token
request returns an error.

NOTE
X.509 certificates are not the only form of client assertion; we focus on certificates here because they are supported by Azure AD.

At run time, the web application reads the certificate from the certificate store. The certificate must be installed on
the same machine as the web app.
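
As a minimal sketch of such a lookup (the sample encapsulates similar logic in its CertificateUtility helper, whose exact behavior may differ), a certificate can be located by thumbprint with the X509Store API:

using System.Security.Cryptography.X509Certificates;

public static class CertificateLookup
{
    // Finds a certificate by thumbprint in the given store, or returns null if none matches.
    public static X509Certificate2 FindByThumbprint(
        StoreName storeName, StoreLocation storeLocation, string thumbprint, bool validOnly)
    {
        using (var store = new X509Store(storeName, storeLocation))
        {
            store.Open(OpenFlags.ReadOnly);
            var matches = store.Certificates.Find(
                X509FindType.FindByThumbprint, thumbprint, validOnly);
            return matches.Count > 0 ? matches[0] : null;
        }
    }
}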

Implementing client assertion


The Surveys application includes a helper class that creates a ClientAssertionCertificate that you can pass to the
AuthenticationContext.AcquireTokenSilentAsync method to acquire a token from Azure AD.

public class CertificateCredentialService : ICredentialService
{
    private Lazy<Task<AdalCredential>> _credential;

    public CertificateCredentialService(IOptions<ConfigurationOptions> options)
    {
        var aadOptions = options.Value?.AzureAd;
        _credential = new Lazy<Task<AdalCredential>>(() =>
        {
            X509Certificate2 cert = CertificateUtility.FindCertificateByThumbprint(
                aadOptions.Asymmetric.StoreName,
                aadOptions.Asymmetric.StoreLocation,
                aadOptions.Asymmetric.CertificateThumbprint,
                aadOptions.Asymmetric.ValidationRequired);
            string password = null;
            var certBytes = CertificateUtility.ExportCertificateWithPrivateKey(cert, out password);
            return Task.FromResult(new AdalCredential(
                new ClientAssertionCertificate(aadOptions.ClientId, new X509Certificate2(certBytes, password))));
        });
    }

    public async Task<AdalCredential> GetCredentialsAsync()
    {
        return await _credential.Value;
    }
}
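
For context, here is a hedged sketch of how a credential like this might be used to acquire a token silently. The method names are from ADAL.NET, but the sample's own call sites may differ, and the resource URI and user identifier values are placeholders:

using System.Threading.Tasks;
using Microsoft.IdentityModel.Clients.ActiveDirectory;

public static class TokenAcquisitionSketch
{
    // Acquires a token for the web API using the certificate credential instead of a
    // client secret. The resource URI below is a placeholder.
    public static Task<AuthenticationResult> GetAccessTokenAsync(
        AuthenticationContext authContext,
        ClientAssertionCertificate certificateCredential,
        string userObjectId)
    {
        return authContext.AcquireTokenSilentAsync(
            "https://tailspin.onmicrosoft.com/surveys.webapi",   // resource (placeholder)
            certificateCredential,
            new UserIdentifier(userObjectId, UserIdentifierType.UniqueId));
    }
}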
NOTE
For more information on ID Tokens, please see this document.

Next
Federate with a customer's AD FS
11/2/2020 • 6 minutes to read

This article describes how a multitenant SaaS application can support authentication via Active Directory
Federation Services (AD FS), in order to federate with a customer's AD FS.

Overview
Azure Active Directory (Azure AD) makes it easy to sign in users from Azure AD tenants, including Office365 and
Dynamics CRM Online customers. But what about customers who use on-premises Active Directory on a corporate
intranet?
One option is for these customers to sync their on-premises AD with Azure AD, using Azure AD Connect. However,
some customers may be unable to use this approach, due to corporate IT policy or other reasons. In that case,
another option is to federate through Active Directory Federation Services (AD FS).
To enable this scenario:
The customer must have an Internet-facing AD FS farm.
The SaaS provider deploys their own AD FS farm.
The customer and the SaaS provider must set up federation trust. This is a manual process.
There are three main roles in the trust relation:
The customer's AD FS is the account partner, responsible for authenticating users from the customer's AD,
and creating security tokens with user claims.
The SaaS provider's AD FS is the resource partner, which trusts the account partner and receives the user
claims.
The application is configured as a relying party (RP) in the SaaS provider's AD FS.

NOTE
In this article, we assume the application uses OpenID Connect as the authentication protocol. Another option is to use WS-
Federation.
For OpenID Connect, the SaaS provider must use AD FS 2016, running in Windows Server 2016. AD FS 3.0 does not
support OpenID Connect.
For an example of using WS-Federation with ASP.NET 4, see the active-directory-dotnet-webapp-wsfederation
sample.

Authentication flow
1. When the user clicks "sign in", the application redirects to an OpenID Connect endpoint on the SaaS provider's
AD FS.
2. The user enters his or her organizational user name ("[email protected]"). AD FS uses home realm
discovery to redirect to the customer's AD FS, where the user enters their credentials.
3. The customer's AD FS sends user claims to the SaaS provider's AD FS, using WS-Federation (or SAML).
4. Claims flow from AD FS to the app, using OpenID Connect. This requires a protocol transition from WS-
Federation.

Limitations
By default, the relying party application receives only a fixed set of claims available in the id_token, shown in the
following table. With AD FS 2016, you can customize the id_token in OpenID Connect scenarios. For more
information, see Custom ID Tokens in AD FS.

CLAIM                   DESCRIPTION

aud                     Audience. The application for which the claims were issued.

authenticationinstant   Authentication instant. The time at which authentication occurred.

c_hash                  Code hash value. This is a hash of the token contents.

exp                     Expiration time. The time after which the token will no longer be accepted.

iat                     Issued at. The time when the token was issued.

iss                     Issuer. The value of this claim is always the resource partner's AD FS.

name                    User name. Example: [email protected]

nameidentifier          Name identifier. The identifier for the name of the entity for which the token was issued.

nonce                   Session nonce. A unique value generated by AD FS to help prevent replay attacks.

upn                     User principal name (UPN). Example: [email protected]

pwd_exp                 Password expiration period. The number of seconds until the user's password or a similar
                        authentication secret, such as a PIN, expires.
NOTE
The "iss" claim contains the AD FS of the partner (typically, this claim will identify the SaaS provider as the issuer). It does not
identify the customer's AD FS. You can find the customer's domain as part of the UPN.
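
For example, a short sketch of extracting the domain from the UPN claim (this uses the standard UPN claim type; adjust to however the claim is surfaced in your application):

using System.Security.Claims;

public static class UpnHelper
{
    // "user@corp.fabrikam.com" -> "corp.fabrikam.com"
    public static string GetDomainFromUpn(ClaimsPrincipal user)
    {
        var upn = user.FindFirst(ClaimTypes.Upn)?.Value;
        var at = upn?.IndexOf('@') ?? -1;
        return at >= 0 ? upn.Substring(at + 1) : null;
    }
}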

The rest of this article describes how to set up the trust relationship between the RP (the app) and the account
partner (the customer).

AD FS deployment
The SaaS provider can deploy AD FS either on-premises or on Azure VMs. For security and availability, the
following guidelines are important:
Deploy at least two AD FS servers and two AD FS proxy servers to achieve the best availability of the AD FS
service.
Domain controllers and AD FS servers should never be exposed directly to the Internet; they should be in a
virtual network with only indirect access to them.
Web application proxies (previously AD FS proxies) must be used to publish AD FS servers to the Internet.
To set up a similar topology in Azure requires the use of virtual networks, network security groups, virtual
machines, and availability sets. For more details, see Guidelines for deploying Windows Server Active Directory on
Azure Virtual Machines.

Configure OpenID Connect authentication with AD FS


The SaaS provider must enable OpenID Connect between the application and AD FS. To do so, add an application
group in AD FS. You can find detailed instructions in this blog post, under "Setting up a Web App for OpenId
Connect sign in AD FS."
Next, configure the OpenID Connect middleware. The metadata endpoint is
https://domain/adfs/.well-known/openid-configuration, where domain is the SaaS provider's AD FS domain.

Typically you might combine this with other OpenID Connect endpoints (such as Azure AD). You'll need two
different sign-in buttons or some other way to distinguish them, so that the user is sent to the correct
authentication endpoint.
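
A hedged sketch of registering the AD FS endpoint alongside another scheme with the ASP.NET Core OpenID Connect middleware follows. The scheme name, client ID, and AD FS domain are placeholders, and the sample's own middleware configuration may differ:

using Microsoft.AspNetCore.Authentication.Cookies;
using Microsoft.Extensions.DependencyInjection;

public static class AdfsAuthenticationSetup
{
    public static void Configure(IServiceCollection services)
    {
        services.AddAuthentication(options =>
        {
            options.DefaultScheme = CookieAuthenticationDefaults.AuthenticationScheme;
        })
        .AddCookie()
        .AddOpenIdConnect("ADFS", options =>   // scheme name is arbitrary
        {
            options.MetadataAddress =
                "https://adfs.fabrikam.com/adfs/.well-known/openid-configuration"; // placeholder domain
            options.ClientId = "<client ID from the AD FS application group>";
            options.SignInScheme = CookieAuthenticationDefaults.AuthenticationScheme;
        });
        // A second AddOpenIdConnect call with a different scheme name (for example, for
        // Azure AD) can be chained here; the sign-in UI then challenges the chosen scheme.
    }
}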

Configure the AD FS Resource Partner


The SaaS provider must do the following for each customer that wants to connect via AD FS:
1. Add a claims provider trust.
2. Add claims rules.
3. Enable home-realm discovery.
Here are the steps in more detail.
Add the claims provider trust
1. In Server Manager, click Tools, and then select AD FS Management.
2. In the console tree, under AD FS, right-click Claims Provider Trusts. Select Add Claims Provider Trust.
3. Click Start to start the wizard.
4. Select the option "Import data about the claims provider published online or on a local network". Enter the URI
of the customer's federation metadata endpoint. (Example:
https://contoso.com/FederationMetadata/2007-06/FederationMetadata.xml.) You will need to get this from the
customer.
5. Complete the wizard using the default options.
Edit claims rules
1. Right-click the newly added claims provider trust, and select Edit Claims Rules.
2. Click Add Rule.
3. Select "Pass Through or Filter an Incoming Claim" and click Next.
4. Enter a name for the rule.
5. Under "Incoming claim type", select UPN.
6. Select "Pass through all claim values".
7. Click Finish.
8. Repeat steps 2 - 7, and specify Anchor Claim Type for the incoming claim type.
9. Click OK to complete the wizard.
Enable home-realm discovery
Run the following PowerShell script:

Set-AdfsClaimsProviderTrust -TargetName "name" -OrganizationalAccountSuffix @("suffix")

where "name" is the friendly name of the claims provider trust, and "suffix" is the UPN suffix for the customer's AD
(example, "corp.fabrikam.com").
With this configuration, end users can type in their organizational account, and AD FS automatically selects the
corresponding claims provider. See Customizing the AD FS Sign-in Pages, under the section "Configure Identity
Provider to use certain email suffixes".

Configure the AD FS Account Partner


The customer must do the following:
1. Add a relying party (RP) trust.
2. Add claims rules.
Add the RP trust
1. In Server Manager, click Tools, and then select AD FS Management.
2. In the console tree, under AD FS, right-click Relying Party Trusts. Select Add Relying Party Trust.
3. Select Claims Aware and click Start.
4. On the Select Data Source page, select the option "Import data about the claims provider published online or
on a local network". Enter the URI of the SaaS provider's federation metadata endpoint.
5. On the Specify Display Name page, enter any name.
6. On the Choose Access Control Policy page, choose a policy. You could permit everyone in the organization,
or choose a specific security group.
7. Enter any parameters required in the Policy box.
8. Click Next to complete the wizard.
Add claims rules
1. Right-click the newly added relying party trust, and select Edit Claim Issuance Policy.
2. Click Add Rule.
3. Select "Send LDAP Attributes as Claims" and click Next.
4. Enter a name for the rule, such as "UPN".
5. Under Attribute store, select Active Directory.
6. In the Mapping of LDAP attributes section:
Under LDAP Attribute, select User-Principal-Name.
Under Outgoing Claim Type, select UPN.
7. Click Finish.
8. Click Add Rule again.
9. Select "Send Claims Using a Custom Rule" and click Next.
10. Enter a name for the rule, such as "Anchor Claim Type".
11. Under Custom rule, enter the following:

EXISTS([Type == "http://schemas.microsoft.com/ws/2014/01/identity/claims/anchorclaimtype"])=>
issue (Type = "http://schemas.microsoft.com/ws/2014/01/identity/claims/anchorclaimtype",
Value = "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/upn");

This rule issues a claim of type anchorclaimtype . The claim tells the relying party to use UPN as the user's
immutable ID.
12. Click Finish .
13. Click OK to complete the wizard.
Create an Active Directory Domain Services (AD DS)
resource forest in Azure
11/2/2020 • 6 minutes to read

This reference architecture shows how to create a separate Active Directory domain in Azure that is trusted by
domains in your on-premises AD forest. Deploy this solution .

Download a Visio file of this architecture.


Active Directory Domain Services (AD DS) stores identity information in a hierarchical structure. The top node in
the hierarchical structure is known as a forest. A forest contains domains, and domains contain other types of
objects. This reference architecture creates an AD DS forest in Azure with a one-way outgoing trust relationship
with an on-premises domain. The forest in Azure contains a domain that does not exist on-premises. Because of the
trust relationship, logons made against on-premises domains can be trusted for access to resources in the separate
Azure domain.
Typical uses for this architecture include maintaining security separation for objects and identities held in the cloud,
and migrating individual domains from on-premises to the cloud.
For additional considerations, see Choose a solution for integrating on-premises Active Directory with Azure.

Architecture
The architecture has the following components.
On-premises network. The on-premises network contains its own Active Directory forest and domains.
Active Directory servers. These are domain controllers implementing domain services running as VMs in the
cloud. These servers host a forest containing one or more domains, separate from those located on-premises.
One-way trust relationship. The example in the diagram shows a one-way trust from the domain in Azure to
the on-premises domain. This relationship enables on-premises users to access resources in the domain in
Azure, but not the other way around. It is possible to create a two-way trust if cloud users also require access to
on-premises resources.
Active Directory subnet. The AD DS servers are hosted in a separate subnet. Network security group (NSG)
rules protect the AD DS servers and provide a firewall against traffic from unexpected sources.
Azure gateway. The Azure gateway provides a connection between the on-premises network and the Azure
VNet. This can be a VPN connection or Azure ExpressRoute. For more information, see Connect an on-premises
network to Azure using a VPN gateway.

Recommendations
For specific recommendations on implementing Active Directory in Azure, see Extending Active Directory Domain
Services (AD DS) to Azure.
Trust
The on-premises domains are contained within a different forest from the domains in the cloud. To enable
authentication of on-premises users in the cloud, the domains in Azure must trust the logon domain in the on-
premises forest. Similarly, if the cloud provides a logon domain for external users, it may be necessary for the on-
premises forest to trust the cloud domain.
You can establish trusts at the forest level by creating forest trusts, or at the domain level by creating external
trusts. A forest level trust creates a relationship between all domains in two forests. An external domain level trust
only creates a relationship between two specified domains. You should only create external domain level trusts
between domains in different forests.
Trusts can be unidirectional (one-way) or bidirectional (two-way):
A one-way trust enables users in one domain or forest (known as the incoming domain or forest) to access the
resources held in another (the outgoing domain or forest).
A two-way trust enables users in either domain or forest to access resources held in the other.
The following table summarizes trust configurations for some simple scenarios:

SCENARIO                                                    ON-PREMISES TRUST                CLOUD TRUST

On-premises users require access to resources in the       One-way, incoming                One-way, outgoing
cloud, but not vice versa

Users in the cloud require access to resources located     One-way, outgoing                One-way, incoming
on-premises, but not vice versa

Users in the cloud and on-premises both require access     Two-way, incoming and outgoing   Two-way, incoming and outgoing
to resources held in the cloud and on-premises

Scalability considerations
Active Directory is automatically scalable for domain controllers that are part of the same domain. Requests are
distributed across all controllers within a domain. You can add another domain controller, and it synchronizes
automatically with the domain. Do not configure a separate load balancer to direct traffic to controllers within the
domain. Ensure that all domain controllers have sufficient memory and storage resources to handle the domain
database. Make all domain controller VMs the same size.

Availability considerations
Provision at least two domain controllers for each domain. This enables automatic replication between servers.
Create an availability set for the VMs acting as Active Directory servers handling each domain. Put at least two
servers in this availability set.
Also, consider designating one or more servers in each domain as standby operations masters in case connectivity
to a server acting as a flexible single master operation (FSMO) role fails.

Manageability considerations
For information about management and monitoring considerations, see Extending Active Directory to Azure.
For additional information, see Monitoring Active Directory. You can install tools such as Microsoft Systems Center
on a monitoring server in the management subnet to help perform these tasks.

Security considerations
Forest level trusts are transitive. If you establish a forest level trust between an on-premises forest and a forest in
the cloud, this trust is extended to other new domains created in either forest. If you use domains to provide
separation for security purposes, consider creating trusts at the domain level only. Domain level trusts are non-
transitive.
For Active Directory-specific security considerations, see the security considerations section in Extending Active
Directory to Azure.

DevOps considerations
For DevOps considerations, see DevOps: Extending Active Directory Domain Services (AD DS) to Azure.

Cost considerations
Use the Azure pricing calculator to estimate costs. Other considerations are described in the Cost section in
Microsoft Azure Well-Architected Framework.
Here are cost considerations for the services used in this architecture.
AD Domain Services
Consider having Active Directory Domain Services as a shared service that is consumed by multiple workloads to
lower costs. For more information, see Active Directory Domain Services pricing.
Azure VPN Gateway
The main component of this architecture is the VPN gateway service. You are charged based on the amount of time
that the gateway is provisioned and available.
All inbound traffic is free, all outbound traffic is charged. Internet bandwidth costs are applied to VPN outbound
traffic.
For more information, see VPN Gateway Pricing.

Deploy the solution


A deployment for this architecture is available on GitHub. Note that the entire deployment can take up to two
hours, which includes creating the VPN gateway and running the scripts that configure AD DS.
Prerequisites
1. Clone, fork, or download the zip file for the GitHub repository.
2. Install Azure CLI 2.0.
3. Install the Azure building blocks npm package.
npm install -g @mspnp/azure-building-blocks

4. From a command prompt, bash prompt, or PowerShell prompt, sign into your Azure account as follows:

az login

Deploy the simulated on-premises datacenter


1. Navigate to the identity/adds-forest folder of the GitHub repository.
2. Open the onprem.json file. Search for instances of adminPassword and Password and add values for the
passwords.
3. Run the following command and wait for the deployment to finish:

azbb -s <subscription_id> -g <resource group> -l <location> -p onprem.json --deploy

Deploy the Azure VNet


1. Open the azure.json file. Search for instances of adminPassword and Password and add values for the
passwords.
2. In the same file, search for instances of sharedKey and enter shared keys for the VPN connection.

"sharedKey": "",

3. Run the following command and wait for the deployment to finish.

azbb -s <subscription_id> -g <resource group> -l <location> -p azure.json --deploy

Deploy to the same resource group as the on-premises VNet.


Test the AD trust relation
1. In the Azure portal, navigate to the resource group that you created.
2. Use the Azure portal to find the VM named ra-adt-mgmt-vm1.
3. Click Connect to open a remote desktop session to the VM. The username is contoso\testuser, and the
password is the one that you specified in the onprem.json parameter file.
4. From inside your remote desktop session, open another remote desktop session to 192.168.0.4, which is the
IP address of the VM named ra-adtrust-onpremise-ad-vm1. The username is contoso\testuser, and the
password is the one that you specified in the azure.json parameter file.
5. From inside the remote desktop session for ra-adtrust-onpremise-ad-vm1, go to Server Manager and click
Tools > Active Directory Domains and Trusts.
6. In the left pane, right-click contoso.com and select Properties.
7. Click the Trusts tab. You should see treyresearch.net listed as an incoming trust.
Next steps
Learn the best practices for extending your on-premises AD DS domain to Azure
Learn the best practices for creating an AD FS infrastructure in Azure.
Extend your on-premises Active Directory domain to
Azure
11/2/2020 • 9 minutes to read

This architecture shows how to extend an on-premises Active Directory domain to Azure to provide distributed
authentication services. Deploy this solution .

Download a Visio file of this architecture.


If your application is hosted partly on-premises and partly in Azure, it may be more efficient to replicate Active
Directory Domain Services (AD DS) in Azure. This can reduce the latency caused by sending authentication
requests from the cloud back to AD DS running on-premises.
This architecture is commonly used when the on-premises network and the Azure virtual network are connected
by a VPN or ExpressRoute connection. This architecture also supports bidirectional replication, meaning changes
can be made either on-premises or in the cloud, and both sources will be kept consistent. Typical uses for this
architecture include hybrid applications in which functionality is distributed between on-premises and Azure, and
applications and services that perform authentication using Active Directory.
For additional considerations, see Choose a solution for integrating on-premises Active Directory with Azure.

Architecture
This architecture extends the hybrid network architecture shown in Connect an on-premises network to Azure
using a VPN gateway. It has the following components.
On-premises network. The on-premises network includes local Active Directory servers that can perform
authentication and authorization for components located on-premises.
Active Directory servers. These are domain controllers implementing directory services (AD DS) running as
VMs in the cloud. These servers can provide authentication of components running in your Azure virtual
network.
Active Directory subnet. The AD DS servers are hosted in a separate subnet. Network security group (NSG)
rules protect the AD DS servers and provide a firewall against traffic from unexpected sources.
Azure Gateway and Active Directory synchronization. The Azure gateway provides a connection
between the on-premises network and the Azure VNet. This can be a VPN connection or Azure ExpressRoute.
All synchronization requests between the Active Directory servers in the cloud and on-premises pass through
the gateway. User-defined routes (UDRs) handle routing for on-premises traffic that passes to Azure.

Recommendations
The following recommendations apply for most scenarios. Follow these recommendations unless you have a
specific requirement that overrides them.
VM recommendations
Determine your VM size requirements based on the expected volume of authentication requests. Use the
specifications of the machines hosting AD DS on premises as a starting point, and match them with the Azure VM
sizes. Once deployed, monitor utilization and scale up or down based on the actual load on the VMs. For more
information about sizing AD DS domain controllers, see Capacity Planning for Active Directory Domain Services.
Create a separate virtual data disk for storing the database, logs, and sysvol folder for Active Directory. Do not
store these items on the same disk as the operating system. By default, data disks that are attached to a VM use
write-through caching. However, this form of caching can conflict with the requirements of AD DS. For this reason,
set the Host Cache Preference setting on the data disk to None.
Deploy at least two VMs running AD DS as domain controllers and add them to an availability set.
Networking recommendations
Configure the VM network interface (NIC) for each AD DS server with a static private IP address for full domain
name service (DNS) support. For more information, see How to set a static private IP address in the Azure portal.

NOTE
Do not configure the VM NIC for any AD DS with a public IP address. See Security considerations for more details.

The Active Directory subnet NSG requires rules to permit incoming traffic from on-premises and outgoing traffic
to on-premises. For detailed information on the ports used by AD DS, see Active Directory and Active Directory
Domain Services Port Requirements.
Active Directory site
In AD DS, a site represents a physical location, network, or collection of devices. AD DS sites are used to manage
AD DS database replication by grouping together AD DS objects that are located close to one another and are
connected by a high-speed network. AD DS includes logic to select the best strategy for replicating the AD DS
database between sites.
We recommend that you create an AD DS site including the subnets defined for your application in Azure. Then,
configure a site link between your on-premises AD DS sites, and AD DS will automatically perform the most
efficient database replication possible. This database replication requires little beyond the initial configuration.
Active Directory operations masters
The operations masters role can be assigned to AD DS domain controllers to support consistency checking
between instances of replicated AD DS databases. There are five operations master roles: schema master, domain
naming master, relative identifier master, primary domain controller master emulator, and infrastructure master.
For more information about these roles, see What are Operations Masters?.
We recommend you do not assign operations masters roles to the domain controllers deployed in Azure.
Monitoring
Monitor the resources of the domain controller VMs as well as the AD DS Services and create a plan to quickly
correct any problems. For more information, see Monitoring Active Directory. You can also install tools such as
Microsoft Systems Center on the monitoring server (see the architecture diagram) to help perform these tasks.
Scalability considerations
AD DS is designed for scalability. You don't need to configure a load balancer or traffic controller to direct requests
to AD DS domain controllers. The only scalability consideration is to configure the VMs running AD DS with the
correct size for your network load requirements, monitor the load on the VMs, and scale up or down as necessary.

Availability considerations
Deploy the VMs running AD DS into an availability set. Also, consider assigning the role of standby operations
master to at least one server, and possibly more depending on your requirements. A standby operations master is
an active copy of the operations master that can be used in place of the primary operations masters server during
failover.

Manageability considerations
Perform regular AD DS backups. Don't copy the VHD files of domain controllers instead of performing regular
backups, because the AD DS database file on the VHD may not be in a consistent state when it's copied, making it
impossible to restart the database.
Do not shut down a domain controller VM using Azure portal. Instead, shut down and restart from the guest
operating system. Shutting down through the portal causes the VM to be deallocated, which resets both the
VM-GenerationID and the invocationID of the Active Directory repository. This discards the AD DS relative
identifier (RID) pool and marks the sysvol folder as nonauthoritative, and may require reconfiguration of the
domain controller.

Security considerations
AD DS servers provide authentication services and are an attractive target for attacks. To secure them, prevent
direct Internet connectivity by placing the AD DS servers in a separate subnet with an NSG acting as a firewall.
Close all ports on the AD DS servers except those necessary for authentication, authorization, and server
synchronization. For more information, see Active Directory and Active Directory Domain Services Port
Requirements.
Use either BitLocker or Azure disk encryption to encrypt the disk hosting the AD DS database.

DevOps considerations
Use Infrastructure as Code (IaC) practice, to provision and configure the network and security
infrastructure. This reference architecture uses an Azure Building Blocks (AZBB) template. Another option is
Azure Resource Manager templates.
Isolate workloads to enable DevOps to do continuous integration and continuous delivery (CI/CD), because
every workload is associated and managed by its corresponding DevOps team.
In this architecture the entire virtual network that includes the different application tiers, management
jumpbox, and Azure AD Domain Services is identified as a single isolated workload. That workload is
declared in a single AZBB template.
Virtual machines are configured by using Virtual Machine Extensions and other tools such as Desired State
Configuration (DSC), used to configure ADDS on the virtual machines.
Consider using Azure DevOps or any other CI/CD solutions to automate your deployments. Azure Pipelines
is the recommended component of Azure DevOps Services that brings automation for solution builds and
deployments, it's also highly integrated in the Azure ecosystem.
Use Azure Monitor to analyze the performance of your infrastructure. It also allows you to monitor and
diagnose networking issues without logging into your virtual machines. Application Insights provides rich
metrics and logs to verify the state of your infrastructure.
For more information, see the DevOps section in Microsoft Azure Well-Architected Framework.

Cost considerations
Use the Azure pricing calculator to estimate costs. Other considerations are described in the Cost section in
Microsoft Azure Well-Architected Framework.
Here are cost considerations for the services used in this architecture.
AD Domain Services
Consider having Active Directory Domain Services as a shared service that is consumed by multiple workloads to
lower costs. For more information, see Active Directory Domain Services pricing.
Azure VPN Gateway
The main component of this architecture is the VPN gateway service. You are charged based on the amount of
time that the gateway is provisioned and available.
All inbound traffic is free, all outbound traffic is charged. Internet bandwidth costs are applied to VPN outbound
traffic.
For more information, see VPN Gateway Pricing.
Azure Virtual Network
Azure Virtual Network is free. Every subscription is allowed to create up to 50 virtual networks across all regions.
All traffic that occurs within the boundaries of a virtual network is free. So, communication between two VMs in
the same virtual network is free.

Deploy the solution


A deployment for this architecture is available on GitHub. The entire deployment can take up to two hours, which
includes creating the VPN gateway and running the scripts that configure AD DS.
Prerequisites
1. Clone, fork, or download the zip file for the GitHub repository.
2. Install Azure CLI 2.0.
3. Install the Azure building blocks npm package.

npm install -g @mspnp/azure-building-blocks

4. From a command prompt, bash prompt, or PowerShell prompt, sign into your Azure account as follows:

az login

Deploy the simulated on-premises datacenter


1. Navigate to the identity/adds-extend-domain folder of the GitHub repository.
2. Open the onprem.json file. Search for instances of adminPassword and Password and add values for the
passwords.
3. Run the following command and wait for the deployment to finish:
azbb -s <subscription_id> -g <resource group> -l <location> -p onprem.json --deploy

Deploy the Azure VNet


1. Open the azure.json file. Search for instances of adminPassword and Password and add values for the
passwords.
2. In the same file, search for instances of sharedKey and enter shared keys for the VPN connection.

"sharedKey": "",

3. Run the following command and wait for the deployment to finish.

azbb -s <subscription_id> -g <resource group> -l <location> -p azure.json --deploy

Deploy to the same resource group as the on-premises VNet.


Test connectivity with the Azure VNet
After deployment completes, you can test connectivity from the simulated on-premises environment to the Azure
VNet.
1. In the Azure portal, navigate to the resource group that you created.
2. Find the VM named ra-onpremise-mgmt-vm1.
3. Click Connect to open a remote desktop session to the VM. The username is contoso\testuser, and the
password is the one that you specified in the onprem.json parameter file.
4. From inside your remote desktop session, open another remote desktop session to 10.0.4.4, which is the IP
address of the VM named adds-vm1. The username is contoso\testuser, and the password is the one that
you specified in the azure.json parameter file.
5. From inside the remote desktop session for adds-vm1, go to Server Manager and click Add other
servers to manage.
6. In the Active Directory tab, click Find now. You should see a list of the AD, AD DS, and Web VMs.
Next steps
Learn the best practices for creating an AD DS resource forest in Azure.
Learn the best practices for creating an Active Directory Federation Services (AD FS) infrastructure in Azure.
Extend Active Directory Federation Services (AD FS)
to Azure
11/2/2020 • 15 minutes to read

This reference architecture implements a secure hybrid network that extends your on-premises network to Azure
and uses Active Directory Federation Services (AD FS) to perform federated authentication and authorization for
components running in Azure. Deploy this solution .

Download a Visio file of this architecture.


AD FS can be hosted on-premises, but if your application is a hybrid in which some parts are implemented in
Azure, it may be more efficient to replicate AD FS in the cloud.
The diagram shows the following scenarios:
Application code from a partner organization accesses a web application hosted inside your Azure VNet.
An external, registered user with credentials stored inside Active Directory Domain Services (DS) accesses a
web application hosted inside your Azure VNet.
A user connected to your VNet using an authorized device executes a web application hosted inside your Azure
VNet.
Typical uses for this architecture include:
Hybrid applications where workloads run partly on-premises and partly in Azure.
Solutions that use federated authorization to expose web applications to partner organizations.
Systems that support access from web browsers running outside of the organizational firewall.
Systems that enable users to access to web applications by connecting from authorized external devices such as
remote computers, notebooks, and other mobile devices.
This reference architecture focuses on passive federation, in which the federation servers decide how and when to
authenticate a user. The user provides sign in information when the application is started. This mechanism is most
commonly used by web browsers and involves a protocol that redirects the browser to a site where the user
authenticates. AD FS also supports active federation, where an application takes on responsibility for supplying
credentials without further user interaction, but that scenario is outside the scope of this architecture.
For additional considerations, see Choose a solution for integrating on-premises Active Directory with Azure.

Architecture
This architecture extends the implementation described in Extending AD DS to Azure. It contains the following
components.
AD DS subnet. The AD DS servers are contained in their own subnet with network security group (NSG)
rules acting as a firewall.
AD DS servers. Domain controllers running as VMs in Azure. These servers provide authentication of local
identities within the domain.
AD FS subnet. The AD FS servers are located within their own subnet with NSG rules acting as a firewall.
AD FS servers. The AD FS servers provide federated authorization and authentication. In this architecture,
they perform the following tasks:
Receiving security tokens containing claims made by a partner federation server on behalf of a
partner user. AD FS verifies that the tokens are valid before passing the claims to the web application
running in Azure to authorize requests.
The application running in Azure is the relying party. The partner federation server must issue claims
that are understood by the web application. The partner federation servers are referred to as account
partners, because they submit access requests on behalf of authenticated accounts in the partner
organization. The AD FS servers are called resource partners because they provide access to
resources (the web application).
Authenticating and authorizing incoming requests from external users running a web browser or
device that needs access to web applications, by using AD DS and the Active Directory Device
Registration Service.
The AD FS servers are configured as a farm accessed through an Azure load balancer. This implementation
improves availability and scalability. The AD FS servers are not exposed directly to the Internet. All Internet
traffic is filtered through AD FS web application proxy servers and a DMZ (also referred to as a perimeter
network).
For more information about how AD FS works, see Active Directory Federation Services Overview. Also, the
article AD FS deployment in Azure contains a detailed step-by-step introduction to implementation.
AD FS proxy subnet . The AD FS proxy servers can be contained within their own subnet, with NSG rules
providing protection. The servers in this subnet are exposed to the Internet through a set of network virtual
appliances that provide a firewall between your Azure virtual network and the Internet.
AD FS web application proxy (WAP) servers. These VMs act as AD FS servers for incoming requests
from partner organizations and external devices. The WAP servers act as a filter, shielding the AD FS servers
from direct access from the Internet. As with the AD FS servers, deploying the WAP servers in a farm with
load balancing gives you greater availability and scalability than deploying a collection of stand-alone
servers.

NOTE
For detailed information about installing WAP servers, see Install and Configure the Web Application Proxy Server

Partner organization. A partner organization running a web application that requests access to a web
application running in Azure. The federation server at the partner organization authenticates requests
locally, and submits security tokens containing claims to AD FS running in Azure. AD FS in Azure validates
the security tokens, and if valid can pass the claims to the web application running in Azure to authorize
them.

NOTE
You can also configure a VPN tunnel using Azure gateway to provide direct access to AD FS for trusted partners.
Requests received from these partners do not pass through the WAP servers.

Recommendations
The following recommendations apply for most scenarios. Follow these recommendations unless you have a
specific requirement that overrides them.
Networking recommendations
Configure the network interface for each of the VMs hosting AD FS and WAP servers with static private IP
addresses.
Do not give the AD FS VMs public IP addresses. For more information, see the Security considerations section.
Set the IP address of the preferred and secondary domain name service (DNS) servers for the network interfaces
for each AD FS and WAP VM to reference the Active Directory DS VMs. The Active Directory DS VMs should be
running DNS. This step is necessary to enable each VM to join the domain.
AD FS installation
The article Deploying a Federation Server Farm provides detailed instructions for installing and configuring AD FS.
Perform the following tasks before configuring the first AD FS server in the farm:
1. Obtain a publicly trusted certificate for performing server authentication. The subject name must contain
the name clients use to access the federation service. This can be the DNS name registered for the load
balancer, for example, adfs.contoso.com (avoid using wildcard names such as *.contoso.com, for security
reasons). Use the same certificate on all AD FS server VMs. You can purchase a certificate from a trusted
certification authority, but if your organization uses Active Directory Certificate Services you can create your
own.
The subject alternative name is used by the device registration service (DRS) to enable access from external
devices. This should be of the form enterpriseregistration.contoso.com.
For more information, see Obtain and Configure a Secure Sockets Layer (SSL) Certificate for AD FS.
2. On the domain controller, generate a new root key for the Key Distribution Service. Set the effective time to
the current time minus 10 hours (this configuration reduces the delay that can occur in distributing and
synchronizing keys across the domain). This step is necessary to support creating the group service account
that is used to run the AD FS service. The following PowerShell command shows an example of how to do
this:
Add-KdsRootKey -EffectiveTime (Get-Date).AddHours(-10)

3. Add each AD FS server VM to the domain.

NOTE
To install AD FS, the domain controller running the primary domain controller (PDC) emulator flexible single master
operation (FSMO) role for the domain must be running and accessible from the AD FS VMs.

AD FS trust
Establish federation trust between your AD FS installation, and the federation servers of any partner organizations.
Configure any claims filtering and mapping required.
DevOps staff at each partner organization must add a relying party trust for the web applications accessible
through your AD FS servers.
DevOps staff in your organization must configure claims-provider trust to enable your AD FS servers to trust
the claims that partner organizations provide.
DevOps staff in your organization must also configure AD FS to pass claims on to your organization's web
applications.
For more information, see Establishing Federation Trust.
Publish your organization's web applications and make them available to external partners by using
preauthentication through the WAP servers. For more information, see Publish Applications using AD FS
Preauthentication
AD FS supports token transformation and augmentation. Azure Active Directory does not provide this feature.
With AD FS, when you set up the trust relationships, you can:
Configure claim transformations for authorization rules. For example, you can map group security from a
representation used by a non-Microsoft partner organization to something that Active Directory DS can
authorize in your organization.
Transform claims from one format to another. For example, you can map from SAML 2.0 to SAML 1.1 if your
application only supports SAML 1.1 claims.
AD FS monitoring
The Microsoft System Center Management Pack for Active Directory Federation Services 2012 R2 provides both
proactive and reactive monitoring of your AD FS deployment for the federation server. This management pack
monitors:
Events that the AD FS service records in its event logs.
The performance data that the AD FS performance counters collect.
The overall health of the AD FS system and web applications (relying parties), and provides alerts for critical
issues and warnings.

Scalability considerations
The following considerations, summarized from the article Plan your AD FS deployment, give a starting point for
sizing AD FS farms:
If you have fewer than 1000 users, do not create dedicated servers, but instead install AD FS on each of the
Active Directory DS servers in the cloud. Make sure that you have at least two Active Directory DS servers to
maintain availability. Create a single WAP server.
If you have between 1000 and 15000 users, create two dedicated AD FS servers and two dedicated WAP
servers.
If you have between 15000 and 60000 users, create between three and five dedicated AD FS servers and at
least two dedicated WAP servers.
These considerations assume that you are using dual quad-core VM (Standard D4_v2, or better) sizes in Azure.
If you are using the Windows Internal Database to store AD FS configuration data, you are limited to eight AD FS
servers in the farm. If you anticipate that you will need more in the future, use SQL Server. For more information,
see The Role of the AD FS Configuration Database.

Availability considerations
Create an AD FS farm with at least two servers to increase availability of the service. Use different storage accounts
for each AD FS VM in the farm. This approach helps to ensure that a failure in a single storage account does not
make the entire farm inaccessible.
Create separate Azure availability sets for the AD FS and WAP VMs. Ensure that there are at least two VMs in each
set. Each availability set must have at least two update domains and two fault domains.
Configure the load balancers for the AD FS VMs and WAP VMs as follows:
Use an Azure load balancer to provide external access to the WAP VMs, and an internal load balancer to
distribute the load across the AD FS servers in the farm.
Only pass traffic appearing on port 443 (HTTPS) to the AD FS/WAP servers.
Give the load balancer a static IP address.
Create a health probe using HTTP against /adfs/probe . For more information, see Hardware Load Balancer
Health Checks and Web Application Proxy / AD FS 2012 R2.

NOTE
AD FS servers use the Server Name Indication (SNI) protocol, so attempting to probe using an HTTPS endpoint from
the load balancer fails.

Add a DNS A record to the domain for the AD FS load balancer. Specify the IP address of the load balancer,
and give it a name in the domain (such as adfs.contoso.com). This is the name clients and the WAP servers
use to access the AD FS server farm.
You can use either SQL Server or the Windows Internal Database to hold AD FS configuration information. The
Windows Internal Database provides basic redundancy. Changes are written directly to only one of the AD FS
databases in the AD FS cluster, while the other servers use pull replication to keep their databases up to date. Using
SQL Server can provide full database redundancy and high availability using failover clustering or mirroring.

Manageability considerations
DevOps staff should be prepared to perform the following tasks:
Managing the federation servers, including managing the AD FS farm, managing trust policy on the federation
servers, and managing the certificates used by the federation services.
Managing the WAP servers including managing the WAP farm and certificates.
Managing web applications including configuring relying parties, authentication methods, and claims
mappings.
Backing up AD FS components.
Security considerations
AD FS uses HTTPS, so make sure that the NSG rules for the subnet containing the web tier VMs permit HTTPS
requests. These requests can originate from the on-premises network, the subnets containing the web tier,
business tier, data tier, private DMZ, public DMZ, and the subnet containing the AD FS servers.
Prevent direct exposure of the AD FS servers to the Internet. AD FS servers are domain-joined computers that have
full authorization to grant security tokens. If a server is compromised, a malicious user can issue full access tokens
to all web applications and to all federation servers that are protected by AD FS. If your system must handle
requests from external users not connecting from trusted partner sites, use WAP servers to handle these requests.
For more information, see Where to Place a Federation Server Proxy.
Place AD FS servers and WAP servers in separate subnets with their own firewalls. You can use NSG rules to define
firewall rules. All firewalls should allow traffic on port 443 (HTTPS).
Restrict direct sign in access to the AD FS and WAP servers. Only DevOps staff should be able to connect. Do not
join the WAP servers to the domain.
Consider using a set of network virtual appliances that logs detailed information on traffic traversing the edge of
your virtual network for auditing purposes.

DevOps considerations
For DevOps considerations, see DevOps: Extending Active Directory Domain Services (AD DS) to Azure.

Cost considerations
Use the Azure pricing calculator to estimate costs. Other considerations are described in the Cost section in
Microsoft Azure Well-Architected Framework.
Here are cost considerations for the services used in this architecture.
AD Domain Services
Consider having Active Directory Domain Services as a shared service that is consumed by multiple workloads to
lower costs. For more information, see Active Directory Domain Services pricing.
Azure AD Federation Services
For information about the editions offered by Azure Active Directory, see Azure AD pricing. The AD Federation
Services feature is available in all editions.

Deploy the solution


A deployment for this architecture is available on GitHub. Note that the entire deployment can take up to two
hours, which includes creating the VPN gateway and running the scripts that configure Active Directory and AD FS.
Prerequisites
1. Clone, fork, or download the zip file for the GitHub repository.
2. Install Azure CLI 2.0.
3. Install the Azure building blocks npm package.

npm install -g @mspnp/azure-building-blocks

4. From a command prompt, bash prompt, or PowerShell prompt, sign into your Azure account as follows:
az login

Deploy the simulated on-premises datacenter


1. Navigate to the adfs folder of the GitHub repository.
2. Open the onprem.json file. Search for instances of adminPassword , Password , and SafeModeAdminPassword
and update the passwords.
3. Run the following command and wait for the deployment to finish:

azbb -s <subscription_id> -g <resource group> -l <location> -p onprem.json --deploy

Deploy the Azure infrastructure


1. Open the azure.json file. Search for instances of adminPassword and Password and add values for the
passwords.
2. Run the following command and wait for the deployment to finish:

azbb -s <subscription_id> -g <resource group> -l <location> -p azure.json --deploy

Set up the AD FS farm


1. Open the adfs-farm-first.json file. Search for AdminPassword and replace the default password.
2. Run the following command:

azbb -s <subscription_id> -g <resource group> -l <location> -p adfs-farm-first.json --deploy

3. Open the adfs-farm-rest.json file. Search for AdminPassword and replace the default password.
4. Run the following command and wait for the deployment to finish:

azbb -s <subscription_id> -g <resource group> -l <location> -p adfs-farm-rest.json --deploy

Configure AD FS (part 1)
1. Open a remote desktop session to the VM named ra-adfs-jb-vm1 , which is the jumpbox VM. The user
name is testuser .
2. From the jumpbox, open a remote desktop session to the VM named ra-adfs-proxy-vm1 . The private IP
address is 10.0.6.4.
3. From this remote desktop session, run the PowerShell ISE.
4. In PowerShell, navigate to the following directory:

C:\Packages\Plugins\Microsoft.Powershell.DSC\2.77.0.0\DSCWork\adfs-v2.0

5. Paste the following code into a script pane and run it:
. .\adfs-webproxy.ps1
$cd = @{
    AllNodes = @(
        @{
            NodeName = 'localhost'
            PSDscAllowPlainTextPassword = $true
            PSDscAllowDomainUser = $true
        }
    )
}

$c1 = Get-Credential -UserName testuser -Message "Enter password"

InstallWebProxyApp -DomainName contoso.com -FederationName adfs.contoso.com -WebApplicationProxyName "Contoso App" -AdminCreds $c1 -ConfigurationData $cd
Start-DscConfiguration .\InstallWebProxyApp

At the Get-Credential prompt, enter the password that you specified in the deployment parameter file.
6. Run the following command to monitor the progress of the DSC configuration:

Get-DscConfigurationStatus

It can take several minutes to reach consistency. During this time, you may see errors from the command.
When the configuration succeeds, the output should look similar to the following:

PS C:\Packages\Plugins\Microsoft.Powershell.DSC\2.77.0.0\DSCWork\adfs-v2.0> Get-DscConfigurationStatus

Status    StartDate              Type         Mode  RebootRequested  NumberOfResources
------    ---------              ----         ----  ---------------  -----------------
Success   12/17/2018 8:21:09 PM  Consistency  PUSH  True             4

Configure AD FS (part 2)
1. From the jumpbox, open a remote desktop session to the VM named ra-adfs-proxy-vm2 . The private IP
address is 10.0.6.5.
2. From this remote desktop session, run the PowerShell ISE.
3. Navigate to the following directory:

C:\Packages\Plugins\Microsoft.Powershell.DSC\2.77.0.0\DSCWork\adfs-v2.0

4. Paste the following into a script pane and run the script:

. .\adfs-webproxy-rest.ps1
$cd = @{
    AllNodes = @(
        @{
            NodeName = 'localhost'
            PSDscAllowPlainTextPassword = $true
            PSDscAllowDomainUser = $true
        }
    )
}

$c1 = Get-Credential -UserName testuser -Message "Enter password"

InstallWebProxy -DomainName contoso.com -FederationName adfs.contoso.com -WebApplicationProxyName "Contoso App" -AdminCreds $c1 -ConfigurationData $cd
Start-DscConfiguration .\InstallWebProxy
At the Get-Credential prompt, enter the password that you specified in the deployment parameter file.
5. Run the following command to monitor the progress of the DSC configuration:

Get-DscConfigurationStatus

It can take several minutes to reach consistency. During this time, you may see errors from the command.
When the configuration succeeds, the output should look similar to the following:

PS C:\Packages\Plugins\Microsoft.Powershell.DSC\2.77.0.0\DSCWork\adfs-v2.0> Get-DscConfigurationStatus

Status    StartDate              Type         Mode  RebootRequested  NumberOfResources
------    ---------              ----         ----  ---------------  -----------------
Success   12/17/2018 8:21:09 PM  Consistency  PUSH  True             4

Sometimes this DSC configuration fails. If the status check shows Status=Failure and Type=Consistency, try
rerunning step 4.
Sign into AD FS
1. From the jumpbox, open a remote desktop session to the VM named ra-adfs-adfs-vm1 . The private IP
address is 10.0.5.4.
2. Follow the steps in Enable the idp-initiated sign-on page to enable the sign-on page.
3. From the jumpbox, browse to https://adfs.contoso.com/adfs/ls/idpinitiatedsignon.htm. You may receive a
certificate warning that you can ignore for this test.
4. Verify that the Contoso Corporation sign-in page appears. Sign in as contoso\testuser .
Hybrid Identity
11/2/2020 • 2 minutes to read

Solution Idea

The need to keep application components on-premises doesn't have to be a barrier to adopting cloud technologies.
With Azure Stack Hub, app components can reside on-premises while interacting with components running in
Azure public cloud. This blueprint enables teams to manage identity for users as well as applications in a way that is
consistent across clouds.

Architecture

Download an SVG of this architecture.


Data Flow
1. Set up an Azure Active Directory tenant.
2. Create users.
3. Deploy, manage, and operate application resources on Azure and Azure Stack Hub.
4. Create service principals.
5. Deploy with service principals.
6. Application resources can communicate over the network.

Components
Azure Stack Hub is a hybrid cloud platform that lets you use Azure services on-premises.
Virtual Machines: Provision Windows and Linux virtual machines in seconds.
Learn how to synchronize directories and enable single sign-on with Azure Active Directory (Azure AD).

Next steps
Azure Stack Hub User Documentation
Virtual Machines Overview
Azure Active Directory Documentation
Integrate on-premises Active Directory domains with
Azure Active Directory
11/2/2020 • 16 minutes to read

Azure Active Directory (Azure AD) is a cloud-based multi-tenant directory and identity service. This reference
architecture shows best practices for integrating on-premises Active Directory domains with Azure AD to provide
cloud-based identity authentication. Deploy this solution.

Download a Visio file of this architecture.

NOTE
For simplicity, this diagram only shows the connections directly related to Azure AD, and not protocol-related traffic that may
occur as part of authentication and identity federation. For example, a web application may redirect the web browser to
authenticate the request through Azure AD. Once authenticated, the request can be passed back to the web application, with
the appropriate identity information.

Typical uses for this reference architecture include:


Web applications deployed in Azure that provide access to remote users who belong to your organization.
Implementing self-service capabilities for end-users, such as resetting their passwords, and delegating group
management. This requires Azure AD Premium edition.
Architectures in which the on-premises network and the application's Azure VNet are not connected using a
VPN tunnel or ExpressRoute circuit.

NOTE
Azure AD can authenticate the identity of users and applications that exist in an organization's directory. Some applications
and services, such as SQL Server, may require computer authentication, in which case this solution is not appropriate.

For additional considerations, see Choose a solution for integrating on-premises Active Directory with Azure.

Architecture
The architecture has the following components.
Azure AD tenant. An instance of Azure AD created by your organization. It acts as a directory service for
cloud applications by storing objects copied from the on-premises Active Directory and provides identity
services.
Web tier subnet. This subnet holds VMs that run a web application. Azure AD can act as an identity broker
for this application.
On-premises AD DS server. An on-premises directory and identity service. The AD DS directory can be
synchronized with Azure AD to enable it to authenticate on-premises users.
Azure AD Connect sync server. An on-premises computer that runs the Azure AD Connect sync service.
This service synchronizes information held in the on-premises Active Directory to Azure AD. For example, if
you provision or deprovision groups and users on-premises, these changes propagate to Azure AD.

NOTE
For security reasons, Azure AD stores users' passwords as hashes. If a user requires a password reset, this must be
performed on-premises and the new hash must be sent to Azure AD. Azure AD Premium editions include features
that can automate this task to enable users to reset their own passwords.

VMs for N-tier application. The deployment includes infrastructure for an N-tier application. For more
information about these resources, see Run VMs for an N-tier architecture.

Recommendations
The following recommendations apply for most scenarios. Follow these recommendations unless you have a
specific requirement that overrides them.
Azure AD Connect sync service
The Azure AD Connect sync service ensures that identity information stored in the cloud is consistent with that
held on-premises. You install this service using the Azure AD Connect software.
Before implementing Azure AD Connect sync, determine the synchronization requirements of your organization.
For example, what to synchronize, from which domains, and how frequently. For more information, see Determine
directory synchronization requirements.
You can run the Azure AD Connect sync service on a VM or a computer hosted on-premises. Depending on the
volatility of the information in your Active Directory directory, the load on the Azure AD Connect sync service is
unlikely to be high after the initial synchronization with Azure AD. Running the service on a VM makes it easier to
scale the server if needed. Monitor the activity on the VM as described in the Monitoring considerations section to
determine whether scaling is necessary.
If you have multiple on-premises domains in a forest, we recommend storing and synchronizing information for
the entire forest to a single Azure AD tenant. Filter information for identities that occur in more than one domain,
so that each identity appears only once in Azure AD, rather than being duplicated. Duplication can lead to
inconsistencies when data is synchronized. For more information, see the Topology section below.
Use filtering so that only necessary data is stored in Azure AD. For example, your organization might not want to
store information about inactive accounts in Azure AD. Filtering can be group-based, domain-based, organization
unit (OU)-based, or attribute-based. You can combine filters to generate more complex rules. For example, you
could synchronize objects held in a domain that have a specific value in a selected attribute. For detailed
information, see Azure AD Connect sync: Configure Filtering.
To implement high availability for the AD Connect sync service, run a secondary staging server. For more
information, see the Topology recommendations section.
Security recommendations
User password management. The Azure AD Premium editions support password writeback, enabling your on-
premises users to perform self-service password resets from within the Azure portal. This feature should be
enabled only after reviewing your organization's password security policy. For example, you can restrict which
users can change their passwords, and you can tailor the password management experience. For more
information, see Customizing Password Management to fit your organization's needs.
Protect on-premises applications that can be accessed externally. Use the Azure AD Application Proxy to
provide controlled access to on-premises web applications for external users through Azure AD. Only users that
have valid credentials in your Azure directory have permission to use the application. For more information, see
the article Enable Application Proxy in the Azure portal.
Actively monitor Azure AD for signs of suspicious activity. Consider using Azure AD Premium P2 edition,
which includes Azure AD Identity Protection. Identity Protection uses adaptive machine learning algorithms and
heuristics to detect anomalies and risk events that may indicate that an identity has been compromised. For
example, it can detect potentially unusual activity such as irregular sign-in activities, sign-ins from unknown
sources or from IP addresses with suspicious activity, or sign-ins from devices that may be infected. Using this data,
Identity Protection generates reports and alerts that enables you to investigate these risk events and take
appropriate action. For more information, see Azure Active Directory Identity Protection.
You can use the reporting feature of Azure AD in the Azure portal to monitor security-related activities occurring in
your system. For more information about using these reports, see Azure Active Directory Reporting Guide.
Topology recommendations
Configure Azure AD Connect to implement a topology that most closely matches the requirements of your
organization. Topologies that Azure AD Connect supports include:
Single forest, single Azure AD directory. In this topology, Azure AD Connect synchronizes objects and
identity information from one or more domains in a single on-premises forest into a single Azure AD tenant.
This is the default topology implemented by the express installation of Azure AD Connect.

NOTE
Don't use multiple Azure AD Connect sync servers to connect different domains in the same on-premises forest to
the same Azure AD tenant, unless you are running a server in staging mode, described below.

Multiple forests, single Azure AD directory. In this topology, Azure AD Connect synchronizes objects
and identity information from multiple forests into a single Azure AD tenant. Use this topology if your
organization has more than one on-premises forest. You can consolidate identity information so that each
unique user is represented once in the Azure AD directory, even if the same user exists in more than one
forest. All forests use the same Azure AD Connect sync server. The Azure AD Connect sync server does not
have to be part of any domain, but it must be reachable from all forests.

NOTE
In this topology, don't use separate Azure AD Connect sync servers to connect each on-premises forest to a single
Azure AD tenant. This can result in duplicated identity information in Azure AD if users are present in more than one
forest.

Multiple forests, separate topologies. This topology merges identity information from separate forests
into a single Azure AD tenant, treating all forests as separate entities. This topology is useful if you are
combining forests from different organizations and the identity information for each user is held in only one
forest.
NOTE
If the global address lists (GAL) in each forest are synchronized, a user in one forest may be present in another as a
contact. This can occur if your organization has implemented GALSync with Forefront Identity Manager 2010 or
Microsoft Identity Manager 2016. In this scenario, you can specify that users should be identified by their Mail
attribute. You can also match identities using the ObjectSID and msExchMasterAccountSID attributes. This is useful if
you have one or more resource forests with disabled accounts.

Staging server. In this configuration, you run a second instance of the Azure AD Connect sync server in
parallel with the first. This structure supports scenarios such as:
High availability.
Testing and deploying a new configuration of the Azure AD Connect sync server.
Introducing a new server and decommissioning an old configuration.
In these scenarios, the second instance runs in staging mode. The server records imported objects
and synchronization data in its database, but does not pass the data to Azure AD. If you disable
staging mode, the server starts writing data to Azure AD, and also starts performing password write-
backs into the on-premises directories where appropriate. For more information, see Azure AD
Connect sync: Operational tasks and considerations.
Multiple Azure AD directories. It is recommended that you create a single Azure AD directory for an
organization, but there may be situations where you need to partition information across separate Azure AD
directories. In this case, avoid synchronization and password write-back issues by ensuring that each object
from the on-premises forest appears in only one Azure AD directory. To implement this scenario, configure
separate Azure AD Connect sync servers for each Azure AD directory, and use filtering so that each Azure
AD Connect sync server operates on a mutually exclusive set of objects.
For more information about these topologies, see Topologies for Azure AD Connect.
User authentication
By default, the Azure AD Connect sync server configures password hash synchronization between the on-premises
domain and Azure AD, and the Azure AD service assumes that users authenticate by providing the same password
that they use on-premises. For many organizations, this is appropriate, but you should consider your organization's
existing policies and infrastructure. For example:
The security policy of your organization may prohibit synchronizing password hashes to the cloud. In this case,
your organization should consider pass-through authentication.
You might require that users experience seamless single sign-on (SSO) when accessing cloud resources from
domain-joined machines on the corporate network.
Your organization might already have Active Directory Federation Services (AD FS) or a third-party federation
provider deployed. You can configure Azure AD to use this infrastructure to implement authentication and SSO
rather than by using password information held in the cloud.
For more information, see Azure AD Connect User Sign-on options.
Azure AD application proxy
Use Azure AD to provide access to on-premises applications.
Expose your on-premises web applications using application proxy connectors managed by the Azure AD
application proxy component. The application proxy connector opens an outbound network connection to the
Azure AD application proxy, and remote users' requests are routed back from Azure AD through this connection to
the web apps. This removes the need to open inbound ports in the on-premises firewall and reduces the attack
surface exposed by your organization.
For more information, see Publish applications using Azure AD Application proxy.
Object synchronization
The default configuration for Azure AD Connect synchronizes objects from your local Active Directory directory
based on the rules specified in the article Azure AD Connect sync: Understanding the default configuration. Objects
that satisfy these rules are synchronized while all other objects are ignored. Some example rules:
User objects must have a unique sourceAnchor attribute and the accountEnabled attribute must be populated.
User objects must have a sAMAccountName attribute and cannot start with the text Azure AD_ or MSOL_.
Azure AD Connect applies several rules to User, Contact, Group, ForeignSecurityPrincipal, and Computer objects.
Use the Synchronization Rules Editor installed with Azure AD Connect if you need to modify the default set of rules.
For more information, see Azure AD Connect sync: Understanding the default configuration.
You can also define your own filters to limit the objects to be synchronized by domain or OU. Alternatively, you can
implement more complex custom filtering such as that described in Azure AD Connect sync: Configure Filtering.
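When you are working on the sync server itself, the ADSync PowerShell module installed with Azure AD Connect can help you inspect the rule set and trigger synchronization on demand. A brief sketch, run on the Azure AD Connect server (cmdlet availability can vary by installed version):

# Minimal sketch, run on the Azure AD Connect sync server.
Import-Module ADSync

# List the synchronization rules that make up the current configuration.
Get-ADSyncRule | Select-Object Name, Direction, Precedence, Disabled | Sort-Object Precedence

# Check the scheduler settings, including whether the server is in staging mode.
Get-ADSyncScheduler

# Trigger an on-demand delta synchronization cycle.
Start-ADSyncSyncCycle -PolicyType Delta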
Monitoring
Health monitoring is performed by the following agents installed on-premises:
Azure AD Connect installs an agent that captures information about synchronization operations. Use the Azure
AD Connect Health blade in the Azure portal to monitor its health and performance. For more information, see
Using Azure AD Connect Health for sync.
To monitor the health of the AD DS domains and directories from Azure, install the Azure AD Connect Health for
AD DS agent on a machine within the on-premises domain. Use the Azure Active Directory Connect Health
blade in the Azure portal for health monitoring. For more information, see Using Azure AD Connect Health with
AD DS.
Install the Azure AD Connect Health for AD FS agent to monitor the health of services running on-premises,
and use the Azure Active Directory Connect Health blade in the Azure portal to monitor AD FS. For more
information, see Using Azure AD Connect Health with AD FS.
For more information on installing the AD Connect Health agents and their requirements, see Azure AD Connect
Health Agent Installation.

Scalability considerations
The Azure AD service supports scalability based on replicas, with a single primary replica that handles write
operations plus multiple read-only secondary replicas. Azure AD transparently redirects attempted writes made
against secondary replicas to the primary replica and provides eventual consistency. All changes made to the
primary replica are propagated to the secondary replicas. This architecture scales well because most operations
against Azure AD are reads rather than writes. For more information, see Azure AD: Under the hood of our geo-
redundant, highly available, distributed cloud directory.
For the Azure AD Connect sync server, determine how many objects you are likely to synchronize from your local
directory. If you have less than 100,000 objects, you can use the default SQL Server Express LocalDB software
provided with Azure AD Connect. If you have a larger number of objects, you should install a production version of
SQL Server and perform a custom installation of Azure AD Connect, specifying that it should use an existing
instance of SQL Server.

Availability considerations
The Azure AD service is geo-distributed and runs in multiple datacenters spread around the world with automated
failover. If a datacenter becomes unavailable, Azure AD ensures that your directory data is available for instant
access in at least two more regionally dispersed datacenters.
NOTE
The service level agreement (SLA) for the Microsoft 365 Apps and Premium editions of Azure AD guarantees at least 99.9%
availability. There is no SLA for the Free tier of Azure AD. For more information, see SLA for Azure Active Directory.

Consider provisioning a second instance of Azure AD Connect sync server in staging mode to increase availability,
as discussed in the topology recommendations section.
If you are not using the SQL Server Express LocalDB instance that comes with Azure AD Connect, consider using
SQL clustering to achieve high availability. Solutions such as mirroring and Always On are not supported by Azure
AD Connect.
For additional considerations about achieving high availability of the Azure AD Connect sync server and also how
to recover after a failure, see Azure AD Connect sync: Operational tasks and considerations - Disaster Recovery.

Manageability considerations
There are two aspects to managing Azure AD:
Administering Azure AD in the cloud.
Maintaining the Azure AD Connect sync servers.
Azure AD provides the following options for managing domains and directories in the cloud:
Azure Active Directory PowerShell Module. Use this module if you need to script common Azure AD
administrative tasks such as user management, domain management, and configuring single sign-on; a brief
scripting example follows this list.
Azure AD management blade in the Azure portal. This blade provides an interactive management view of
the directory, and enables you to control and configure most aspects of Azure AD.
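As an illustration of scripting a common administrative task with the Azure Active Directory PowerShell module (the AzureAD module is shown here; the newer Microsoft Graph PowerShell module is an alternative). The user name, domain, and password below are placeholders:

# Minimal sketch, assuming the AzureAD module is installed (Install-Module AzureAD).
Connect-AzureAD

# Create a cloud-only user.
$passwordProfile = New-Object -TypeName Microsoft.Open.AzureAD.Model.PasswordProfile
$passwordProfile.Password = 'P@ssw0rd-Placeholder1'

New-AzureADUser -DisplayName 'Test User' `
    -UserPrincipalName 'testuser@contoso.com' `
    -MailNickName 'testuser' `
    -AccountEnabled $true `
    -PasswordProfile $passwordProfile

# List users synchronized from on-premises (DirSyncEnabled is set on synced objects).
Get-AzureADUser -All $true | Where-Object { $_.DirSyncEnabled } | Select-Object DisplayName, UserPrincipalName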
Azure AD Connect installs the following tools to maintain Azure AD Connect sync services from your on-premises
machines:
Microsoft Azure Active Directory Connect console. This tool enables you to modify the configuration of
the Azure AD Sync server, customize how synchronization occurs, enable or disable staging mode, and switch
the user sign-in mode. You can enable Active Directory Federation Services (AD FS) sign-in using your on-premises infrastructure.
Synchronization Service Manager. Use the Operations tab in this tool to manage the synchronization
process and detect whether any parts of the process have failed. You can trigger synchronizations manually
using this tool. The Connectors tab enables you to control the connections for the domains that the
synchronization engine is attached to.
Synchronization Rules Editor. Use this tool to customize the way objects are transformed when they are
copied between an on-premises directory and Azure AD. This tool enables you to specify additional attributes
and objects for synchronization, then executes filters to determine which objects should or should not be
synchronized. For more information, see the Synchronization Rule Editor section in the document Azure AD
Connect sync: Understanding the default configuration.
For more information and tips for managing Azure AD Connect, see Azure AD Connect sync: Best practices for
changing the default configuration.

Security considerations
Use conditional access control to deny authentication requests from unexpected sources:
Trigger Azure Multi-Factor Authentication (MFA) if a user attempts to connect from an untrusted location such
as across the Internet instead of a trusted network.
Use the device platform type of the user (iOS, Android, Windows Mobile, Windows) to determine access
policy to applications and features.
Record the enabled/disabled state of users' devices, and incorporate this information into the access policy
checks. For example, if a user's phone is lost or stolen it should be recorded as disabled to prevent it from
being used to gain access.
Control user access to resources based on group membership. Use Azure AD dynamic membership rules to
simplify group administration. For a brief overview of how this works, see Introduction to Dynamic
Memberships for Groups.
Use conditional access risk policies with Azure AD Identity Protection to provide advanced protection based
on unusual sign-in activities or other events.
For more information, see Azure Active Directory conditional access.

DevOps considerations
For DevOps considerations, see DevOps: Extending Active Directory Domain Services (AD DS) to Azure.

Cost considerations
Use the Azure pricing calculator to estimate costs. Other considerations are described in the Cost section in
Microsoft Azure Well-Architected Framework.
Here are cost considerations for the services used in this architecture.
Azure AD Connect
For information about the editions offered by Azure Active Directory, see Azure AD pricing. The AD Connect sync
feature is available in all editions.
VMs for N-tier application
The deployment includes infrastructure for an N-tier application. For cost information about these resources, see
Run VMs for an N-tier architecture.

Deploy the solution


A deployment for a reference architecture that implements these recommendations and considerations is available
on GitHub. This reference architecture deploys a simulated on-premises network in Azure that you can use to test
and experiment. To deploy the solution, see the readme on GitHub.
Choose a solution for integrating on-premises Active
Directory with Azure
11/2/2020 • 4 minutes to read

This article compares options for integrating your on-premises Active Directory (AD) environment with an Azure
network. For each option, a more detailed reference architecture is available.
Many organizations use Active Directory Domain Services (AD DS) to authenticate identities associated with
users, computers, applications, or other resources that are included in a security boundary. Directory and identity
services are typically hosted on-premises, but if your application is hosted partly on-premises and partly in Azure,
there may be latency sending authentication requests from Azure back to on-premises. Implementing directory
and identity services in Azure can reduce this latency.
Azure provides two solutions for implementing directory and identity services in Azure:
Use Azure AD to create an Active Directory domain in the cloud and connect it to your on-premises Active
Directory domain. Azure AD Connect integrates your on-premises directories with Azure AD.
Extend your existing on-premises Active Directory infrastructure to Azure, by deploying a VM in Azure that
runs AD DS as a Domain Controller. This architecture is more common when the on-premises network and
the Azure virtual network (VNet) are connected by a VPN or ExpressRoute connection. Several variations of
this architecture are possible:
Create a domain in Azure and join it to your on-premises AD forest.
Create a separate forest in Azure that is trusted by domains in your on-premises forest.
Replicate an Active Directory Federation Services (AD FS) deployment to Azure.
The next sections describe each of these options in more detail.

Integrate your on-premises domains with Azure AD


Use Azure Active Directory (Azure AD) to create a domain in Azure and link it to an on-premises AD domain.
The Azure AD directory is not an extension of an on-premises directory. Rather, it's a copy that contains the same
objects and identities. Changes made to these items on-premises are copied to Azure AD, but changes made in
Azure AD are not replicated back to the on-premises domain.
You can also use Azure AD without using an on-premises directory. In this case, Azure AD acts as the primary
source of all identity information, rather than containing data replicated from an on-premises directory.
Benefits
You don't need to maintain an AD infrastructure in the cloud. Azure AD is entirely managed and maintained by
Microsoft.
Azure AD provides the same identity information that is available on-premises.
Authentication can happen in Azure, reducing the need for external applications and users to contact the on-
premises domain.
Challenges
You must configure connectivity with your on-premises domain to keep the Azure AD directory synchronized.
Applications may need to be rewritten to enable authentication through Azure AD.
If you wish to authenticate service and computer accounts, you will have to also deploy Azure Active Directory
Domain Services.
Reference architecture
Integrate on-premises Active Directory domains with Azure Active Directory

AD DS in Azure joined to an on-premises forest


Deploy AD Domain Services (AD DS) servers to Azure. Create a domain in Azure and join it to your on-premises
AD forest.
Consider this option if you need to use AD DS features that are not currently implemented by Azure AD.
Benefits
Provides access to the same identity information that is available on-premises.
You can authenticate user, service, and computer accounts on-premises and in Azure.
You don't need to manage a separate AD forest. The domain in Azure can belong to the on-premises forest.
You can apply group policy defined by on-premises Group Policy Objects to the domain in Azure.
Challenges
You must deploy and manage your own AD DS servers and domain in the cloud.
There may be some synchronization latency between the domain servers in the cloud and the servers running
on-premises.
Reference architecture
Extend Active Directory Domain Services (AD DS) to Azure

AD DS in Azure with a separate forest


Deploy AD Domain Services (AD DS) servers to Azure, but create an Active Directory forest that is separate
from the on-premises forest. This forest is trusted by domains in your on-premises forest.
Typical uses for this architecture include maintaining security separation for objects and identities held in the
cloud, and migrating individual domains from on-premises to the cloud.
Benefits
You can implement on-premises identities and separate Azure-only identities.
You don't need to replicate from the on-premises AD forest to Azure.
Challenges
Authentication within Azure for on-premises identities requires extra network hops to the on-premises AD
servers.
You must deploy your own AD DS servers and forest in the cloud, and establish the appropriate trust
relationships between forests.
Reference architecture
Create an Active Directory Domain Services (AD DS) resource forest in Azure

Extend AD FS to Azure
Replicate an Active Directory Federation Services (AD FS) deployment to Azure, to perform federated
authentication and authorization for components running in Azure.
Typical uses for this architecture:
Authenticate and authorize users from partner organizations.
Allow users to authenticate from web browsers running outside of the organizational firewall.
Allow users to connect from authorized external devices such as mobile devices.
Benefits
You can leverage claims-aware applications.
Provides the ability to trust external partners for authentication.
Compatibility with a large set of authentication protocols.
Challenges
You must deploy your own AD DS, AD FS, and AD FS Web Application Proxy servers in Azure.
This architecture can be complex to configure.
Reference architecture
Extend Active Directory Federation Services (AD FS) to Azure
Basic enterprise integration on Azure
11/2/2020 • 10 minutes to read

This reference architecture uses Azure Integration Services to orchestrate calls to enterprise backend systems. The
backend systems may include software as a service (SaaS) systems, Azure services, and existing web services in
your enterprise.
Azure Integration Services is a collection of services for integrating applications and data. This architecture uses
two of those services: Logic Apps to orchestrate workflows, and API Management to create catalogs of APIs. This
architecture is sufficient for basic integration scenarios where the workflow is triggered by synchronous calls to
backend services. A more sophisticated architecture using queues and events builds on this basic architecture.

Architecture
The architecture has the following components:
Backend systems. The right-hand side of the diagram shows the various backend systems that the
enterprise has deployed or relies on. These might include SaaS systems, other Azure services, or web
services that expose REST or SOAP endpoints.
Azure Logic Apps. Logic Apps is a serverless platform for building enterprise workflows that integrate
applications, data, and services. In this architecture, the logic apps are triggered by HTTP requests. You can
also nest workflows for more complex orchestration. Logic Apps uses connectors to integrate with
commonly used services. Logic Apps offers hundreds of connectors, and you can create custom connectors.
Azure API Management. API Management is a managed service for publishing catalogs of HTTP APIs, to
promote reuse and discoverability. API Management consists of two related components:
API gateway. The API gateway accepts HTTP calls and routes them to the backend.
Developer portal. Each instance of Azure API Management provides access to a developer portal.
This portal gives your developers access to documentation and code samples for calling the APIs. You
can also test APIs in the developer portal.
In this architecture, composite APIs are built by importing logic apps as APIs. You can also import existing
web services by importing OpenAPI (Swagger) specifications or importing SOAP APIs from WSDL
specifications.
The API gateway helps to decouple front-end clients from the back end. For example, it can rewrite URLs, or
transform requests before they reach the backend. It also handles many cross-cutting concerns such as
authentication, cross-origin resource sharing (CORS) support, and response caching.
Azure DNS. Azure DNS is a hosting service for DNS domains. Azure DNS provides name resolution by
using the Microsoft Azure infrastructure. By hosting your domains in Azure, you can manage your DNS
records by using the same credentials, APIs, tools, and billing that you use for your other Azure services. To
use a custom domain name, such as contoso.com, create DNS records that map the custom domain name
to the IP address. For more information, see Configure a custom domain name in API Management.
Azure Active Directory (Azure AD). Use Azure AD to authenticate clients that call the API gateway. Azure
AD supports the OpenID Connect (OIDC) protocol. Clients obtain an access token from Azure AD, and API
Gateway validates the token to authorize the request. When using the Standard or Premium tier of API
Management, Azure AD can also secure access to the developer portal.

Recommendations
Your specific requirements might differ from the generic architecture shown here. Use the recommendations in
this section as a starting point.
API Management
Use the API Management Basic, Standard, or Premium tiers. These tiers offer a production service level agreement
(SLA) and support scale out within the Azure region. Throughput capacity for API Management is measured in
units. Each pricing tier has a maximum scale-out. The Premium tier also supports scale out across multiple Azure
regions. Choose your tier based on your feature set and the level of required throughput. For more information,
see API Management pricing and Capacity of an Azure API Management instance.
Each Azure API Management instance has a default domain name, which is a subdomain of azure-api.net , for
example, contoso.azure-api.net . Consider configuring a custom domain for your organization.
Logic Apps
Logic Apps works best in scenarios that don't require low latency for a response, such as asynchronous or semi
long-running API calls. If low latency is required, for example in a call that blocks a user interface, use a different
technology. For example, use Azure Functions or a web API deployed to Azure App Service. Use API Management
to front the API to your API consumers.
Region
To minimize network latency, put API Management and Logic Apps in the same region. In general, choose the
region that's closest to your users (or closest to your backend services).
The resource group also has a region. This region specifies where to store deployment metadata and where to
execute the deployment template. To improve availability during deployment, put the resource group and
resources in the same region.

Scalability considerations
To increase the scalability of API Management, add caching policies where appropriate. Caching also helps reduce
the load on back-end services.
To offer greater capacity, you can scale out Azure API Management Basic, Standard, and Premium tiers in an Azure
region. To analyze the usage for your service, on the Metrics menu, select the Capacity Metric option and then
scale up or scale down as appropriate. The upgrade or scale process can take from 15 to 45 minutes to apply.
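A minimal Az PowerShell sketch of changing the unit count of an existing API Management instance; the service and resource group names are placeholders:

# Minimal sketch, assuming the Az.ApiManagement module is installed.
$apim = Get-AzApiManagement -ResourceGroupName 'integration-rg' -Name 'contoso-apim'

# Scale out to two units in the primary region; the update can take 15 to 45 minutes to apply.
$apim.Capacity = 2
Set-AzApiManagement -InputObject $apim -PassThru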
Recommendations for scaling an API Management service:
Consider traffic patterns when scaling. Customers with more volatile traffic patterns need more capacity.
Consistent capacity that's greater than 66% might indicate a need to scale up.
Consistent capacity that's under 20% might indicate an opportunity to scale down.
Before you enable the load in production, always load-test your API Management service with a
representative load.
With the Premium tier, you can scale an API Management instance across multiple Azure regions. This makes API
Management eligible for a higher SLA, and lets you provision services near users in multiple regions.
The Logic Apps serverless model means administrators don't have to plan for service scalability. The service
automatically scales to meet demand.

Availability considerations
Review the SLA for each service:
API Management SLA
Logic Apps SLA
If you deploy API Management across two or more regions with Premium tier, it is eligible for a higher SLA. See
API Management pricing.
Backups
Regularly back up your API Management configuration. Store your backup files in a location or Azure region that
differs from the region where the service is deployed. Based on your RTO, choose a disaster recovery strategy:
In a disaster recovery event, provision a new API Management instance, restore the backup to the new
instance, and repoint the DNS records.
Keep a passive instance of the API Management service in another Azure region. Regularly restore backups
to that instance, to keep it in sync with the active service. To restore the service during a disaster recovery
event, you need only repoint the DNS records. This approach incurs additional cost because you pay for the
passive instance, but reduces the time to recover.
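The backup and restore steps described above can be scripted with Az PowerShell. A sketch, assuming a storage account in a different region holds the backup blobs; the storage account, container, and service names are placeholders:

# Minimal sketch, assuming the Az.ApiManagement and Az.Storage modules are installed.
$storageKey = (Get-AzStorageAccountKey -ResourceGroupName 'dr-rg' -Name 'apimbackupsa')[0].Value
$context = New-AzStorageContext -StorageAccountName 'apimbackupsa' -StorageAccountKey $storageKey

# Back up the API Management configuration to blob storage.
Backup-AzApiManagement -ResourceGroupName 'integration-rg' -Name 'contoso-apim' `
    -StorageContext $context -TargetContainerName 'apim-backups' -TargetBlobName 'contoso-apim.bak'

# In a disaster recovery event, restore the backup into the standby instance.
Restore-AzApiManagement -ResourceGroupName 'dr-rg' -Name 'contoso-apim-dr' `
    -StorageContext $context -SourceContainerName 'apim-backups' -SourceBlobName 'contoso-apim.bak'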
For logic apps, we recommend a configuration-as-code approach to backing up and restoring. Because logic apps
are serverless, you can quickly recreate them from Azure Resource Manager templates. Save the templates in
source control, and integrate the templates with your continuous integration/continuous deployment (CI/CD) process.
In a disaster recovery event, deploy the template to a new region.
If you deploy a logic app to a different region, update the configuration in API Management. You can update the
API's Backend property by using a basic PowerShell script.

DevOps considerations
Create separate resource groups for production, development, and test environments. Separate resource groups
make it easier to manage deployments, delete test deployments, and assign access rights.
When you assign resources to resource groups, consider these factors:
Lifecycle. In general, put resources that have the same lifecycle in the same resource group.
Access. To apply access policies to the resources in a group, you can use role-based access control (RBAC).
Billing. You can view rollup costs for the resource group.
Pricing tier for API Management. Use the Developer tier for development and test environments. To
minimize costs during preproduction, deploy a replica of your production environment, run your tests, and
then shut down.
Deployment
Use Azure Resource Manager templates to deploy the Azure resources, following the infrastructure as code (IaC)
process. Templates make it easier to automate deployments by using Azure DevOps Services or other CI/CD
solutions.
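For illustration, a minimal Az PowerShell sketch of deploying such a template into an environment-specific resource group; the resource group, template, and parameter file names are placeholders:

# Minimal sketch, assuming the Az.Resources module is installed.
New-AzResourceGroup -Name 'integration-dev-rg' -Location 'eastus2' -Force

New-AzResourceGroupDeployment -ResourceGroupName 'integration-dev-rg' `
    -TemplateFile '.\integration.json' `
    -TemplateParameterFile '.\integration.dev.parameters.json' `
    -Mode Incremental
# New-AzResourceGroupDeployment also supports -RollbackToLastDeployment, the Az PowerShell
# counterpart of the CLI --rollback-on-error flag mentioned under Staging below.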
Staging
Consider staging your workloads, which means deploying to various stages and running validations at each stage
before moving on to the next one. That way, you can push updates to your production environments in a highly
controlled way and minimize unanticipated deployment issues. Blue-green deployment and canary releases are
recommended deployment strategies for updating live production environments. Also consider having a good
rollback strategy for when a deployment fails. For example, you could automatically redeploy an earlier, successful
deployment from your deployment history; the --rollback-on-error flag in the Azure CLI is a good example.
Workload isolation
Put API Management and any individual logic apps in their own separate Resource Manager templates. By using
separate templates, you can store the resources in source control systems. You can deploy the templates together
or individually as part of a CI/CD process.
Versions
Each time you change a logic app's configuration or deploy an update through a Resource Manager template,
Azure keeps a copy of that version and keeps all versions that have a run history. You can use these versions to
track historical changes or promote a version as the logic app's current configuration. For example, you can roll
back a logic app to a previous version.
API Management supports two distinct but complementary versioning concepts:
Versions allow API consumers to choose an API version based on their needs, for example, v1, v2, beta, or
production.
Revisions allow API administrators to make non-breaking changes in an API and deploy those changes,
along with a change log to inform API consumers about the changes.
You can make a revision in a development environment and deploy that change in other environments by using
Resource Manager templates. For more information, see Publish multiple versions of your API.
You can also use revisions to test an API before making the changes current and accessible to users. However, this
method isn't recommended for load testing or integration testing. Use separate test or preproduction
environments instead.

Diagnostics and monitoring


Use Azure Monitor for operational monitoring in both API Management and Logic Apps. Azure Monitor provides
information based on the metrics configured for each service and is enabled by default. For more information, see:
Monitor published APIs
Monitor status, set up diagnostics logging, and turn on alerts for Azure Logic Apps
Each service also has these options:
For deeper analysis and dashboarding, send Logic Apps logs to Azure Log Analytics.
For DevOps monitoring, configure Azure Application Insights for API Management.
API Management supports the Power BI solution template for custom API analytics. You can use this
solution template for creating your own analytics solution. For business users, Power BI makes reports
available.

Security considerations
Although this list doesn't completely describe all security best practices, here are some security considerations that
apply specifically to this architecture:
The Azure API Management service has a fixed public IP address. Restrict access for calling Logic Apps
endpoints to only the IP address of API Management. For more information, see Restrict inbound IP
addresses.
To make sure users have appropriate access levels, use role-based access control (RBAC).
Secure public API endpoints in API Management by using OAuth or OpenID Connect. To secure public API
endpoints, configure an identity provider, and add a JSON Web Token (JWT) validation policy. For more
information, see Protect an API by using OAuth 2.0 with Azure Active Directory and API Management.
Connect to back-end services from API Management by using mutual certificates.
Enforce HTTPS on the API Management APIs.
Storing secrets
Never check passwords, access keys, or connection strings into source control. If these values are required, secure
and deploy these values by using the appropriate techniques.
If a logic app requires any sensitive values that you can't create within a connector, store those values in Azure Key
Vault and reference them from a Resource Manager template. Use deployment template parameters and
parameter files for each environment. For more information, see Secure parameters and inputs within a workflow.
API Management manages secrets by using objects called named values or properties. These objects securely store
values that you can access through API Management policies. For more information, see How to use Named Values
in Azure API Management policies.
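A brief sketch of storing a sensitive value in Key Vault with Az PowerShell; the vault and secret names are placeholders. (API Management named values are typically created through the portal, Resource Manager templates, or the Az.ApiManagement cmdlets.)

# Minimal sketch, assuming the Az.KeyVault module is installed and the vault already exists.
$secretValue = Read-Host -Prompt 'Backend API key' -AsSecureString

Set-AzKeyVaultSecret -VaultName 'integration-kv' -Name 'backend-api-key' -SecretValue $secretValue

# Retrieve the secret later, for example from a deployment script.
$secret = Get-AzKeyVaultSecret -VaultName 'integration-kv' -Name 'backend-api-key'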

Cost considerations
In general, use the Azure pricing calculator to estimate costs. Here are some other considerations.
API Management
You are charged for all API Management instances when they are running. If you have scaled up and don't need
that level of performance all the time, manually scale down or configure autoscaling.
Logic Apps
Logic Apps uses a serverless model. Billing is calculated based on action and connector execution. For more
information, see Logic Apps pricing.
For more information, see the cost section in Microsoft Azure Well-Architected Framework.

Next steps
For greater reliability and scalability, use message queues and events to decouple the backend systems. This
pattern is shown in the next reference architecture in this series: Enterprise integration using message queues and
events.
Enterprise BI in Azure with Azure Synapse Analytics
11/2/2020 • 14 minutes to read

This reference architecture implements an extract, load, and transform (ELT) pipeline that moves data from an on-
premises SQL Server database into Azure Synapse and transforms the data for analysis.

A reference implementation for this architecture is available on GitHub.

Scenario: An organization has a large OLTP data set stored in a SQL Server database on premises. The
organization wants to use Azure Synapse to perform analysis using Power BI.
This reference architecture is designed for one-time or on-demand jobs. If you need to move data on a continuing
basis (hourly or daily), we recommend using Azure Data Factory to define an automated workflow. For a
reference architecture that uses Data Factory, see Automated enterprise BI with Azure Synapse and Azure Data
Factory.

Architecture
The architecture consists of the following components.
Data source
SQL Server. The source data is located in a SQL Server database on premises. To simulate the on-premises
environment, the deployment scripts for this architecture provision a VM in Azure with SQL Server installed. The
Wide World Importers OLTP sample database is used as the source data.
Ingestion and data storage
Blob Storage. Blob storage is used as a staging area to copy the data before loading it into Azure Synapse.
Azure Synapse. Azure Synapse is a distributed system designed to perform analytics on large volumes of data. It supports
massive parallel processing (MPP), which makes it suitable for running high-performance analytics.
Analysis and reporting
Azure Analysis Services. Analysis Services is a fully managed service that provides data modeling capabilities.
Use Analysis Services to create a semantic model that users can query. Analysis Services is especially useful in a
BI dashboard scenario. In this architecture, Analysis Services reads data from the data warehouse to process the
semantic model, and efficiently serves dashboard queries. It also supports elastic concurrency, by scaling out
replicas for faster query processing.
Currently, Azure Analysis Services supports tabular models but not multidimensional models. Tabular models use
relational modeling constructs (tables and columns), whereas multidimensional models use OLAP modeling
constructs (cubes, dimensions, and measures). If you require multidimensional models, use SQL Server Analysis
Services (SSAS). For more information, see Comparing tabular and multidimensional solutions.
Power BI. Power BI is a suite of business analytics tools to analyze data for business insights. In this architecture,
it queries the semantic model stored in Analysis Services.
Authentication
Azure Active Directory (Azure AD) authenticates users who connect to the Analysis Services server through
Power BI.

Data pipeline
This reference architecture uses the WorldWideImporters sample database as a data source. The data pipeline has
the following stages:
1. Export the data from SQL Server to flat files (bcp utility).
2. Copy the flat files to Azure Blob Storage (AzCopy).
3. Load the data into Azure Synapse (PolyBase).
4. Transform the data into a star schema (T-SQL).
5. Load a semantic model into Analysis Services (SQL Server Data Tools).

NOTE
For steps 1 – 3, consider using Redgate Data Platform Studio. Data Platform Studio applies the most appropriate
compatibility fixes and optimizations, so it's the quickest way to get started with Azure Synapse. For more information, see
Load data with Redgate Data Platform Studio.

The next sections describe these stages in more detail.


Export data from SQL Server
The bcp (bulk copy program) utility is a fast way to create flat text files from SQL tables. In this step, you select the
columns that you want to export, but don't transform the data. Any data transformations should happen in Azure
Synapse.
Recommendations:
If possible, schedule data extraction during off-peak hours, to minimize resource contention in the production
environment.
Avoid running bcp on the database server. Instead, run it from another machine. Write the files to a local drive.
Ensure that you have sufficient I/O resources to handle the concurrent writes. For best performance, export the
files to dedicated fast storage drives.
You can speed up the network transfer by saving the exported data in Gzip compressed format. However, loading
compressed files into the warehouse is slower than loading uncompressed files, so there is a tradeoff between
faster network transfer versus faster loading. If you decide to use Gzip compression, don't create a single Gzip file.
Instead, split the data into multiple compressed files.
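For illustration, here is what an export command might look like when run from PowerShell on the extraction machine. The table, output path, and server names are placeholders; the reference implementation drives bcp with its own parameters.

# Minimal sketch: export one table to a pipe-delimited flat file using bcp.
# -S source server, -T Windows authentication, -c character mode, -t field terminator.
bcp "WideWorldImporters.Sales.Orders" out "D:\export\Sales_Orders.txt" `
    -S "sql-onprem-vm" -T -c -t "|"

Repeat the command for each table that you want to export; for very large tables, consider splitting the export into multiple files.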
Copy flat files into blob storage
The AzCopy utility is designed for high-performance copying of data into Azure blob storage.
Recommendations:
Create the storage account in a region near the location of the source data. Deploy the storage account and the
Azure Synapse instance in the same region.
Don't run AzCopy on the same machine that runs your production workloads, because the CPU and I/O
consumption can interfere with the production workload.
Test the upload first to see what the upload speed is like. You can use the /NC option in AzCopy to specify the
number of concurrent copy operations. Start with the default value, then experiment with this setting to tune the
performance. In a low-bandwidth environment, too many concurrent operations can overwhelm the network
connection and prevent the operations from completing successfully.
AzCopy moves data to storage over the public internet. If this isn't fast enough, consider setting up an
ExpressRoute circuit. ExpressRoute is a service that routes your data through a dedicated private connection to
Azure. Another option, if your network connection is too slow, is to physically ship the data on disk to an Azure
datacenter. For more information, see Transferring data to and from Azure.
During a copy operation, AzCopy creates a temporary journal file, which enables AzCopy to restart the operation
if it gets interrupted (for example, due to a network error). Make sure there is enough disk space to store the
journal files. You can use the /Z option to specify where the journal files are written.
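A sketch of the corresponding AzCopy (v8-style) command, matching the /NC and /Z options mentioned above. The storage account, container, and key are placeholders; newer AzCopy (v10) uses a different syntax (azcopy copy with a SAS token).

# Minimal sketch: upload the exported flat files to a blob container.
AzCopy /Source:"D:\export" /Dest:"https://<storage-account>.blob.core.windows.net/staging" `
    /DestKey:"<storage-account-key>" /Pattern:"*.txt" /NC:4 /Z:"D:\azcopy-journal"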
Load data into Azure Synapse
Use PolyBase to load the files from blob storage into the data warehouse. PolyBase is designed to leverage the
MPP (Massively Parallel Processing) architecture of Azure Synapse, which makes it the fastest way to load data
into Azure Synapse.
Loading the data is a two-step process:
1. Create a set of external tables for the data. An external table is a table definition that points to data stored
outside of the warehouse — in this case, the flat files in blob storage. This step does not move any data into
the warehouse.
2. Create staging tables, and load the data into the staging tables. This step copies the data into the warehouse.
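A condensed sketch of these two steps, expressed as T-SQL submitted from PowerShell with Invoke-Sqlcmd (SqlServer module). The external data source, file format, schema, table, and credential names are placeholders and are not taken from the reference implementation.

# Minimal sketch, assuming the SqlServer PowerShell module is installed and a database-scoped
# credential with access to the staging storage account (StagingCredential) already exists.
$loadSql = @"
-- Step 1: external objects that point at the flat files in blob storage.
CREATE EXTERNAL DATA SOURCE StagingBlobs
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://staging@<storage-account>.blob.core.windows.net',
      CREDENTIAL = StagingCredential);

CREATE EXTERNAL FILE FORMAT PipeDelimited
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

CREATE EXTERNAL TABLE ext.Sales_Orders
(OrderID INT, CustomerID INT, OrderDate DATE)
WITH (LOCATION = '/Sales_Orders.txt',
      DATA_SOURCE = StagingBlobs,
      FILE_FORMAT = PipeDelimited);

-- Step 2: copy the data into a staging heap table inside the warehouse.
CREATE TABLE stg.Sales_Orders
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM ext.Sales_Orders;
"@

Invoke-Sqlcmd -ServerInstance '<synapse-server>.database.windows.net' `
    -Database 'sqldw' -Username 'loaduser' -Password '<password>' -Query $loadSql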
Recommendations:
Consider Azure Synapse when you have large amounts of data (more than 1 TB) and are running an analytics
workload that will benefit from parallelism. Azure Synapse is not a good fit for OLTP workloads or smaller data
sets (less than 250 GB). For data sets less than 250 GB, consider Azure SQL Database or SQL Server. For more
information, see Data warehousing.
Create the staging tables as heap tables, which are not indexed. The queries that create the production tables will
result in a full table scan, so there is no reason to index the staging tables.
PolyBase automatically takes advantage of parallelism in the warehouse. The load performance scales as you
increase DWUs. For best performance, use a single load operation. There is no performance benefit to breaking
the input data into chunks and running multiple concurrent loads.
PolyBase can read Gzip compressed files. However, only a single reader is used per compressed file, because
uncompressing the file is a single-threaded operation. Therefore, avoid loading a single large compressed file.
Instead, split the data into multiple compressed files, in order to take advantage of parallelism.
Be aware of the following limitations:
PolyBase supports a maximum column size of varchar(8000) , nvarchar(4000) , or varbinary(8000) . If you
have data that exceeds these limits, one option is to break the data up into chunks when you export it, and
then reassemble the chunks after import.
PolyBase uses a fixed row terminator of \n or newline. This can cause problems if newline characters
appear in the source data.
Your source data schema might contain data types that are not supported in Azure Synapse.
To work around these limitations, you can create a stored procedure that performs the necessary conversions.
Reference this stored procedure when you run bcp. Alternatively, Redgate Data Platform Studio automatically
converts data types that aren't supported in Azure Synapse.
For more information, see the following articles:
Best practices for loading data into Azure Synapse.
Migrate your schemas to Azure Synapse
Guidance for defining data types for tables in Azure Synapse
Transform the data
Transform the data and move it into production tables. In this step, the data is transformed into a star schema
with dimension tables and fact tables, suitable for semantic modeling.
Create the production tables with clustered columnstore indexes, which offer the best overall query performance.
Columnstore indexes are optimized for queries that scan many records. Columnstore indexes don't perform as
well for singleton lookups (that is, looking up a single row). If you need to perform frequent singleton lookups,
you can add a non-clustered index to a table. Singleton lookups can run significantly faster using a non-clustered
index. However, singleton lookups are typically less common in data warehouse scenarios than OLTP workloads.
For more information, see Indexing tables in Azure Synapse.

NOTE
Clustered columnstore tables do not support varchar(max), nvarchar(max), or varbinary(max) data types. In that
case, consider a heap or clustered index. You might put those columns into a separate table.

Because the sample database is not very large, we created replicated tables with no partitions. For production
workloads, using distributed tables is likely to improve query performance. See Guidance for designing
distributed tables in Azure Synapse. Our example scripts run the queries using a static resource class.
Load the semantic model
Load the data into a tabular model in Azure Analysis Services. In this step, you create a semantic data model by
using SQL Server Data Tools (SSDT). You can also create a model by importing it from a Power BI Desktop file.
Because Azure Synapse does not support foreign keys, you must add the relationships to the semantic model, so
that you can join across tables.
Use Power BI to visualize the data
Power BI supports two options for connecting to Azure Analysis Services:
Import. The data is imported into the Power BI model.
Live Connection. Data is pulled directly from Analysis Services.
We recommend Live Connection because it doesn't require copying data into the Power BI model. Also, using
DirectQuery ensures that results are always consistent with the latest source data. For more information, see
Connect with Power BI.
Recommendations:
Avoid running BI dashboard queries directly against the data warehouse. BI dashboards require very low
response times, which direct queries against the warehouse may be unable to satisfy. Also, refreshing the
dashboard will count against the number of concurrent queries, which could impact performance.
Azure Analysis Services is designed to handle the query requirements of a BI dashboard, so the recommended
practice is to query Analysis Services from Power BI.

Scalability considerations
Azure Synapse
With Azure Synapse, you can scale out your compute resources on demand. The query engine optimizes queries
for parallel processing based on the number of compute nodes, and moves data between nodes as necessary. For
more information, see Manage compute in Azure Synapse.
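As an illustrative sketch, you can change the compute (DWU) level of a dedicated SQL pool from the Azure CLI; the resource names are placeholders, and exact parameter names may vary with your CLI version:
# Scale the data warehouse to a higher service objective before a heavy load window
az sql dw update --resource-group my-rg --server my-sqlserver --name my-datawarehouse --service-objective DW400c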
Analysis Services
For production workloads, we recommend the Standard Tier for Azure Analysis Services, because it supports
partitioning and DirectQuery. Within a tier, the instance size determines the memory and processing power.
Processing power is measured in Query Processing Units (QPUs). Monitor your QPU usage to select the
appropriate size. For more information, see Monitor server metrics.
Under high load, query performance can become degraded due to query concurrency. You can scale out Analysis
Services by creating a pool of replicas to process queries, so that more queries can be performed concurrently.
The work of processing the data model always happens on the primary server. By default, the primary server also
handles queries. Optionally, you can designate the primary server to run processing exclusively, so that the query
pool handles all queries. If you have high processing requirements, you should separate the processing from the
query pool. If you have high query loads, and relatively light processing, you can include the primary server in
the query pool. For more information, see Azure Analysis Services scale-out.
To reduce the amount of unnecessary processing, consider using partitions to divide the tabular model into
logical parts. Each partition can be processed separately. For more information, see Partitions.

Security considerations
IP allow list of Analysis Services clients
Consider using the Analysis Services firewall feature to allow list client IP addresses. If enabled, the firewall blocks
all client connections other than those specified in the firewall rules. The default rules allow list the Power BI
service, but you can disable this rule if desired. For more information, see Hardening Azure Analysis Services with
the new firewall capability.
Authorization
Azure Analysis Services uses Azure Active Directory (Azure AD) to authenticate users who connect to an Analysis
Services server. You can restrict what data a particular user is able to view, by creating roles and then assigning
Azure AD users or groups to those roles. For each role, you can:
Protect tables or individual columns.
Protect individual rows based on filter expressions.
For more information, see Manage database roles and users.

DevOps considerations
Create separate resource groups for production, development, and test environments. Separate resource
groups make it easier to manage deployments, delete test deployments, and assign access rights.
Use the Azure Building Blocks templates provided in this architecture, or create Azure Resource Manager templates, to deploy the Azure resources following the infrastructure as code (IaC) process. With templates, automating deployments by using Azure DevOps Services or other CI/CD solutions is easier.
Put each workload in a separate deployment template and store the resources in source control systems.
You can deploy the templates together or individually as part of a CI/CD process, making the automation
process easier.
In this architecture, there are three main workloads:
The data warehouse server, Analysis Services, and related resources.
Azure Data Factory.
An on-premises to cloud simulated scenario.
Each workload has its own deployment template.
The data warehouse server is set up and configured by using Azure CLI commands, which follows the imperative approach of the IaC practice. Consider using deployment scripts and integrating them into the automation process.
Consider staging your workloads. Deploy to various stages and run validation checks at each stage before
moving to the next stage. That way you can push updates to your production environments in a highly
controlled way and minimize unanticipated deployment issues. Use Blue-green deployment and Canary
releases strategies for updating live production environments.
Have a good rollback strategy for handling failed deployments. For example, you can automatically
redeploy an earlier, successful deployment from your deployment history. See the --rollback-on-error flag in the Azure CLI.
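A minimal sketch of deploying one workload template to a resource group with automatic rollback to the last successful deployment on error (the resource group and template file names are placeholders):
# Deploy a workload template; roll back to the last successful deployment if this one fails
az deployment group create \
  --resource-group bi-prod-rg \
  --template-file data-warehouse.json \
  --rollback-on-error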
Azure Monitor is the recommended option for analyzing the performance of your data warehouse and the entire Azure analytics platform, providing an integrated monitoring experience. Azure Synapse Analytics provides a monitoring experience within the Azure portal that surfaces insights into your data warehouse workload. The Azure portal is the recommended tool for monitoring your data warehouse because it provides configurable retention periods, alerts, recommendations, and customizable charts and dashboards for metrics and logs.
For more information, see the DevOps section in Microsoft Azure Well-Architected Framework.
Cost considerations
Azure Synapse
Choose Compute Optimized Gen1 for frequent scaling operations. This option is priced as pay-as-you-go, based on data warehouse unit (DWU) consumption.
Choose Compute Optimized Gen2 for intensive workloads with higher query performance and compute scalability needs. You can choose the pay-as-you-go model or use reserved plans of one year (37% savings) or three years (65% savings).
Data storage is charged separately. Other services such as disaster recovery and threat detection are also charged
separately.
For more information, see Azure Synapse Pricing.
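Compute charges stop while a dedicated SQL pool is paused (storage is still billed), so pausing outside of load and query windows is a simple cost lever. A hedged Azure CLI sketch, with placeholder resource names:
# Pause compute when the warehouse is idle, and resume it before the next load window
az sql dw pause --resource-group my-rg --server my-sqlserver --name my-datawarehouse
az sql dw resume --resource-group my-rg --server my-sqlserver --name my-datawarehouse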
Azure Analysis Services
Pricing for Azure Analysis Services depends on the tier. The reference implementation of this architecture uses the Developer tier, which is recommended for evaluation, development, and test scenarios. Other tiers include the Basic tier, which is recommended for small production environments, and the Standard tier, which is recommended for mission-critical production applications. For more information, see The right tier when you need it.
No charges apply when you pause your instance.
For more information, see Azure Analysis Services pricing.
Blob Storage
Consider using the Azure Storage reserved capacity feature to lower storage costs. With this model, you get a discount if you commit to a reservation for fixed storage capacity for one or three years. For more information, see Optimize costs for Blob storage with reserved capacity.
Power BI Embedded
Power BI Embedded is a Platform-as-a-Service (PaaS) solution that offers a set of APIs to enable the integration of
Power BI content into custom apps and websites. Users who publish BI content need to be licensed with Power BI
Pro. For information about pricing, see Power BI Embedded pricing.
For more information, see the Cost section in Microsoft Azure Well-Architected Framework.

Deploy the solution


To deploy and run the reference implementation, follow the steps in the GitHub readme. It deploys the
following resources:
A Windows VM to simulate an on-premises database server. It includes SQL Server 2017 and related tools,
along with Power BI Desktop.
An Azure storage account that provides Blob storage to hold data exported from the SQL Server database.
An Azure Synapse instance.
An Azure Analysis Services instance.

Next steps
Use Azure Data Factory to automate the ELT pipeline. See Automated enterprise BI with Azure Synapse and
Azure Data Factory.

Related resources
You may want to review the following Azure example scenarios that demonstrate specific solutions using some of
the same technologies:
Data warehousing and analytics for sales and marketing
Hybrid ETL with existing on-premises SSIS and Azure Data Factory
Enterprise integration on Azure using message
queues and events
11/2/2020 • 4 minutes to read

This reference architecture integrates enterprise backend systems, using message queues and events to decouple
services for greater scalability and reliability. The backend systems may include software as a service (SaaS)
systems, Azure services, and existing web services in your enterprise.

Architecture
The architecture shown here builds on a simpler architecture that is shown in Basic enterprise integration. That
architecture uses Logic Apps to orchestrate workflows and API Management to create catalogs of APIs.
This version of the architecture adds two components that help make the system more reliable and scalable:
Azure Service Bus. Service Bus is a secure, reliable message broker.
Azure Event Grid. Event Grid is an event routing service. It uses a publish/subscribe (pub/sub) eventing model.
Asynchronous communication using a message broker provides a number of advantages over making direct,
synchronous calls to backend services:
Provides load-leveling to handle bursts in workloads, using the Queue-Based Load Leveling pattern.
Reliably tracks the progress of long-running workflows that involve multiple steps or multiple applications.
Helps to decouple applications.
Integrates with existing message-based systems.
Allows work to be queued when a backend system is not available.
Event Grid enables the various components in the system to react to events as they happen, rather than relying on
polling or scheduled tasks. As with a message queue, it helps decouple applications and services. An application or
service can publish events, and any interested subscribers will be notified. New subscribers can be added without
updating the sender.
Many Azure services support sending events to Event Grid. For example, a logic app can listen for an event when
new files are added to a blob store. This pattern enables reactive workflows, where uploading a file or putting a
message on a queue kicks off a series of processes. The processes might be executed in parallel or in a specific
sequence.

Recommendations
The recommendations described in Basic enterprise integration apply to this architecture. The following
recommendations also apply:
Service Bus
Service Bus has two delivery modes, pull or push. In the pull model, the receiver continuously polls for new messages. Polling can be inefficient, especially if you have many queues that each receive only a few messages, or if there is a lot of time between messages. In the push model, Service Bus sends an event through Event Grid when there are new messages. The receiver subscribes to the event. When the event is triggered, the receiver pulls the next batch of messages from Service Bus.
When you create a logic app to consume Service Bus messages, we recommend using the push model with Event
Grid integration. It's often more cost efficient, because the logic app doesn't need to poll Service Bus. For more
information, see Azure Service Bus to Event Grid integration overview. Currently, Service Bus Premium tier is
required for Event Grid notifications.
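The following Azure CLI sketch shows the general shape of this setup: a Premium namespace, a queue, and an Event Grid subscription that notifies an endpoint (for example, a logic app's HTTP callback URL) when active messages are available with no listeners. All names and the endpoint URL are placeholders, and parameter details may differ between CLI versions.
# Premium tier is required for Service Bus integration with Event Grid
az servicebus namespace create --resource-group int-rg --name contoso-sb-ns --sku Premium
az servicebus queue create --resource-group int-rg --namespace-name contoso-sb-ns --name orders

# Subscribe an endpoint to "active messages, no listeners" events on the namespace
sbid=$(az servicebus namespace show --resource-group int-rg --name contoso-sb-ns --query id --output tsv)
az eventgrid event-subscription create \
  --name orders-available \
  --source-resource-id "$sbid" \
  --included-event-types Microsoft.ServiceBus.ActiveMessagesAvailableWithNoListeners \
  --endpoint "<logic-app-callback-url>"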
Use PeekLock for accessing a group of messages. When you use PeekLock, the logic app can perform steps to
validate each message before completing or abandoning the message. This approach protects against accidental
message loss.
Event Grid
When an Event Grid trigger fires, it means at least one event happened. For example, when a logic app gets an Event Grid trigger for a Service Bus message, it should assume that several messages might be available to process.

Scalability considerations
To achieve higher scalability, the Service Bus Premium tier can scale out the number of messaging units. Premium
tier configurations can have one, two, or four messaging units. For more information about scaling Service Bus,
see Best practices for performance improvements by using Service Bus Messaging.
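For example, a hedged sketch of setting the messaging units on an existing Premium namespace with the Azure CLI (names are placeholders, and parameter names may vary by CLI version):
# Scale a Premium namespace from one to two messaging units
az servicebus namespace update --resource-group int-rg --name contoso-sb-ns --sku Premium --capacity 2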

Availability considerations
Review the SLA for each service:
API Management SLA
Event Grid SLA
Logic Apps SLA
Service Bus SLA
To enable failover if a serious outage occurs, consider implementing geo-disaster recovery in Service Bus
Premium. For more information, see Azure Service Bus geo-disaster recovery.

DevOps considerations
See DevOps considerations in Basic Enterprise Integration reference architecture
Security considerations
To secure Service Bus, use shared access signature (SAS). You can grant a user access to Service Bus resources
with specific rights by using SAS authentication. For more information, see Service Bus authentication and
authorization.
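As a sketch, you can scope access with a queue-level authorization rule that grants only Send rights, and then distribute keys (or SAS tokens derived from them) for that rule; the names below are placeholders:
# Create a send-only authorization rule on the queue, then list its keys
az servicebus queue authorization-rule create \
  --resource-group int-rg --namespace-name contoso-sb-ns --queue-name orders \
  --name send-only --rights Send
az servicebus queue authorization-rule keys list \
  --resource-group int-rg --namespace-name contoso-sb-ns --queue-name orders \
  --name send-only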
If you need to expose a Service Bus queue as an HTTP endpoint, for example, to post new messages, use API
Management to secure the queue by fronting the endpoint. You can then secure the endpoint with certificates or
OAuth authentication as appropriate. The easiest way to secure an endpoint is using a logic app with an HTTP
request/response trigger as an intermediary.
The Event Grid service secures event delivery through a validation code. If you use Logic Apps to consume the
event, validation is automatically performed. For more information, see Event Grid security and authentication.

Cost Considerations
In general, use the Azure pricing calculator to estimate costs. Here are some other considerations.
API Management
You are charged for all API Management instances when they are running. If you have scaled up and don't need
that level of performance all the time, manually scale down or configure autoscaling.
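A hedged sketch of configuring metric-based autoscale for an API Management instance with Azure Monitor autoscale; the resource names, the Capacity metric, and the thresholds are assumptions to validate for your tier:
# Attach an autoscale setting to the API Management instance
apimid=$(az apim show --resource-group int-rg --name contoso-apim --query id --output tsv)
az monitor autoscale create --resource-group int-rg --name apim-autoscale \
  --resource "$apimid" --min-count 1 --max-count 3 --count 1

# Scale out by one unit when average capacity exceeds 70% over 10 minutes
az monitor autoscale rule create --resource-group int-rg --autoscale-name apim-autoscale \
  --condition "Capacity > 70 avg 10m" --scale out 1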
Logic Apps
Logic Apps uses a serverless model. Billing is calculated based on action and connector execution. For more
information, see Logic Apps pricing.
Service Bus queues
Service Bus queues support both push and pull models for delivering messages. In the pull model, every polling request is metered as an action. Even with long polling at 30 seconds (the default), cost can be high. Unless you need real-time delivery of messages, consider using the push model.
Service Bus queues are included in all tiers (Basic, Standard, and Premium). For more information, see Azure Service Bus pricing.
Event Grid
Event Grid uses a serverless model. Billing is calculated based on the number of operations (event executions).
Operations include ingress of events to Domains or Topics, advanced matches, delivery attempts, and
management calls. Usage of up to 100,000 operations is free of charge.
For more information, see Event Grid pricing.
For more information, see the cost section in Microsoft Azure Well-Architected Framework.
Publishing internal APIs to external consumers
11/2/2020 • 8 minutes to read

In this scenario, an organization has hosted multiple APIs using App Service Environments (ILB ASE) and would like to consolidate these APIs internally using Azure API Management (APIM) deployed inside a virtual network. The internal API Management instance could also be exposed to external users to allow the full potential of the APIs to be utilized. This external exposure could be achieved using an Application Gateway forwarding requests to the internal API Management service, which in turn consumes the APIs deployed in the ASE.

Architecture

The above scenario covers the complete lifecycle of internal APIs being consumed by external users.
The data flows as follows:
1. Developers check in code to a GitHub repository connected to a CI/CD pipeline agent installed on an Azure VM.
2. The agent pushes the build to the API application hosted on the ILB ASE.
3. API Management consumes the above APIs via HOST headers specified in an API Management policy.
4. API Management uses the App Service Environment's DNS name for all the APIs.
5. Application Gateway exposes API Management's developer portal and API portal.
6. Azure Private DNS is used to route the traffic internally between the ASE, API Management, and Application Gateway.
7. External users use the exposed developer portal to consume the APIs via Application Gateway's public IP address.
Components
Azure Virtual Network enables Azure resources to securely communicate with each other, the internet, and on-
premises networks.
Azure Private DNS allows domain names to be resolved in a virtual network without needing to add a custom
DNS solution.
Azure API Management helps organizations publish APIs to external, partner, and internal developers to use
their data and services.
Application Gateway is a web traffic load balancer that enables you to manage traffic to your web applications.
Internal Load Balancer App Service Environment is an Azure App Service feature that provides a fully isolated
and dedicated environment for securely running App Service apps at high scale.
Azure DevOps is a service for managing your development lifecycle and includes features for planning and
project management, code management, build, and release.
Application Insights is an extensible Application Performance Management (APM) service for web developers on
multiple platforms.
Azure Cosmos DB is Microsoft's globally distributed, multi-model database service.
Alternatives
In an Azure lift and shift scenario deployed into an Azure Virtual Network, back-end servers could be directly
addressed through private IP addresses.
If using on-premises resources, the API Management instance could reach back to the internal service privately
via an Azure VPN gateway and site-to-site IPSec VPN connection or ExpressRoute making a hybrid Azure and
on-premises scenario.
Existing or open-source DNS providers could be used instead of the Azure-based DNS Service.
Internal APIs deployed outside of Azure can still benefit by exposing the APIs through API Management Service.

Considerations
The web APIs are hosted over the secure HTTPS protocol and use a TLS certificate.
The Application Gateway also is configured over port 443 for secured and reliable outbound calls.
The API Management service is configured to use custom domains using TLS certificates.
Review the suggested network configuration for App Service Environments
You need to explicitly allow inbound traffic on port 3443 so that API Management can be managed via the Azure portal or PowerShell (see the NSG rule sketch after this list).
Leverage policies within APIM to add a HOST header for the API hosted on ASE. This ensures that the ASE's load
balancer will properly forward the request.
The API Management accepts ASE's DNS entry for all the apps hosted under App Service Environments. Add an
APIM policy to explicitly set the HOST Header to allow the ASE load balancer to differentiate between Apps
under the App Service Environment.
Consider Integrating with Azure Application Insights, which also surfaces metrics through Azure Monitor for
monitoring.
If using CI/CD pipelines for deploying Internal APIs, consider building your own Hosted Agent on a VM inside
the Virtual Network.
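As referenced in the list above, a hedged Azure CLI sketch of an NSG rule that allows the API Management management endpoint on port 3443 (the resource group, NSG, and rule names are placeholders):
# Allow inbound management traffic to API Management on port 3443
az network nsg rule create \
  --resource-group apim-rg --nsg-name apim-subnet-nsg --name allow-apim-management \
  --priority 200 --direction Inbound --access Allow --protocol Tcp \
  --source-address-prefixes ApiManagement --source-port-ranges '*' \
  --destination-address-prefixes VirtualNetwork --destination-port-ranges 3443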
Availability
The Azure API Management service could be deployed as a multi-region deployment for higher availability and to reduce latency. This feature is only available in the Premium tier. The API Management service in this specific scenario consumes APIs from App Service Environments. You could also use APIM for APIs hosted on internal on-premises infrastructure.
App Service Environments could make use of Traffic Manager profiles to distribute traffic across App Service Environments for higher scale and availability.
Scalability
API Management instances can be scaled out depending on a number of factors, such as the number and rate of concurrent connections, the kind and number of configured policies, request and response sizes, and back-end latencies on the APIs. Scale-out options are available in the Basic, Standard, and Premium tiers, but tiers below Premium are bound by an upper scale limit. Instances are referred to as units and can be scaled up to a maximum of two units in the Basic tier, four units in the Standard tier, and any number of units in the Premium tier. Autoscaling options are also available to enable scale-out based on rules.
App Service Environments are designed for scale with limits based on the pricing tier and the apps hosted under
the App Service Environments can be configured to scale out (number of instances) or scale up (instance size)
depending upon the requirements of the application.
Azure Application Gateway autoscaling is available as part of the zone-redundant SKU in all global Azure regions. See the public preview feature regarding Application Gateway autoscaling.
Security
Since the above example scenario is hosted completely on an internal network, API Management and ASE are
already deployed on secured infrastructure (Azure VNet). Application Gateways can be integrated with Azure
Security Center to provide a seamless way to prevent, detect, and respond to threats to the environment. For
general guidance on designing secure solutions, see the Azure Security Documentation.
Resiliency
Although this example scenario focuses mostly on configuration, the APIs hosted on the App Service Environments should be resilient enough to handle errors in requests, which are ultimately managed by the API Management service and Application Gateway. Consider the Retry and Circuit Breaker patterns in the API design. For general guidance on designing resilient solutions, see Designing resilient applications for Azure.

Deploy the scenario


Prerequisites and assumptions
1. A custom domain name will need to be purchased.
2. A TLS certificate (this example uses a wildcard certificate from the Azure Certificates service) to cover all the custom domains. You could also procure a self-signed certificate for dev/test scenarios.
3. This specific deployment uses the domain name contoso.org and a wildcard TLS certificate for the domain.
4. The deployment is using the resource names and address spaces mentioned in the deployment section, which
can be configured.
Deployment and putting the pieces together

The components deployed using the above Resource Manager template need to be further configured as follows:
1. VNet with the following configurations:
Name: ase-internal-vnet
Address space for VNet: 10.0.0.0/16
Four Subnets
backendSubnet for DNS Service: 10.0.0.0/24
apimsubnet for Internal API Management Service: 10.0.1.0/28
asesubnet for ILB ASE: 10.0.2.0/24
VMSubnet for Test VMs and Internal DevOps Hosted Agent VM: 10.0.3.0/24
2. Private DNS service (public preview). Create it at this stage, because adding a DNS service requires the VNet to be empty.
Refer to the deployment guidelines for more information.
3. App Service Environment with the Internal Load Balancer (ILB) option: aseinternal (DNS: aseinternal.contoso.org). Once the deployment is complete, upload the wildcard certificate for the ILB.
4. App Service Plan with ASE as location
5. An API app (App Services for simplicity) - srasprest (URL: https://srasprest.contoso.org) – an ASP.NET MVC-based web API. After the deployment, configure:
the web app to use the TLS certificate
Application Insights for the above apps: api-insights
a Cosmos DB service for the web APIs hosted internal to the VNet: noderestapidb
DNS entries in the Private DNS zone that was created
You could make use of Azure Pipelines to configure the agents on virtual machines to deploy the code for the web app on the internal network.
For testing the API app internally, create a test VM within the VNet subnet.
6. Create the API Management service: apim-internal.
7. Configure the service to connect to the internal VNet on the subnet apimsubnet. After the deployment is complete, perform the following additional steps:
Configure custom domains for the APIM services using TLS:
API portal (api.contoso.org)
Developer portal (portal.contoso.org)
In the APIs section, configure the ASE apps by using the ASE's DNS name, and add a policy that sets the HOST header for the web app.
Use the test VM created above to test the API Management service internally on the virtual network.

NOTE
Testing the APIM APIs from the Azure portal will still NOT work, because api.contoso.org cannot be publicly resolved.

8. Configure Application Gateway (WAF V1) to access the API service: apim-gateway on Port 80. Add TLS certs
to the App Gateway and corresponding Health probes and Http settings. Also configure the Rules and
Listeners to use the TLS cert.
Once the above steps are successfully completed, configure the DNS entries in GoDaddy: create CNAME entries for api.contoso.org and portal.contoso.org that point to the Application Gateway's public DNS name (ase-appgtwy.westus.cloudapp.azure.com). Then verify that you are able to reach the developer portal publicly and that you can test the APIM service's APIs by using the Azure portal.
It is not a good practice to use the same URL for the internal and external endpoints of the APIM services (in the above demo, both URLs are the same). If you want to have different URLs for the internal and external endpoints, you could make use of Application Gateway WAF v2, which supports HTTP redirection and much more.

Pricing
API Management is offered in four tiers: developer, basic, standard, and premium. You can find detailed guidance on
the difference in these tiers at the Azure API Management pricing guidance here.
Customers can scale API Management by adding and removing units. Each unit has capacity that depends on its
tier.

NOTE
The Developer tier can be used for evaluation of the API Management features. The Developer tier should not be used for
production.

To view projected costs and customize to your deployment needs, you can modify the number of scale units and
App Service instances in the Azure Pricing Calculator.
Similarly, App Service Environment pricing guidance is provided here.
Application Gateway pricing can be estimated here, depending on the required tier and resources.
Related resources
Check out the related scenario on Migrating legacy web APIs to API Management
Web and Mobile Front Ends
11/13/2020 • 2 minutes to read

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Overview
This example scenario demonstrates how you can accelerate frontend and business process development with a drag-and-drop visual designer. Leverage Azure Functions for low-latency processing and quick reads/writes to Azure Table Storage. Azure Power Apps and Power Automate provide out-of-the-box connectors for you to tap into data sources that can live anywhere, including third-party systems.

Architecture
Data Flow
The data flows through the solution as follows:
1. Power Apps reads data from custom web APIs surfaced through Functions.
2. Functions, upon calls from Power Automate, converts raw data into HTML.
3. Power Automate reads and writes data from Azure Table Storage and retrieves data from a third-party auditing system.
4. Telemetry data is written to Azure Table Storage.
5. Power Apps presents rich information across iOS, Android, and web.

Components
Data is loaded from these different data sources using several Azure components:
Power Apps: Increase agility across your organization by rapidly building low-code apps that modernize
processes and solve tough challenges.
Azure Functions: Accelerate and simplify application development with serverless compute
Power Automate: Streamline repetitive tasks and paperless processes with Microsoft Power Automate—so you
can focus your attention where it's needed most.
Azure Table Storage: A NoSQL key-value store for rapid development using massive semi-structured datasets

Next Steps
Learn more: https://aka.ms/learnpowerapps
Custom Business Processes
11/13/2020 • 2 minutes to read

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Overview
This example demonstrates how you can deploy portals that automate manual or paper-based processes and surface a rich user experience. Leverage Azure API Management and Azure Functions to connect custom APIs that tap into your legacy systems. By using Azure managed databases and a low-code approach to automate tasks, you can lower the overall solution costs while easily building apps that are real-time, resilient, and scalable by default.

Architecture

Data Flow
The data flows through the solution as follows:
1. The airline system assigns flights to a Teams channel, and the flights are also displayed in Power Apps.
2. A custom API coordinator hosted in Azure API Management receives notifications and handles incoming messages from the airline system.
3. When a user selects a flight to monitor, or the system assigns the user to a flight, a Graph API call is queued for the coordinator to process.
4. Azure Functions runs the Graph API calls, which are processed in Azure Storage.
5. Notifications from the airline system are managed by a custom bot messaging service leveraging Azure Bot Service.
6. The custom bot sends flight updates to users in Teams.
7. Power BI generates reports from Azure Data Lake, based on Teams activity.

Components
Data is loaded from these different data sources using several Azure components:
Power Apps: Increase agility across your organization by rapidly building low-code apps that modernize
processes and solve tough challenges.
Azure Functions: Accelerate and simplify application development with serverless compute
Azure API management: Hybrid, multi-cloud management platform for APIs across all environments
Azure SQL Database: Build apps that scale with the pace of your business with managed and intelligent SQL in
the cloud
Azure Cosmos DB: Fast NoSQL database with open APIs for any scale
Azure Storage: Massively scalable, secure cloud storage for your data, apps, and workloads
Azure Event Hubs: Simple, secure, and scalable real-time data ingestion
Azure Data Lake Storage: Massively scalable and secure data lake for your high-performance analytics
workloads
Azure Bot Service: A managed service purpose-built for bot development
Azure Data Factory: Hybrid data integration service that simplifies ETL at scale
Power Automate: Streamline repetitive tasks and paperless processes
Power BI is a suite of business analytics tools to analyze data and share insights.

Next Steps
Learn more: https://aka.ms/learnpowerapps
Line of Business Extension
11/13/2020 • 2 minutes to read

Solution Idea
If you'd like to see us expand this article with more information, implementation details, pricing guidance, or code examples,
let us know with GitHub Feedback!

Overview
This example shows how you can modernize legacy systems that cannot support new processes and provide a better user experience. Out-of-the-box or custom connectors can help simplify process automation and connect to third-party data sources or legacy systems. Azure Functions can then run calculations on a schedule to simplify your business processes. Power Apps can be used to process and share data with third parties via standard exports.

Architecture
Data Flow
The data flows through the solution as follows:
1. Supplier data stored in CDS is moved to SQL via Data Factory.
2. Purchase order (PO) data stored in ERP system is sent to Azure SQL database.
3. Azure Functions uses an API to surface PO data monthly and creates a task for the user to review.
4. Power Apps retrieves data from Azure SQL Database through API.
5. User reviews and updates POs in Power Apps and sends this data to suppliers through CSV export.
6. Power BI reports trends in supplier status.

Components
Data is loaded from these different data sources using several Azure components:
Power Apps: Increase agility across your organization by rapidly building low-code apps that modernize
processes and solve tough challenges.
Azure Functions: Accelerate and simplify application development with serverless compute
Azure API management: Hybrid, multi-cloud management platform for APIs across all environments
Azure SQL Database: Build apps that scale with the pace of your business with managed and intelligent SQL in
the cloud
Azure Data Factory: Hybrid data integration service that simplifies ETL at scale
Power BI: A suite of business analytics tools to analyze data and share insights. Power BI can query a semantic model stored in Analysis Services, or it can query Azure Synapse directly.

Next Steps
Learn more: https://aka.ms/learnpowerapps
On-premises data gateway for Azure Logic Apps
11/2/2020 • 3 minutes to read

This reference architecture illustrates a logic app that's running in Microsoft Azure, which is triggered by Azure
Spring Cloud. It then connects to on-premises resources such as Microsoft SQL Server and Microsoft SharePoint
Server.

Download a Visio file of this architecture.


Typical uses for this architecture include:
Cloud-based Azure Logic Apps workflows that require data from on-premises software as part of their run.
Extending the capabilities of existing on-premises software by triggering Logic Apps workflows in the cloud.

Architecture
The architecture consists of the following components:
Azure Spring Cloud. Spring Cloud provides a managed service that's designed and optimized specifically for
Spring microservices that are written in Java.
Azure Logic Apps. Logic apps are automated workflows that are provided as a scalable cloud service for
common enterprise orchestration tasks. Logic apps include connectors for many popular cloud services, on-
premises products, or other software as a service applications. The Logic Apps workflow includes the following
features:
Trigger that fires whenever a new Azure Queue storage message is received.
Action to parse the JavaScript Object Notation (JSON) body of the queue message.
Action to query SQL Server for relevant data.
Action to compose a response from the collected data.
Action to persist the composed response to a list on the SharePoint Server.
On-premises data gateway. An on-premises data gateway is bridge software that connects on-premises
data to cloud services. The gateway typically installs on a dedicated on-premises virtual machine.
SQL Server. This is an installation of SQL Server.
SharePoint Server. This is an installation of SharePoint Server.

Recommendations
The following recommendations apply for most scenarios. Follow these recommendations unless you have specific
requirements that override them.
On-premises data gateway
While it's possible to expose your on-premises servers to the public internet, it's preferable to use an on-premises
data gateway. This gateway creates a secure read/write connection between your on-premises data sources and
Azure. The on-premises data gateway is used with Logic Apps connectors to connect local data sources to Azure.
Before installing an on-premises data gateway, consider the following:
You can install an on-premises data gateway on any on-premises virtual machine. You can choose to install it on
the same virtual machine as your server applications or on a co-located virtual machine with the appropriate
network connectivity.
The architecture of the on-premises data gateway is designed to make outbound connections to Azure Service
Bus.
Review the additional considerations for installing an on-premises data gateway.
An on-premises data gateway can be used with more than just Logic Apps. It's possible to use the same data
gateway installation with:
Power BI
Power Automate
Power Apps
Azure Analysis Services

Scalability considerations
As the number of Logic Apps workflows increases, consider the capacity constraints of the on-premises servers. You'll need to determine whether the on-premises servers can handle the increased workload.

Availability considerations
Avoid single point of failure issues for an on-premises data gateway by installing the software on multiple on-
premises virtual machines.
Consider implementing high availability topologies for your on-premises servers by using techniques such as:
High availability architecture for SharePoint Server
SQL Server AlwaysOn

Manageability considerations
Service Bus is used for outbound data gateway communication. This might require configuring your firewall to
allow outbound connections to Azure.
Consider Azure ExpressRoute if you want consistent throughput from your on-premises data sources to Azure.

DevOps considerations
The corresponding Azure resource for an on-premises data gateway should only be created after the
corresponding software is installed on an on-premises virtual machine.
Consider storing workflow configuration as a JSON template within an Azure Resource Manager template to
automate deployment.

Security considerations
While it's possible to expose your on-premises servers to the public internet, it's preferable to use an on-
premises data gateway. This gateway creates a secure read/write connection between your on-premises data
sources and Azure.

Cost considerations
Use the Azure pricing calculator to estimate costs.
This reference assumes that the consumption plan is used to create a global Logic Apps resource.
This reference only uses standard connectors, which are metered at a lower price than enterprise connectors.
Vision with Azure IoT Edge
11/13/2020 • 4 minutes to read

Visual inspection of products, resources, and environments has been a core practice for most enterprises and was, until recently, a very manual process. An individual, or a group of individuals, would be responsible for manually
inspecting the assets or the environment. Depending on the circumstances, this could become inefficient,
inaccurate, or both, due to human error and limitations.
To improve the efficacy of visual inspection, enterprises began turning to deep learning artificial neural networks
known as convolutional neural networks (or CNNs), to emulate human vision for analysis of images and video.
Today this is commonly called computer vision, or simply Vision AI. Artificial intelligence for image analytics spans
a wide variety of industries, including manufacturing, retail, healthcare, and the public sector, and an equally wide
area of use cases.
Vision for quality assurance - In manufacturing environments, Vision AI can be very helpful with quality
inspection of parts and processes with a high degree of accuracy and velocity. An enterprise pursuing this
path automates the inspection of a product for defects to answer questions such as:
Is the manufacturing process producing consistent results?
Is the product assembled properly?
Can there be an earlier notification of a defect to reduce waste?
How can drift in the computer vision model be leveraged to prescribe predictive maintenance?
Vision for safety - In any environment, safety is a fundamental concern for every enterprise, and the
reduction of risk is a driving force for adopting Vision AI. Automated monitoring of video feeds to scan for
potential safety issues provides critical time to respond to incidents, and opportunities to reduce exposure to
risk. Enterprises looking at Vision AI for this use case are commonly trying to answer questions such as:
How compliant is the workforce with using personal protective equipment?
How often are people entering unauthorized work zones?
Are products being stored in a safe manner?
Are there unreported close calls in a facility, or pedestrian/equipment “near misses”?

Why vision on the Edge


Over the past decade, computer vision has become a rapidly evolving area of focus for enterprises, as cloud-native technologies such as containerization have enabled portability and migration of this technology towards the
network edge. For instance, custom vision inference models trained in the cloud can be easily containerized for use
in an Azure IoT Edge runtime-enabled device.
The rationale behind migrating workloads from the cloud to the edge for Vision AI generally falls into two
categories – performance and cost.
On the performance side of the equation, exfiltrating large quantities of data can cause an unintended performance
strain on existing network infrastructure. Additionally, the latency of sending images and/or video streams to the
cloud to retrieve results may not meet the needs of the use case. For instance, a person straying into an
unauthorized area may require immediate intervention, and that scenario can affect latency when every second
counts. Positioning the inferencing model near the point of ingestion, allows for near real-time scoring of the
image. It also allows alerting to be performed either locally or through the cloud, depending on the network
topology.
In terms of cost, sending all of the data to the cloud for analysis could significantly impact the return on investment (ROI) of a Vision AI initiative. With Azure IoT Edge, a Vision AI module can be designed to capture only the
relevant images with a reasonable confidence level based on the scoring. This significantly limits the amount of
data being sent.

Camera considerations
The camera is understandably a very important component of an Azure IoT Edge Vision solution. To learn what considerations should be taken into account for this component, proceed to Camera selection in Azure IoT Edge Vision.

Hardware acceleration
To bring AI to the edge, the edge hardware should be able to run the powerful AI algorithms. To know the hardware
capabilities required for IoT Edge Vision, proceed to Hardware acceleration in Azure IoT Edge Vision.

Machine learning
Machine learning can be challenging for the data on the edge, due to resource restrictions of edge devices, limited
energy budget, and low compute capabilities. See Machine learning and data science in Azure IoT Edge Vision to
understand the key considerations in designing the machine learning capabilities of your IoT Edge Vision solution.

Image storage
Your IoT Edge Vision solution cannot be complete without careful consideration of how and where the images
generated will be stored. Read Image storage and management in Azure IoT Edge Vision for a thorough discussion.

Alerts
Your IoT Edge device may need to respond to various alerts in its environment. See Alert persistence in Azure IoT
Edge Vision to understand the best practices in managing these alerts.

User interface
The user interface or UI of your IoT Edge Vision solution will vary based on the target user. The article User
interface in Azure IoT Edge Vision discusses the main UI considerations.

Next steps
This series of articles demonstrate how to build a complete vision workload using Azure IoT Edge devices. For
further information, you may refer to the product documentation as following:
Azure IoT Edge documentation
Tutorial: Perform image classification at the edge with Custom Vision Service
Azure Machine Learning documentation
Azure Kinect DK documentation
MMdnn tool
ONNX
Camera selection in Azure IoT Edge Vision
11/2/2020 • 8 minutes to read

One of the most critical components in any AI Vision workload is selecting the right camera. The items being
identified by this camera must be presented in such a way that the artificial intelligence or machine learning
models can evaluate them correctly. An in-depth understanding of the different camera types is required to
understand this concept.

NOTE
There are different manufacturers for area, line, and smart cameras. Instead of recommending any one vendor over
another, Microsoft recommends that you select a vendor that fits your specific needs.

Types of cameras
Area scan cameras
This camera type generates the traditional camera image, where a 2D image is captured and then sent over to the
Edge hardware to be evaluated. This camera typically has a matrix of pixel sensors.
As the name suggests, area scan cameras look at a large area and are great at detecting change in an area.
Examples of workloads that could use an area scan camera would be workplace safety, or detecting or counting
objects (people, animals, cars, and so on) in an environment.
Examples of manufacturers of area scan cameras are Basler, Axis, Sony, Bosch, FLIR, Allied Vision.
Line scan cameras
Unlike the area scan cameras, the line scan camera has a single row of linear pixel sensors. This allows the camera
to take one-pixel-width images in quick succession and then stitch them together into a video stream. This
video stream is then sent over to an Edge device for processing.
Line scan cameras are great for vision workloads where the items to be identified are either moving past the
camera, or need to be rotated to detect defects. The line scan camera would then be able to produce a continuous
image stream for evaluation. Examples of workloads that would work best with a line scan camera are:
item defect detection on parts that are moved on a conveyor belt,
workloads that require spinning to see a cylindrical object, or
any workload that requires rotation.
Examples of manufacturers of line scan cameras are Basler, Teledyne Dalsa, Hamamatsu Corporation, DataLogic,
Vieworks, and Xenics.
Embedded smart cameras
This type of camera can use either an area scan or a line scan camera for capturing the images, although a line
scan smart camera is rare. An embedded smart camera can not only acquire an image, but can also process that
image as it is a self-contained stand-alone system. They typically have either an RS232 or an Ethernet port output,
which allows them to be integrated directly into a PLC or other IIoT interfaces.
Examples of manufacturers of embedded smart cameras are Basler, Lesuze Electronics.

Camera features
Sensor size
This is one of the most important factors to evaluate in any vision workload. A sensor is the hardware within a
camera that captures the light and converts into signals, which then produce an image. The sensor contains
millions of semiconducting photodetectors called photosites. A higher megapixel count does not always result in a
better image. For example, let’s look at two different sensor sizes for a 12-megapixel camera. Camera A has a ½
inch sensor with 12 million photosites and camera B has a 1-inch sensor with 12 million photosites. In the same
lighting conditions, the camera that has the 1-inch sensor will be cleaner and sharper. Many cameras typically used
in vision workloads have a sensor sized between ¼ inch to 1 inch. In some cases, much larger sensors might be
required.
If a camera has a choice between a larger sensor or a smaller sensor, some factors deciding why you might choose
the larger sensor are:
need for precision measurements,
lower light conditions,
shorter exposure times, or fast-moving items.
Resolution
This is another important factor for both line scan and area scan camera workloads. If your workload must identify fine features, such as the writing on an IC chip, then you need higher-resolution cameras. If your workload is trying to detect a face, then higher resolution is required. And if you need to identify a vehicle from a distance, again a higher resolution will be required.
Speed
Sensors come in two types: CCD and CMOS. If the vision workload requires a high number of images to be captured per second, then two factors come into play. The first is how fast the connection on the camera's interface is. The second is what type of sensor it is. CMOS sensors have a direct readout from the photosites, which is why they typically offer a higher frame rate.

NOTE
There are several other camera features to consider when selecting the correct camera for your vision workload. These
include lens selection, focal length, monochrome, color depth, stereo depth, triggers, physical size, and support. Sensor
manufacturers can help you understand the specific feature that your application may require.

Camera placement
The items that you are capturing in your vision workload will determine the location and angles that the camera
should be placed. The camera location can also affect the sensor type, lens type, and camera body type.
There are several different factors that can weigh into the overall decision for camera placement. Two of the most
critical ones are the lighting and the field of view.
Camera lighting
In a computer vision workload, lighting is a critical component to camera placement. There are several different
lighting conditions. While some lighting conditions would be useful for one vision workload, they might produce an undesirable effect in another. Types of lighting that are commonly used in computer vision workloads
are:
Direct lighting: This is the most commonly used lighting condition. This light source is projected at the
object to be captured for evaluation.
Line lighting: This is a single array of lights that are most used with line scan camera applications. This
creates a single line of light at the focus of the camera.
Diffused lighting: This type of lighting is used to illuminate an object but prevent harsh shadows and is
mostly used around specular objects.
Back lighting: This type of light source is used behind the object, producing a silhouette of the object. This
is most useful when taking measurements, edge detection, or object orientation.
Axial diffused lighting: This type of light source is often used with highly reflective objects, or to prevent
shadows on the part that will be captured for evaluation.
Custom Grid lighting: This is a structured lighting condition that lays out a grid of light on the object, the
intent is to have a known grid projection to then provide more accurate measurements of components,
parts, placement of items, and so on.
Strobe lighting: Strobe lighting is used for high-speed moving parts. The strobe must be in sync with the camera to take a freeze frame of the object for evaluation; this lighting helps to prevent motion blur.
Dark Field lighting: This type of light source uses several lights in conjunction with different angles to the
part to be captured. For example, if the part is laying flat on a conveyor belt the lights would be placed at a
45-degree angle to it. This type of lighting is most useful when looking at highly reflective clear objects and
is most commonly used with lens scratch detections.
The figure below shows the angular placement of light:

Field of view
In a vision workload, you need to know the distance to the object that you are trying to evaluate. This also will play
a part in the camera selection, sensor selection, and lens configuration. Some of the components that make up the
field of view are:
Distance to object(s): For example, is the object being monitored with computer vision on a conveyor belt
and the camera is two feet above it, or is the object across a parking lot? As the distance changes, so does the
camera’s sensors and lens configurations.
Area of coverage: Is the area that the computer vision is trying to monitor small or large? This has direct
correlation to the camera’s resolution, lens, and sensor type.
Direction of the sun: If the computer vision workload is outside, such as monitoring a job construction site
for worker safety, will the camera be pointed in the sun at any time? Keep in mind that if the sun is casting a
shadow over the object that the vision workload is monitoring, items might be a bit obscured. Also, if the
camera is getting direct sunlight in the lens, the camera might be blinded until the angle of the sun changes.
Camera angle to the object(s): Angle of the camera to the object that the vision workload is monitoring is
also a critical component to think about. If the camera is too high, it might miss the details that the vision
workload is trying to capture, and the same may be true if it is too low.

Communication interface
In building a computer vision workload, it is also important to understand how the system will interact with the
output of the camera. Below are a few of the standard ways that a camera will communicate to IoT Edge:
Real Time Streaming Protocol (RTSP): RTSP is a protocol that transfers real-time video data from a device (in our case, the camera) to an endpoint device (Edge compute) directly over a TCP/IP connection. It functions in a client-server application model that is at the application level in the network. A sketch of pulling a frame from an RTSP stream is shown after this list.
Open Network Video Interface Forum (ONVIF): A global and open industry forum that is developing open standards for IP-based cameras. This standard is aimed at standardizing communication between the IP camera and downstream systems, interoperability, and open source.
USB: Unlike RTSP and ONVIF, USB-connected cameras connect over the Universal Serial Bus directly to the Edge compute device. This is less complex; however, it limits the distance that the camera can be placed away from the Edge compute.
Camera Serial Interface: CSI specification is from Mobile Industry Processor Interface (MIPI). This
interface describes how to communicate between a camera and a host processor.
There are several standards defined for CSI:
CSI-1: This was the original standard that MIPI started with.
CSI-2: This standard was released in 2005, and uses either D-PHY or C-PHY as physical layer options.
It is further divided into several layers:
Physical Layer (C-PHY, D-PHY)
Lane Merger layer
Low-Level Protocol Layer
Pixel to Byte Conversion Layer
Application layer
The specification was updated in 2017 to v2, which added support for RAW-24 color depth, Unified Serial
Link, and Smart Region of Interest.
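As an illustrative sketch referenced in the RTSP item above (not part of the original guidance), a single frame can be pulled from a camera's RTSP stream on an edge device with ffmpeg; the stream URL is a placeholder:
# Grab one frame from the camera's RTSP stream over TCP and save it as a JPEG
ffmpeg -rtsp_transport tcp -i "rtsp://<camera-ip>:554/stream1" -frames:v 1 -q:v 2 snapshot.jpg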

Next steps
Now that you know the camera considerations for your IoT Edge Vision workload, proceed to setting up the right
hardware for your workload. Read Hardware acceleration in Azure IoT Edge Vision for more information.
Hardware acceleration in Azure IoT Edge Vision
11/2/2020 • 2 minutes to read

Along with the camera selection, one of the other critical decisions in Vision on the Edge projects is hardware
acceleration.
The following sections describe the key components of the underlying hardware.

CPU
The Central Processing Unit (CPU) is your default compute for most processes running on a computer. It is
designed for general purpose compute. For some vision workloads where timing is not as critical, this might be a
good option. However, most workloads that involve critical timing, multiple camera streams, and/or high frame
rates, will require more specific hardware acceleration.

GPU
Many people are familiar with the Graphics Processing Unit (GPU), as it is the de facto processor for any high-end PC graphics card. In recent years, the GPU has been leveraged in high-performance computing (HPC) scenarios, in data mining, and in computer AI/ML workloads. The GPU's massive potential for parallel computing can be used in a vision workload to accelerate the processing of pixel data. The downside to a GPU is its higher power
consumption, which is a critical factor to consider for your vision workload.

FPGA
Field Programmable Gate Arrays (FPGAs) are reconfigurable hardware accelerators. These powerful accelerators
support the growth of deep learning neural networks, which are still evolving. These accelerators have millions of
programmable gates, hundreds of I/O pins, and exceptional compute power in the trillions of tera-MACs. There
are also many different libraries available for FPGAs that are optimized for vision workloads. Some of these
libraries also include preconfigured interfaces to connect to downstream cameras and devices. One area where
FPGAs tend to fall short is floating-point operations. However, manufacturers are currently working on this issue
and have made many improvements in this area.

ASIC
An Application-Specific Integrated Circuit (ASIC) is by far the fastest accelerator on the market today. While ASICs
are the fastest, they are the hardest to change, as they are manufactured to perform a specific task. These custom
chips are gaining popularity due to their size, performance per watt, and IP protection, because the IP burned
into the ASIC accelerator makes it much harder to reverse engineer proprietary algorithms.
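If the model is exported to ONNX (discussed later in this series), one hedged way to target whichever of the accelerators above is present is to let ONNX Runtime pick an execution provider at startup. The provider names and model path below are assumptions; availability depends on which ONNX Runtime build is installed on the device.

```python
import onnxruntime as ort

# Prefer GPU- or OpenVINO-backed acceleration when present, otherwise fall back to CPU.
available = ort.get_available_providers()
preferred = [p for p in ("CUDAExecutionProvider",      # NVIDIA GPU
                         "OpenVINOExecutionProvider",  # Intel CPU/iGPU/VPU
                         "CPUExecutionProvider")       # general-purpose fallback
             if p in available]

session = ort.InferenceSession("vision_model.onnx", providers=preferred)
print("Running with:", session.get_providers())
```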

Next steps
Proceed to learn what considerations go into place for Machine learning and data science in Azure IoT Edge Vision.
Machine learning and data science in Azure IoT Edge
Vision

The process of designing the machine learning (ML) approach for a vision on the edge scenario is one of the
biggest challenges in the entire planning process. It is important to understand how to consider and think about
ML in the context of edge devices.
To begin using machine learning to address business problems and pain points, consider the following points:
Always consider first how to solve the problem without ML or with a simple ML algorithm.
Have a plan to test several ML architectures as they will have different capacities to "learn".
Have a system in place to collect new data from the device to retrain an ML model.
For poorly performing ML models, often a simple fix is to add more representative data to the training process
and ensure it has variability with all classes represented equally.
Remember, this is often an iterative process with both the choice of data and choice of architecture being
updated in the exploratory phase.
It is not an easy space, and for some it is a very new way of thinking. It is a data-driven process. Careful planning
will be critical to successful results, especially on very constrained devices.
It is always critical to clearly define the problem to be solved as the data science and machine learning approach
will depend upon this. It is also very important to consider what type of data will be encountered in the edge
scenario as this will determine the kind of ML algorithm that should be used.
Even at the start, before training any models, real world data collection and examination will help this process
greatly and new ideas could even arise. This article will discuss data considerations in detail. Of course, the
equipment itself will help determine the ML approach with regard to device attributes like limited memory,
compute, and/or power consumption limits.
Fortunately, data science and machine learning are iterative processes, so if the ML model has poor performance,
there are many ways to address issues through experimentation. This article will also discuss considerations
around ML architecture choices. Often, there will be some trial and error involved as well.

Machine learning data


Both the source(s) and attributes of data will dictate how the intelligent edge system is built. For vision, it could be
images, videos, or even LiDAR, as the streaming signal. Regardless of the signal, when training an ML model and
using it to score new data (called inferencing), domain knowledge will be required. This includes experience in
designing and using ML algorithms or neural network architectures and expertise deploying them to the
specialized hardware. Below are a few considerations related to ML. However, it is recommended to gain some
deeper knowledge in order to open up more possibilities or find an ML expert with edge experience to help with
the project.
Collecting and using a balanced dataset is critical, and it should equally represent all classes or categories. When
the ML model is trained on a dataset, generally that dataset has been split into train, validate, and test subsets. The
purpose of these subsets is as follows:
The training dataset is used for the actual model training over many passes or iterations (often called epochs).
Throughout the training process, the model is spot-checked for how well it is doing on the validation dataset.
After a model is done training, the final step is to pass the test dataset through it and assess how well it did as a
proxy for the real world.
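As a minimal sketch of producing such a split, the following assumes scikit-learn and a simple in-memory index of image paths and labels (the paths and class names are illustrative only).

```python
from sklearn.model_selection import train_test_split

# Illustrative dataset index; in practice these come from your labeling store.
image_paths = [f"images/frame_{i:04d}.jpg" for i in range(1000)]
labels = ["ok" if i % 2 == 0 else "defect" for i in range(1000)]

# Hold out 30%, then split that portion evenly into validation and test sets,
# giving roughly 70% train / 15% validation / 15% test, stratified by class.
train_x, rest_x, train_y, rest_y = train_test_split(
    image_paths, labels, test_size=0.30, stratify=labels, random_state=42)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.50, stratify=rest_y, random_state=42)
```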

NOTE
Be wary of optimizing for the test dataset, in addition to the training dataset, once one test has been run. It might be good
to have a few different test datasets available.

The good news is that, with deep learning, costly and onerous feature engineering, featurization, and
preprocessing can often be avoided, because deep learning finds signal in noise better than traditional ML.
However, in deep learning, transformations may still be used to clean or reformat data for model input during
training as well as inference. The same transformations need to be applied both in training and when the model
is scoring new data.
When advanced preprocessing is used such as denoising, adjusting brightness or contrast, or transformations like
RGB to HSV, it must be noted that this can dramatically change the model performance for the better or,
sometimes, for the worse. In general, it is part of the data science exploration process and sometimes it is
something that must be observed once the device and other components are placed in a real-world location.
After the hardware is installed into its permanent location, the incoming data stream should be monitored for data
drift.
Data drift is the deviation of incoming data from the data the model was originally trained on. Data drift will
often result in a degradation in model performance (like accuracy), although it is not the only cause of decreased
performance (hardware or camera failure are other examples).
There should be an allowance for data drift testing in the system. This new data should also be collected for
another round of training. The more representative data collected for training, the better the model will perform in
almost all cases. So, preparing for this kind of collection is always a good idea.
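A simple, hedged sketch of such a drift check is shown below: it compares the brightness distribution of recently captured frames against a reference sample taken at deployment time, using a two-sample Kolmogorov-Smirnov test. The feature (mean brightness) and threshold are illustrative assumptions, not a prescribed method.

```python
import numpy as np
from scipy.stats import ks_2samp

def mean_brightness(frames):
    # frames: iterable of grayscale images as 2-D numpy arrays
    return np.array([float(f.mean()) for f in frames])

def drift_detected(reference_frames, recent_frames, p_threshold=0.01):
    # A low p-value suggests the two brightness distributions differ,
    # which is a cue to collect new data and consider retraining.
    _, p_value = ks_2samp(mean_brightness(reference_frames),
                          mean_brightness(recent_frames))
    return p_value < p_threshold
```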
In addition to using data for training and inference, new data coming from the device could be used to monitor the
device, camera or other components for hardware degradation.
In summary, here are the key considerations:
Always use a balanced dataset with all classes represented equally.
The more representative data used to train a model, the better.
Have a system in place to collect new data from device to retrain.
Have a system in place to test for data drift.
Only run a test set through a new ML model once. If you iterate and retest on the same test set, this could cause
overfitting to the test set in addition to the training set.

Machine learning architecture choices


Machine learning (ML) architecture is the layout of the mathematical operations that process input into the desired
and actionable output. For instance, in deep learning this would be the number of layers and neurons in each layer
of a deep neural network as well as their arrangement. It is important to note that there is no guarantee that the
performance metric goal (such as, high enough accuracy) for any given ML architecture will be achieved. To
mitigate this, several different architectures should be considered. Often, two or three different architectures are
tried before a choice is made. Remember that this is often an iterative process; both the choice of data and the
choice of architecture may be updated in the exploratory phase of the development process.
It helps to understand the issues that can arise when training an ML model that may only be seen after training or,
even at the point of inferencing on device. Overfitting and underfitting are some of the common issues found
during the training and testing process.
Overfitting - Overfitting can give a false sense of success because the performance metric (like accuracy)
might be very good when the input data looks like the training data. However, overfitting can occur when
the model fits to the training data too closely and cannot generalize well to new data. For example, it may
become apparent that the model only performs well indoors because the training data was from an indoor
setting.
Overfitting can be caused by the following issues:
The model learned to focus on incorrect, non-representative features specifically found in the training
dataset.
The model architecture may have too many learnable parameters, correlated to the number of layers in
a neural network and units per layer. A model's memorization capacity is determined by the number of
learnable parameters.
Not enough complexity or variation is found in the training data.
The model is trained over too many iterations.
There may be other reasons for good performance in training and significantly worse performance in
validation and testing, which are out of scope for this article.
Underfitting - Underfitting happens when the model has generalized so much that it cannot tell the
difference between classes with confidence. For example, the training loss will remain unacceptably high.
Underfitting can be caused by the following issues:
Not enough samples available in the training data.
The model is trained for too few iterations; in other words, it is too generalized.
Other reasons relate to the model not being able to recognize any objects, or to poor recognition and high
loss values during training. (The loss values that direct the training process are used in optimization and
weight updates.)
There is a trade-off between too much capacity (such as, a large network or a large number of learnable
parameters) and too little capacity. Transfer learning happens when some network layers are set as not trainable,
or frozen. In this situation, increasing capacity would equate to opening up more layers earlier in the network
versus only using the last few layers in training, with the rest remaining frozen.
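As a hedged sketch of this trade-off, the following Keras snippet freezes a pretrained backbone and trains only a small head; unfreezing more backbone layers later is how capacity would be increased. The backbone, input size, and class names are assumptions for illustration.

```python
import tensorflow as tf

# Freeze the pretrained backbone ("low capacity" end of the trade-off).
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),  # e.g., "ok" vs. "defect"
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# To increase capacity later, unfreeze the last few backbone layers:
# for layer in base.layers[-20:]:
#     layer.trainable = True
```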
There is no hard and fast rule for determining the number of layers in a deep neural network. So, sometimes
several model architectures must be evaluated within an ML task. However, in general, it is good to start with
fewer layers and/or parameters (that is, smaller networks) and gradually increase the complexity.
Considerations when choosing the best architecture include the inference speed requirements, along with an
assessment and acceptance of the speed-versus-accuracy tradeoff. Often, faster inference is associated with
lower performance; for example, accuracy, confidence, or precision could suffer.
A discussion around requirements for the ML training and inferencing will be necessary based upon the
considerations above and any company-specific requirements. For instance, if the company policy allows open-
source solutions to be utilized, it will open up a great deal of ML algorithmic possibilities as most cutting edge ML
work is in the open-source domain.
In summary, here are the key considerations:
Keep an eye out for overfitting and underfitting.
Testing several ML architectures is often a good idea. This is an iterative process.
There will be a trade-off between too much network capacity and too little. However, it is better to start with too
little and build up from there.
There will be a trade-off between speed and your performance metric such as accuracy.
If the performance of the ML model is acceptable, the exploratory phase is complete. This is important to note,
as one can be tempted to iterate indefinitely.
Data science workflows
The data science process for edge deployments has a general pattern. After a clear data-driven problem statement
is formulated, the next steps generally include those shown in the figure below.

Data collection - Data collection or acquisition could come from an online image search, a currently deployed
device, or another representative data source. Generally, the more data the better. In addition, the more
variability, the better the generalization.
Data labeling - If only hundreds of images need to be labeled, such as when using transfer learning, labeling
can be done in-house. If tens of thousands of images need to be labeled, a vendor could be enlisted for both
data collection and labeling.
Train a model with an ML framework - An ML framework such as TensorFlow or PyTorch (both with Python
and C++ APIs) will need to be chosen. This usually depends upon what code samples are available in open
source or in-house, as well as the experience of the ML practitioner. Azure ML may be used to train a model
using any ML framework and approach, as it is framework-agnostic, has Python and R bindings, and offers
many wrappers around popular frameworks.
Convert the model for inferencing on device - Almost always, a model will need to be converted to work
with a particular runtime. Model conversion usually involves advantageous optimizations, such as faster
inference and a smaller model footprint. This step differs for each ML framework and runtime. Open-source
interoperability frameworks such as ONNX and MMdnn are available; a conversion sketch follows this list.
Build the solution for device - The solution is usually built on the same type of device as will be used in the
final deployment, because the binary files created are system-specific.
Using a runtime, deploy solution to device - Once a runtime is chosen, usually in conjunction with the ML
framework choice, the compiled solution may be deployed. The Azure IoT Edge runtime is a Docker-based
system in which the ML runtimes may be deployed as containers.
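The conversion step might look like the following hedged sketch, which exports a PyTorch model to ONNX; the model, input shape, and tensor names are placeholder assumptions.

```python
import torch
import torchvision

# Export a pretrained PyTorch model to ONNX for an on-device runtime.
model = torchvision.models.mobilenet_v2(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)  # the shape the runtime will receive

torch.onnx.export(
    model,
    dummy_input,
    "vision_model.onnx",
    input_names=["image"],
    output_names=["scores"],
    opset_version=11,
)
```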
The diagram below shows a sample data science process where open-source tools may be leveraged for the data
science workflow. Data availability and type will drive most of the choices, including the devices/hardware chosen.

If a workflow already exists for the data scientists and app developers, a few other considerations may apply. First,
it is advised to have a code, model, and data versioning system in place. Secondly, an automation plan for code
and integration testing along with other aspects of the data science process, such as triggers, build/release
process, and so on, will help speed up the time to production and cultivate collaboration within the team.
The language of choice can help dictate what API or SDK is used for inferencing and training ML models. This in
turn will then dictate what type of ML model, what type(s) of device, what type of IoT Edge module, and so on,
need to be used. For example, PyTorch has a C++ API for inferencing and training, that works well in conjunction
with the OpenCV C++ API. If the app developer working on the deployment strategy is building a C++ application,
or has this experience, one might consider PyTorch or others (such as TensorFlow or CNTK) that have C++
inferencing APIs.

Machine learning and data science in a nutshell


In summary, here are the key considerations:
Converting models also involves optimizations such as faster inference and smaller model footprints, critical
for very resource-constrained devices.
The solution will usually need to be built on a build-dedicated device, the same type of device to which the
solution will be deployed.
The language and framework of choice will depend upon both the ML practitioner's experience as well as what
is available in open-source.
The runtime of choice will depend upon the availability of the device and hardware acceleration for ML.
It is important to have a code, model, and data versioning system.

Next steps
Proceed to Image storage and management in Azure IoT Edge Vision article to learn how to properly store the
images created by your IoT Edge Vision solution.
Image storage and management in Azure IoT Edge
Vision

Storage and management of the images involved in a computer vision application is a critical function.
Some of the key considerations for managing these images are:
Ability to store all raw images during training, with easy retrieval for labeling.
A fast storage medium, to avoid pipeline bottlenecks and data loss.
Storage on the edge as well as in the cloud, since labeling activity can be performed in both places.
Categorization of images for easy retrieval.
Naming and tagging of images to link them with inferred metadata.
The combination of Azure Blob Storage, Azure IoT Hub, and Azure IoT Edge allows several potential options for the
storage of image data, such as:
Use of the Azure IoT Edge Blob Storage module, which will automatically sync images to Azure Blob Storage
based on policy.
Storing images on the local host file system and uploading them to the Azure Blob service using a custom module.
Use of a local database to store images, which are then synced to the cloud database.

Typical storage workflow


The IoT Edge Blob Storage module is one of the most powerful and straightforward solutions, and our preferred
approach. A typical workflow using this module might be as follows:
1. Raw messages after ingestion are stored locally on the Edge Blob Storage module, with time stamps and
sequence numbers to uniquely identify the image files (a minimal upload sketch follows this list).
2. Policy is set on the Edge Blob module for automatic upload to Azure Blob Storage with ordering.
3. To conserve space on the Edge device, automatic deletion after a certain time is configured, along with the
retain-while-uploading option, to ensure all images get synced to the cloud.
4. Local categorization and labeling is implemented using a module that can read these images into the
UX. The label data is associated with the image URI, along with the coordinates and category.
5. As label data needs to be saved, a local database is preferred to store this metadata, as it allows easy lookup
for the UX and can be synced to the cloud using telemetry messages.
6. During a scoring run, the model detects matching patterns and generates events of interest. This metadata is
sent to the cloud via telemetry, referencing the image URI, and optionally stored in the local database for the
edge UX. The images continue to be stored in the Edge Blob module and synced with Azure Blob Storage.
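A minimal sketch of step 1 is shown below, assuming the Azure Blob Storage SDK for Python pointed at the local Edge Blob Storage module; the endpoint, account name, key, and container name are placeholders that must match the module's configuration.

```python
from datetime import datetime, timezone
from azure.storage.blob import BlobServiceClient

# Placeholder connection string for the local IoT Edge Blob Storage module.
CONNECTION_STRING = (
    "DefaultEndpointsProtocol=http;"
    "BlobEndpoint=http://edgeblobmodule:11002/localaccount;"
    "AccountName=localaccount;AccountKey=<local-account-key>"
)

def store_frame(jpeg_bytes: bytes, sequence: int) -> str:
    """Persist one captured frame with a timestamp and sequence number in its name."""
    client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    # Assumes the "raw-images" container has already been created on the module.
    container = client.get_container_client("raw-images")
    blob_name = f"{datetime.now(timezone.utc):%Y%m%dT%H%M%SZ}-{sequence:06d}.jpg"
    container.upload_blob(name=blob_name, data=jpeg_bytes)
    return blob_name
```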

Next steps
How you respond to alerts generated by the AI model is crucial. Learn more about this in Alert persistence in
Azure IoT Edge Vision.
Alert persistence in Azure IoT Edge Vision

In the context of vision on edge, an alert is a response to an event triggered by the AI model. In other words, it is
the inferencing results. The type of event is determined by the training imparted to the model. These events are
separate from operational events raised by the processing pipeline and any event related to the health of the
runtime.

Types of alerts
Some of the common alerts types are:
Image classification
Movement detection
Direction of movement
Object detection
Count of objects
Total count of objects over period of time
Average count of objects over period of time
Alerts need to be monitored because they drive certain actions. They are critical to operations, are time-sensitive
in terms of processing, and must be logged for audit and further analysis.

Persistence of alerts
The alerts need to persist locally on the edge where they are raised and then passed on to the cloud for further
processing and storage. This ensures a quick local response and avoids losing critical alerts due to any transient
failures.
Some options to achieve this persistence and cloud syncing are:
Utilize the built-in store and forward capability of the IoT Edge runtime, which automatically syncs with Azure IoT
Hub after connectivity is restored.
Persist alerts on the host file system as log files, which can be synced periodically to blob storage in the cloud.
Utilize the Azure Blob Storage on IoT Edge module, which will sync this data to Azure Blob Storage in the cloud
based on configurable policies.
Use a local database on IoT Edge, such as Azure SQL Edge, for storing data, and sync with Azure SQL Database
using SQL Data Sync. Another lightweight database option is SQLite.
The preferred option is to use the built-in store and forward capability of the IoT Edge runtime. This is more suitable
for alerts because of their time sensitivity, typically small message sizes, and ease of use.
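A hedged sketch of raising an alert from a custom IoT Edge module with the Python device SDK follows; the output name, routes, and alert fields are assumptions that would need to match your deployment manifest.

```python
import json
from azure.iot.device import IoTHubModuleClient, Message

# The Edge runtime queues this message locally (store and forward) and syncs
# it to IoT Hub when connectivity is available.
client = IoTHubModuleClient.create_from_edge_environment()
client.connect()

alert = {
    "type": "objectDetection",
    "label": "person",
    "confidence": 0.93,
    "imageUri": "raw-images/20201102T101500Z-000042.jpg",
}
message = Message(json.dumps(alert))
message.content_type = "application/json"
message.content_encoding = "utf-8"
client.send_message_to_output(message, "alerts")  # "alerts" must match a route
```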

Next steps
Now you can proceed to work on the User interface in Azure IoT Edge Vision for your vision workload.
User interface in Azure IoT Edge Vision

The user interface requirements of an IoT solution will vary depending on the overall objectives. Four types of user
interfaces are commonly found in IoT solutions:
Administrator : Allows full access to device provisioning, device and solution configuration, user management,
and so on. These features could be provided as part of one solution or as separate solutions.
Consumer : Applies only to consumer solutions. It provides access similar to the operator's interface,
but limited to the devices owned by the user.
Operator : Provides centralized access to the operational components of the solution. It typically includes
device management, alerts monitoring, and configuration.
Analytics: An interactive dashboard that provides visualization of telemetry and other data analyses.
This article focuses on a simple operator’s user interface and visualization dashboard.

Technology options
Power BI - Power BI is a compelling option for analytics and visualization needs. It provides powerful
features to create customizable interactive dashboards. It also allows connectivity to many popular database
systems and services. It is available as a managed service and as a self-hosted package; the former is the
most popular and recommended option. With Power BI embedded in your solution, you can add
customer-facing reports, dashboards, and analytics in your own applications by using and branding Power
BI as your own. You can reduce developer resources by automating the monitoring, management, and
deployment of analytics, while getting full control of Power BI features and intelligent analytics.
Azure Maps - Another suitable technology for IoT visualizations is Azure Maps which allows you to create
location-aware web and mobile applications using simple and secure geospatial services, APIs, and SDKs in
Azure. You can deliver seamless experiences based on geospatial data with built-in location intelligence
from world-class mobility technology partners.
Azure App Service - Azure App Service is a managed platform with powerful capabilities for building web
and mobile apps for many platforms and mobile devices. It allows developers to quickly build, deploy, and
scale web apps created with popular frameworks, such as .NET, .NET Core, Node.js, Java, PHP, Ruby, or
Python, in containers or running on any supported operating system. You can also meet rigorous,
enterprise-grade performance, security, and compliance requirements by using the fully managed platform
for your operational and monitoring tasks.
Azure SignalR Service - For real-time data reporting, Azure SignalR Service makes adding real-time
communications to your web application as simple as provisioning a service. In-depth real-time
communications expertise is not required. It easily integrates with services such as Azure Functions, Azure
Active Directory, Azure Storage, Azure App Service, Azure Analytics, Power BI, Azure IoT, Azure Cognitive
Services, Azure Machine Learning, and more.
To secure your user interface solutions, the Azure Active Directory (Azure AD) enterprise identity service provides
single sign-on and multi-factor authentication.
Now let's learn how to build the user interface for some common scenarios.

Scenario 1
Contoso Boards produces high-quality circuit boards used in computers. Their number one product is a
motherboard. Lately, they have been seeing an increase in issues with chip placement on the board. Through their
investigation, they have noticed that the circuit boards are getting placed incorrectly on the assembly line. They
need a way to identify if the circuit board is placed on the assembly line correctly. The data scientists at Contoso
Boards are most familiar with TensorFlow and would like to continue using it as their primary ML model structure.
Contoso Boards has several assembly lines that produce these motherboards. They would also like to centralize
the management of the entire solution.
Considerations in this scenario
Contoso Boards can ask themselves questions such as the following:
What are we analyzing?
Motherboard
Where are we going to view the motherboard from?
Assembly Line Conveyor belt
What camera do we need?
Area or line scan
Color or monochrome
CCD or CMOS sensor
Global or rolling shutter
Frame rate
Resolution
What type of lighting is needed?
Backlighting
Shade
Darkfield
How should the camera be mounted?
Top down
Side view
Angular
What hardware should be used?
CPU
FPGA
GPU
ASIC
Solution
To find a solution that works for Contoso Boards, let's focus on edge detection. We need to position a camera
directly above, at 90 degrees and about 16 inches above, the edge of the part. Since the conveyor system moves
relatively slowly, we can use an area scan camera with a global shutter. For this use case, the camera should
capture about 30 frames per second. The resolution can be found using the formula
Resolution = (object size) / (smallest detail to be captured). Based on this formula, Res = 16"/8" gives 2 MP in x
and 4 in y, so we need a camera capable of 4 MP. As for the sensor type, the part is not fast moving and we are
really looking for edge detection, so a CMOS sensor is better suited. One of the more critical aspects of any vision
workload is lighting. In this application, Contoso Boards should choose a white diffused filter backlight. This will
make the part look almost black and provide a high amount of contrast for edge detection. When it comes to
color options for this application, black and white is better, as it will yield the sharpest edges for the AI detection
model. The data scientists are most familiar with TensorFlow, so learning ONNX or other frameworks would slow
down model development. Also, because several assembly lines will use this solution and Contoso Boards would
like a centrally managed edge solution, Azure Stack Edge (with the GPU option) would work well here. Based on
the workload, the fact that Contoso Boards already knows TensorFlow, and the fact that this solution will be used
on multiple assembly lines, GPU-based hardware is the choice for hardware acceleration.
The following figure shows a sample of what the camera would see in this scenario:

Scenario 2
Contoso Shipping recently has had several pedestrian accidents at their loading docks. Most of the accidents are
happening when a truck leaves the loading dock, and the driver does not see a dock worker walking in front of the
truck. Contoso Shipping would like a solution that would watch for people, predict the direction of travel, and warn
the drivers of the potential danger of hitting the workers. The distance from the cameras to Contoso Shipping's
server room is too far for GigE connectivity; however, they have a large Wi-Fi mesh that could be used. Most of
the data scientists at Contoso Shipping are familiar with OpenVINO, and they would like to be able to reuse the
models on additional hardware in the future. The solution will also need to ensure that devices operate as
power-efficiently as possible. Finally, Contoso Shipping needs a way to manage the solution remotely for updates.
Considerations in this scenario
Contoso Shipping can introspect by asking the following questions:
What are we analyzing?
People and patterns of movement
Where are we going to view the people from?
The loading docks are 165 feet long.
Cameras will be placed 17 feet high to comply with city ordinances.
Cameras will need to be positioned 100 feet away from the front of the trucks.
Camera focus will need to be 10 feet behind the front of the truck, and 10 additional feet in front of the
truck, giving a 20-foot depth of focus.
What camera do we need?
Area or line scan
Color or monochrome
CCD or CMOS sensor
Global or rolling shutter
Frame rate
Resolution
What type of lighting is needed?
Backlighting
Shade
Darkfield
What hardware should be used?
CPU
FPGA
GPU
ASIC
How should the camera be mounted?
Top down
Side view
Angular
Solution
Based on the size of the loading dock, Contoso Shipping will require several cameras to cover the entire
dock. The zoning laws that Contoso Shipping must adhere to require that surveillance cameras cannot be
mounted higher than 20 feet. In this use case, the average height of a worker is 5 feet 8 inches. The solution must
use the fewest cameras possible.
Formula for field of view (FOV):

For example, look at the following images.


This image is taken with 480 horizontal pixels at 20 feet:
This image is taken with 5184 horizontal pixels at 20 feet:

The red square is shown to illustrate one pixel color.


These images demonstrate the issue with using the wrong resolution camera for a given use case. The lens can
impact the FOV; however, if the wrong sensor is used for a given use case, the results could be less than expected.
With the above in mind, when choosing a camera for the solution required for Contoso Shipping, we need to think
about how many cameras, and at what resolution, are needed to capture enough detail to detect a
person. Since we are only trying to identify whether a person is in the frame or not, our PPF (pixels per foot) does
not need to be around 80, which is needed for facial identification; we can use somewhere around 15-20. That
would place the FOV around 16 feet. A 16-foot FOV would give us about 17.5 pixels per foot, which fits within our
required PPF of 15-20. This would mean that we need a 10MP camera that has a horizontal resolution of
~5184 pixels, and a lens that allows an FOV of 16 feet. The cameras would need to be placed outside, and the
choice of sensor type should not allow for bloom. Bloom occurs when light overloads the sensor, causing an
effect similar to overexposure or a whiteout. CMOS is the sensor of choice here.
Contoso operates 24x7 and, as such, needs to ensure that nighttime personnel are also protected. Monochrome
handles low-light conditions much better than color, we are not looking to identify a person based on color, and
monochrome sensors are a little cheaper than color sensors.
How many cameras will it take? Since we have figured out that each camera can cover a 16-foot path, we can do
simple math: a 165-foot dock divided by a 16-foot FOV gives 10.3125 cameras, so the solution would need 11
monochrome, 5184-horizontal-pixel (10MP) CMOS cameras with IPX67 housings or weather boxes.
The cameras would be mounted on 11 poles, 100 feet from the trucks, at 17 feet high. Because the data
scientists are more familiar with OpenVINO, data models should be built in ONNX. As for what hardware
should be used, a device that can connect over Wi-Fi and use as little power as possible is required. Based on
this, they should look to an FPGA processor. Potentially an ASIC processor could also be utilized, but due to the
nature of how an ASIC works, it would not meet the requirement of being able to reuse the models on
different hardware in the future.
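For reference, the camera-count arithmetic above can be reproduced with a few lines (the dock length and per-camera FOV are the values stated in this scenario).

```python
import math

dock_length_ft = 165       # length of the loading dock
fov_per_camera_ft = 16     # horizontal field of view per camera

cameras_needed = math.ceil(dock_length_ft / fov_per_camera_ft)
print(cameras_needed)      # 165 / 16 = 10.3125, rounded up to 11 cameras
```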

Next steps
This series of articles has demonstrated how to build a complete vision workload using Azure IoT Edge devices.
For further information, you may refer to the following product documentation:
Azure IoT Edge documentation
Tutorial: Perform image classification at the edge with Custom Vision Service
Azure Machine Learning documentation
Azure Kinect DK documentation
MMdnn tool
ONNX
Azure IoT reference architecture

This reference architecture shows a recommended architecture for IoT applications on Azure using PaaS
(platform-as-a-service) components.

IoT applications can be described as things (devices) sending data that generates insights . These insights
generate actions to improve a business or process. An example is an engine (the thing) sending temperature
data. This data is used to evaluate whether the engine is performing as expected (the insight). The insight is used
to proactively prioritize the maintenance schedule for the engine (the action).
This reference architecture uses Azure PaaS (platform-as-a-service) components. Another recommended option
for building IoT solutions on Azure is:
Azure IoT Central. IoT Central is a fully managed SaaS (software-as-a-service) solution. It abstracts the technical
choices and lets you focus on your solution exclusively. This simplicity comes with a tradeoff in being less
customizable than a PaaS-based solution.
At a high level, there are two ways to process telemetry data, hot path and cold path. The difference has to do with
requirements for latency and data access.
The hot path analyzes data in near-real-time, as it arrives. In the hot path, telemetry must be processed with
very low latency. The hot path is typically implemented using a stream processing engine. The output may
trigger an alert, or be written to a structured format that can be queried using analytical tools.
The cold path performs batch processing at longer intervals (hourly or daily). The cold path typically operates
over large volumes of data, but the results don't need to be as timely as the hot path. In the cold path, raw
telemetry is captured and then fed into a batch process.

Architecture
This architecture consists of the following components. Some applications may not require every component
listed here.
IoT devices . Devices can securely register with the cloud, and can connect to the cloud to send and receive data.
Some devices may be edge devices that perform some data processing on the device itself or in a field gateway.
We recommend Azure IoT Edge for edge processing.
Cloud gateway . A cloud gateway provides a cloud hub for devices to connect securely to the cloud and send
data. It also provides device management capabilities, including command and control of devices. For the cloud
gateway, we recommend IoT Hub. IoT Hub is a hosted cloud service that ingests events from devices, acting as a
message broker between devices and backend services. IoT Hub provides secure connectivity, event ingestion,
bidirectional communication, and device management.
Device provisioning. For registering and connecting large sets of devices, we recommend using the IoT Hub
Device Provisioning Service (DPS). DPS lets you assign and register devices to specific Azure IoT Hub endpoints at
scale.
Stream processing . Stream processing analyzes large streams of data records and evaluates rules for those
streams. For stream processing, we recommend Azure Stream Analytics. Stream Analytics can execute complex
analysis at scale, using time windowing functions, stream aggregations, and external data source joins. Another
option is Apache Spark on Azure Databricks.
Machine learning allows predictive algorithms to be executed over historical telemetry data, enabling scenarios
such as predictive maintenance. For machine learning, we recommend Azure Machine Learning.
Warm path storage holds data that must be available immediately from the device for reporting and visualization.
For warm path storage, we recommend Cosmos DB. Cosmos DB is a globally distributed, multi-model database.
Cold path storage holds data that is kept longer-term and is used for batch processing. For cold path storage,
we recommend Azure Blob Storage. Data can be archived in Blob storage indefinitely at low cost, and is easily
accessible for batch processing.
Data transformation manipulates or aggregates the telemetry stream. Examples include protocol
transformation, such as converting binary data to JSON, or combining data points. If the data must be
transformed before reaching IoT Hub, we recommend using a protocol gateway (not shown). Otherwise, data can
be transformed after it reaches IoT Hub. In that case, we recommend using Azure Functions, which has built-in
integration with IoT Hub, Cosmos DB, and Blob Storage.
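As a hedged sketch of that transformation path, the following assumes the Azure Functions Python programming model with an IoT Hub (Event Hubs-compatible) trigger and a Cosmos DB output binding declared in function.json; the payload format is an assumption.

```python
import json
import azure.functions as func

def main(event: func.EventHubEvent, doc: func.Out[func.Document]) -> None:
    # The device is assumed to send UTF-8 JSON; binary payloads would be
    # decoded and converted here instead.
    telemetry = json.loads(event.get_body().decode("utf-8"))
    telemetry["enqueuedTimeUtc"] = str(event.enqueued_time)

    # Write the transformed record to warm storage via the Cosmos DB binding.
    doc.set(func.Document.from_dict(telemetry))
```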
Business process integration performs actions based on insights from the device data. This could include
storing informational messages, raising alarms, sending email or SMS messages, or integrating with CRM. We
recommend using Azure Logic Apps for business process integration.
User management restricts which users or groups can perform actions on devices, such as upgrading firmware.
It also defines capabilities for users in applications. We recommend using Azure Active Directory to authenticate
and authorize users.
Security monitoring. Azure Security Center for IoT provides an end-to-end security solution for IoT workloads
and simplifies their protection by delivering unified visibility and control, adaptive threat prevention, and
intelligent threat detection and response across workloads, from leaf devices through the Edge and up through
the cloud.

Scalability considerations
An IoT application should be built as discrete services that can scale independently. Consider the following
scalability points:
IoT Hub . For IoT Hub, consider the following scale factors:
The maximum daily quota of messages into IoT Hub.
The quota of connected devices in an IoT Hub instance.
Ingestion throughput — how quickly IoT Hub can ingest messages.
Processing throughput — how quickly the incoming messages are processed.
Each IoT hub is provisioned with a certain number of units in a specific tier. The tier and number of units
determine the maximum daily quota of messages that devices can send to the hub. For more information, see IoT
Hub quotas and throttling. You can scale up a hub without interrupting existing operations.
Stream Analytics . Stream Analytics jobs scale best if they are parallel at all points in the Stream Analytics
pipeline, from input to query to output. A fully parallel job allows Stream Analytics to split the work across
multiple compute nodes. Otherwise, Stream Analytics has to combine the stream data into one place. For more
information, see Leverage query parallelization in Azure Stream Analytics.
IoT Hub automatically partitions device messages based on the device ID. All of the messages from a particular
device will always arrive on the same partition, but a single partition will have messages from multiple devices.
Therefore, the unit of parallelization is the partition ID.
Functions . When reading from the Event Hubs endpoint, there is a maximum of one function instance per event hub
partition. The maximum processing rate is determined by how fast one function instance can process the events
from a single partition. The function should process messages in batches.
Cosmos DB . To scale out a Cosmos DB collection, create the collection with a partition key and include the
partition key in each document that you write. For more information, see Best practices when choosing a partition
key.
If you store and update a single document per device, the device ID is a good partition key. Writes are evenly
distributed across the keys. The size of each partition is strictly bounded, because there is a single document
for each key value.
If you store a separate document for every device message, using the device ID as a partition key would
quickly exceed the 10-GB limit per partition. Message ID is a better partition key in that case. Typically you
would still include device ID in the document for indexing and querying.
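A minimal sketch of creating such a container with the device ID as the partition key is shown below, using the azure-cosmos Python SDK; the endpoint, key, and resource names are placeholders.

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/",
                      credential="<account-key>")
database = client.create_database_if_not_exists("iot-warm-path")

# Partition on device ID: one document per device keeps each partition bounded.
container = database.create_container_if_not_exists(
    id="device-state",
    partition_key=PartitionKey(path="/deviceId"),
)

# Every document written must carry the partition key value.
container.upsert_item({"id": "engine-001", "deviceId": "engine-001", "temperature": 74.2})
```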
Azure Time Series Insights (TSI) is an analytics, storage, and visualization service for time-series data. It
provides capabilities including SQL-like filtering and aggregation, alleviating the need for user-defined functions.
Time Series Insights provides a data explorer to visualize and query data, as well as REST query APIs. In addition to
time-series data, TSI is also well suited for solutions that need to query aggregates over large sets of data. With its
support for multilayered storage, rich APIs and model, integration with the Azure IoT ecosystem, explorer for
visualizations, and extensibility through Power BI, TSI is our recommendation for time-series data storage and
analytics.

Security considerations
Trustworthy and secure communication
All information received from and sent to a device must be trustworthy. Unless a device can support the following
cryptographic capabilities, it should be constrained to local networks and all internetwork communication should
go through a field gateway:
Data encryption with a provably secure, publicly analyzed, and broadly implemented symmetric-key
encryption algorithm.
Digital signature with a provably secure, publicly analyzed, and broadly implemented symmetric-key signature
algorithm.
Support for either TLS 1.2 for TCP or other stream-based communication paths or DTLS 1.2 for datagram-
based communication paths. Support of X.509 certificate handling is optional and can be replaced by the more
compute-efficient and wire-efficient pre-shared key mode for TLS, which can be implemented with support for
the AES and SHA-2 algorithms.
Updateable key-store and per-device keys. Each device must have unique key material or tokens that identify it
toward the system. The devices should store the key securely on the device (for example, using a secure key-
store). The device should be able to update the keys or tokens periodically, or reactively in emergency
situations such as a system breach.
The firmware and application software on the device must allow for updates to enable the repair of discovered
security vulnerabilities.
However, many devices are too constrained to support these requirements. In that case, a field gateway should be
used. Devices connect securely to the field gateway through a local area network, and the gateway enables secure
communication to the cloud.
Physical tamper-proofing
It is strongly recommended that the device design incorporates features that defend against physical manipulation
attempts, to help ensure the security, integrity, and trustworthiness of the overall system.
For example:
Choose microcontrollers/microprocessors or auxiliary hardware that provides secure storage and use of
cryptographic key material, such as trusted platform module (TPM) integration.
Secure boot loader and secure software loading, anchored in the TPM.
Use sensors to detect intrusion attempts and attempts to manipulate the device environment with alerting and
potentially "digital self-destruction" of the device.
For additional security considerations, see Internet of Things (IoT) security architecture.
Monitoring and logging
Logging and monitoring systems are used to determine whether the solution is functioning and to help
troubleshoot problems. Monitoring and logging systems help answer the following operational questions:
Are devices or systems in an error condition?
Are devices or systems correctly configured?
Are devices or systems generating accurate data?
Are systems meeting the expectations of both the business and end customers?
Logging and monitoring tools typically comprise the following four components:
System performance and timeline visualization tools to monitor the system and for basic troubleshooting.
Buffered data ingestion, to buffer log data.
Persistence store to store log data.
Search and query capabilities, to view log data for use in detailed troubleshooting.
Monitoring systems provide insights into the health, security, stability, and performance of an IoT solution.
These systems can also provide a more detailed view, recording component configuration changes and providing
extracted logging data that can surface potential security vulnerabilities, enhance the incident management
process, and help the owner of the system troubleshoot problems. Comprehensive monitoring solutions include
the ability to query information for specific subsystems or aggregating across multiple subsystems.
Monitoring system development should begin by defining healthy operation, regulatory compliance, and audit
requirements. Metrics collected may include:
Physical devices, edge devices, and infrastructure components reporting configuration changes.
Applications reporting configuration changes, security audit logs, request rates, response times, error rates,
and garbage collection statistics for managed languages.
Databases, persistence stores, and caches reporting query and write performance, schema changes, security
audit log, locks or deadlocks, index performance, CPU, memory, and disk usage.
Managed services (IaaS, PaaS, SaaS, and FaaS) reporting health metrics and configuration changes that impact
dependent system health and performance.
Visualization of monitoring metrics alert operators to system instabilities and facilitate incident response.
Tracing telemetry
Tracing telemetry allows an operator to follow the journey of a piece of telemetry from creation through the
system. Tracing is important for debugging and troubleshooting. For IoT solutions that use Azure IoT Hub and the
IoT Hub Device SDKs, tracing datagrams can be originated as Cloud-to-Device messages and included in the
telemetry stream.
Logging
Logging systems are integral in understanding what actions or activities a solution has performed, failures that
have occurred, and can provide help in fixing those failures. Logs can be analyzed to help understand and remedy
error conditions, enhance performance characteristics, and ensure compliance with governing rule and
regulations.
Though plain-text logging has a lower impact on upfront development costs, it is more challenging for a machine to
parse and read. We recommend using structured logging, as the collected information is both machine parsable and
human readable. Structured logging adds situational context and metadata to the log information. In structured
logging, properties are first-class citizens formatted as key/value pairs, or with a fixed schema, to enhance search
and query capabilities.

DevOps considerations
Use infrastructure as code (IaC). IaC is the management of infrastructure (networks, virtual machines, load
balancers, and connection topology) with a declarative approach. Templates should be versioned and part of the
release pipeline. The most reliable deployment processes are automated and idempotent. One way is to create an
Azure Resource Manager template for provisioning the IoT resources and the infrastructure.
To automate infrastructure deployment, you can use Azure DevOps Services, Jenkins, or other CI/CD solutions.
Azure Pipelines is part of Azure DevOps Services and runs automated builds, tests, and deployments.
Consider staging your workloads by deploying to various stages and running validations at each stage before
moving on to the next one; that way you can push updates to your production environments in a highly controlled
way and minimize unanticipated deployment issues. Blue-green deployment and Canary releases are
recommended deployment strategies for updating live production environments. Also consider having a good
rollback strategy for when a deployment fails; for example, you could automatically redeploy an earlier, successful
deployment from your deployment history. The --rollback-on-error flag parameter in Azure CLI is a good example.
Consider monitoring your solution by using Azure Monitor. Azure Monitor is the main source of monitoring and
logging for all your Azure services; it provides diagnostics information for Azure resources. For example, you can
monitor the operations that take place within your IoT hub. There are specific metrics and events that Azure
Monitor supports, as well as services, schemas, and categories for Azure Diagnostic Logs.
For more information, see the DevOps section in Microsoft Azure Well-Architected Framework.

Cost considerations
In general, use the Azure pricing calculator to estimate costs. Other considerations are described in the Cost
section in Microsoft Azure Well-Architected Framework.
There are ways to optimize costs associated with the services used in this reference architecture.
Azure IoT Hub
In this architecture, IoT Hub is the cloud gateway that ingests events from devices. IoT Hub billing varies
depending on the type of operation. Create, update, insert, and delete operations are free. Successful operations such as device-
to-cloud and cloud-to-device messages are charged.
Device-to-cloud messages sent successfully are charged in 4-KB chunks on ingress into IoT Hub. For example, a 6-
KB message is charged as two messages.
IoT Hub maintains state information about each connected device in a device twin JSON document. Read
operations from a device twin document are charged.
IoT Hub offers two tiers: Basic and Standard .
Consider using the Standard tier if your IoT architecture uses bi-directional communication capabilities. This tier
also offers a free edition that is most suited for testing purposes.
If you only need uni-directional communication from devices to the cloud, use the Basic tier, which is cheaper.
For more information, see IoT Hub Pricing.
Azure Stream Analytics
Azure Stream Analytics is used for stream processing and rules evaluation. Azure Stream Analytics is priced by the
number of Streaming Units (SU) per hour, which takes into account the compute, memory, and throughput required to process
the data. Azure Stream Analytics on IoT Edge is billed per job. Billing starts when a Stream Analytics job is
deployed to devices, regardless of the job status: running, failed, or stopped.
For more information about pricing, see Stream Analytics pricing.
Azure Functions
Azure Functions is used to transform data after it reaches the IoT Hub. From a cost perspective, the
recommendation is to use the Consumption plan, because you pay only for the compute resources you use. You are
charged based on per-second resource consumption each time an event triggers the execution of the function.
Processing several events in a single execution (batching) can reduce cost.
Azure Logic Apps
In this architecture, Logic Apps is used for business process integration.
Logic apps pricing works on the pay-as-you-go model. Triggers, actions, and connector executions are metered
each time a logic app runs. All successful and unsuccessful actions, including triggers, are considered as
executions.
For instance, if your logic app processes 1,000 messages a day, a workflow of five actions will cost less than $6.
For more information, see Logic Apps pricing.
Data Storage
For cold path storage, Azure Blob Storage is the most cost-effective option.
For warm path storage, consider using Azure Cosmos DB. For more information, see Cosmos DB pricing.

Next steps
For a more detailed discussion of the recommended architecture and implementation choices, see the Microsoft Azure IoT Reference Architecture (PDF).

For detailed documentation of the various Azure IoT services, see Azure IoT Fundamentals.
A sample IoT implementation is available on GitHub.
Condition Monitoring for Industrial IoT

This example scenario demonstrates how end manufacturers can connect their assets to the cloud using OPC UA
(Open Platform Communications Unified Architecture) and the Industrial Components. They can monitor their
equipment's key parameters to discover anomalies before they become critical issues. OPC UA is a platform-
independent and service-oriented interoperability standard for a secure and reliable data exchange. OPC UA is
used by various industrial systems and devices such as industry PCs, PLCs, and sensors. It's a standard that is
driven by the OPC Foundation.

Architecture

Data Flow
The data flows through the solution as follows:
1. Industrial devices that can natively communicate OPC UA can directly connect to IoT Edge. IoT Edge is the
compute power that sits on your on-premises network. It’s the runtime environment of the Industrial Modules
(OPC Publisher, OPC Twin, and Discovery Module). Modules are containers that run Azure services, 3rd party
services, or your own code. The OPC Publisher module connects to OPC UA servers and publishes OPC UA
telemetry data to Azure IoT Hub. OPC Twin creates a digital twin of an OPC UA server in the cloud and provides
OPC UA browse/read/write/method call capabilities via a cloud-based REST (Representational State Transfer)
interface. The Discovery module provides discovery services on the edge, which include OPC UA server
discovery.
2. Industrial devices that can’t communicate through OPC UA need a 3rd party PLC adapter to connect to IoT Edge.
Adapters are obtainable as modules in the Azure Marketplace.
3. The 3rd party PLC adapters enable connectivity between the devices and IoT Edge.
4. For analytical capabilities closer to where the data originates, there are modules like Machine Learning on Edge
or Functions obtainable from the Azure Marketplace, allowing low latency and operation in disconnected state.
5. Azure IoT Hub connects the devices virtually to the cloud for further data processing. It enables a security-
enhanced bidirectional communication between IoT applications and devices.
6. The Industrial Services are made up of several microservices exposing a REST API. All Industrial Services are
deployed to an Azure Kubernetes Service cluster. They implement business logic and functionality for discovery,
registration, remote control, and post-processing telemetry of industrial devices. The REST APIs can be used in
any programming language and framework that can call an HTTP endpoint.
7. Azure Event Hubs transforms and stores the data. It provides a distributed stream processing platform with low
latency and seamless integration.
There are three predominant use cases in which the data provided by the Industrial Services is used:
8. Alternative 1: Store and analyze the data using Azure Time Series Insights (TSI). The telemetry processor in the
Industrial IoT platform forwards contextualized samples to TSI and other consumers.
9. The Time Series Insights Explorer is a web application you can use to visualize the telemetry.
10. Alternative 2: After the Industrial Services process the data, Azure Data Lake stores and further analyzes the
data. Azure Data Lake is a massively scalable data lake with enterprise-grade security and auditing, which allows
batch, stream and interactive analytic programs to run with simplicity. Azure Data Lake solves many of the
productivity and scalability challenges that prevent you from maximizing the value of your data assets.
11. Explore your data with visual reports and collaborate, publish and share them with others. Power BI integrates
with other tools, including Microsoft Excel, so you can get up to speed quickly and work seamlessly with your
existing solutions.
12. Alternative 3: Azure Stream Analytics is a real time analytics service. It’s easily extensible with custom code and
built-in machine learning capabilities for more advanced scenarios.
13. Azure Functions is a serverless compute service, which allows you to run small pieces of code (called
"functions") without worrying about application infrastructure. With Azure Functions, the cloud infrastructure
provides all the up-to-date servers you need to keep your application running at scale.
14. Azure Notification Hubs allow you to send notifications to a wide range of mobile platforms and can allow
notification of operators and administrators on certain events or alerts which require immediate attention.

Components
Data is loaded from these different data sources using several Azure components:
IoT Edge – Azure IoT Edge moves cloud analytics and custom business logic to devices so that your organization
can focus on business insights instead of data management. Scale out your IoT solution by packaging your
business logic into standard containers, then you can deploy those containers to any of your devices and
monitor it all from the cloud.
Industrial Modules – The Azure Industrial IoT platform includes modules that run inside Azure IoT Edge to
connect your shop floor. The OPC Publisher module connects to OPC UA servers and publishes OPC UA
telemetry data from these servers to Azure IoT Hub. OPC Twin provides discovery, registration, and remote
control of industrial devices through REST APIs. The Discovery Module provides discovery services on the edge
which include OPC UA server discovery.
Azure IoT Hub IoT Hub is a managed service, hosted in the cloud, that acts as a central message hub for bi-
directional communication between your IoT application and the devices it manages. You can use Azure IoT Hub
to build IoT solutions with reliable and secure communications between millions of IoT devices and a cloud-
hosted solution backend. You can connect virtually any device to IoT Hub.
Industrial Services on Azure Kubernetes The platform is made up of several cloud components that divide into
microservices providing REST APIs, and agent services that provide processing and daemon-like functionality.
Azure Event Hubs is a big data streaming platform and event ingestion service. It can receive and process
millions of events per second. Data sent to an event hub can be transformed and stored by using any real-time
analytics provider or batching/storage adapters.
Time Series Insights Azure Time Series Insights is a fully managed analytics, storage, and visualization service
that makes it simple to explore and analyze billions of IoT events simultaneously. It gives you a global view of
your data, which lets you quickly validate your IoT solution and avoid costly downtime to mission-critical
devices.
Time Series Insights Explorer The Time Series Insights explorer demonstrates the powerful data visualization
capabilities provided by the service and can be accessed within your own environment.
Azure Data Lake makes Azure Storage the foundation for building enterprise data lakes on Azure. Designed
from the start to service multiple petabytes of information while sustaining hundreds of gigabits of throughput,
it allows you to easily manage massive amounts of data.
Power BI is a suite of business analytics tools to analyze data and share insights. Power BI can query a semantic
model stored in Analysis Services, or it can query Azure Synapse directly.
Azure Stream Analytics is a real-time analytics and complex event-processing engine that is designed to analyze
and process high volumes of fast streaming data from multiple sources simultaneously. Patterns and
relationships can be identified in information extracted from a number of input sources including devices,
sensors, clickstreams, social media feeds, and applications.
Azure Functions allows you to run small pieces of code (called "functions") without worrying about application
infrastructure. Azure Functions is a great solution for processing bulk data, integrating systems, working with the
internet-of-things (IoT), and building simple APIs and microservices.

Next Steps
For a detailed view of the Industrial Modules and Services, see the Azure Industrial IoT Platform architecture. A detailed view of all the individual Microservices and Agent processes is shown here.
You can find more information on how to get started with the Azure Industrial IoT Platform in the Industrial IoT
GitHub repository.
IoT and data analytics in the construction industry
11/2/2020 • 5 minutes to read

This example scenario is relevant to organizations building solutions that integrate data from many IoT devices into
a comprehensive data analysis architecture to improve and automate decision making. Potential applications
include construction, mining, manufacturing, or other industry solutions involving large volumes of data from
many IoT-based data inputs.
In this scenario, a construction equipment manufacturer builds vehicles, meters, and drones that use IoT and GPS
technologies to emit telemetry data. The company wants to modernize their data architecture to better monitor
operating conditions and equipment health. Replacing the company's legacy solution using on-premises
infrastructure would be both time intensive and labor intensive, and would not be able to scale sufficiently to
handle the anticipated data volume.
The company wants to build a cloud-based "smart construction" solution. It should gather a comprehensive set of
data for a construction site and automate the operation and maintenance of the various elements of the site. The
company's goals include:
Integrating and analyzing all construction site equipment and data to minimize equipment downtime and
reduce theft.
Remotely and automatically controlling construction equipment to mitigate the effects of a labor shortage,
ultimately requiring fewer workers and enabling lower-skilled workers to succeed.
Minimizing the operating costs and labor requirements for the supporting infrastructure, while increasing
productivity and safety.
Easily scaling the infrastructure to support increases in telemetry data.
Complying with all relevant legal requirements by provisioning resources in-country without compromising
system availability.
Using open-source software to maximize the investment in workers' current skills.
Using managed Azure services such as IoT Hub and HDInsight will allow the customer to rapidly build and deploy a
comprehensive solution with a lower operating cost. If you have additional data analytics needs, you should review
the list of available fully managed data analytics services in Azure.

Relevant use cases


Other relevant use cases include:
Construction, mining, or equipment manufacturing scenarios
Large-scale collection of device data for storage and analysis
Ingestion and analysis of large datasets

Architecture
The data flows through the solution as follows:
1. Construction equipment collects sensor data and sends the construction results data at regular intervals to load
balanced web services hosted on a cluster of Azure virtual machines.
2. The custom web services ingest the construction results data and store it in an Apache Cassandra cluster also
running on Azure virtual machines.
3. Another dataset is gathered by IoT sensors on various construction equipment and sent to IoT Hub.
4. Raw data collected is sent directly from IoT Hub to Azure blob storage and is immediately available for viewing
and analysis.
5. Data collected via IoT Hub is processed in near real time by an Azure Stream Analytics job and stored in an
Azure SQL database.
6. The Smart Construction Cloud web application is available to analysts and end users to view and analyze sensor
data and imagery.
7. Batch jobs are initiated on demand by users of the web application. The batch job runs in Apache Spark on
HDInsight and analyzes new data stored in the Cassandra cluster.
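As a hedged illustration of step 3 above, the sketch below shows how a piece of on-board equipment might send a sensor reading to IoT Hub using the azure-iot-device Python SDK. The connection string and payload fields are hypothetical; production devices would typically authenticate through the Device Provisioning Service rather than an embedded connection string.

import json
from azure.iot.device import IoTHubDeviceClient, Message

# Hypothetical device connection string issued by IoT Hub.
CONNECTION_STRING = "HostName=<hub>.azure-devices.net;DeviceId=excavator-42;SharedAccessKey=<key>"

client = IoTHubDeviceClient.create_from_connection_string(CONNECTION_STRING)
client.connect()

reading = {
    "deviceId": "excavator-42",        # hypothetical field names
    "engineTempC": 92.4,
    "fuelLevelPct": 63,
    "gps": {"lat": 35.68, "lon": 139.69},
}

msg = Message(json.dumps(reading))
msg.content_type = "application/json"
msg.content_encoding = "utf-8"
client.send_message(msg)

client.disconnect()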
Components
IoT Hub acts as a central message hub for secure bi-directional communication with per-device identity between
the cloud platform and the construction equipment and other site elements. IoT Hub can rapidly collect data for
each device for ingestion into the data analytics pipeline.
Azure Stream Analytics is an event-processing engine that can analyze high volumes of data streaming from
devices and other data sources. It also supports extracting information from data streams to identify patterns
and relationships. In this scenario, Stream Analytics ingests and analyzes data from IoT devices and stores the
results in Azure SQL Database.
Azure SQL Database contains the results of analyzed data from IoT devices and meters, which can be viewed by
analysts and users via an Azure-based Web application.
Blob storage stores image data gathered from the IoT hub devices. The image data can be viewed via the web
application.
Traffic Manager controls the distribution of user traffic for service endpoints in different Azure regions.
Load Balancer distributes data submissions from construction equipment devices across the VM-based web
services to provide high availability.
Azure Virtual Machines host the web services that receive and ingest the construction results data into the
Apache Cassandra database.
Apache Cassandra is a distributed NoSQL database used to store construction data for later processing by
Apache Spark.
Web Apps hosts the end-user web application, which can be used to query and view source data and images.
Users can also initiate batch jobs in Apache Spark via the application.
Apache Spark on HDInsight supports in-memory processing to boost the performance of big-data analytic
applications. In this scenario, Spark is used to run complex algorithms over the data stored in Apache
Cassandra.
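To give a flavor of how the analysis path reaches the construction results data (steps 2 and 7 of the flow above), this sketch queries the Cassandra cluster from Python with the DataStax cassandra-driver. The contact points, keyspace, and table names are assumptions for illustration only.

from cassandra.cluster import Cluster

# Hypothetical contact points and schema for the construction results keyspace.
cluster = Cluster(["10.0.1.4", "10.0.1.5"])
session = cluster.connect("construction")

rows = session.execute(
    "SELECT equipment_id, recorded_at, payload "
    "FROM construction_results WHERE equipment_id = %s LIMIT 10",
    ("excavator-42",),
)
for row in rows:
    print(row.equipment_id, row.recorded_at)

cluster.shutdown()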
Alternatives
Cosmos DB is an alternative NoSQL database technology. Cosmos DB provides multi-master support at global
scale with multiple well-defined consistency levels to meet various customer requirements. It also supports the
Cassandra API.
Azure Databricks is an Apache Spark-based analytics platform optimized for Azure. It is integrated with Azure to
provide one-click setup, streamlined workflows, and an interactive collaborative workspace.
Data Lake Storage is an alternative to Blob storage. For this scenario, Data Lake Storage was not available in the
targeted region.
Web Apps could also be used to host the web services for ingesting construction results data.
Many technology options are available for real-time message ingestion, data storage, stream processing,
storage of analytical data, and analytics and reporting. For an overview of these options, their capabilities, and
key selection criteria, see Big data architectures: Real-time processing in the Azure Data Architecture Guide.

Considerations
The broad availability of Azure regions is an important factor for this scenario. Having more than one region in a
single country can provide disaster recovery while also enabling compliance with contractual obligations and law
enforcement requirements. Azure's high-speed communication between regions is also an important factor in this
scenario.
Azure support for open-source technologies allowed the customer to take advantage of their existing workforce
skills. The customer can also accelerate the adoption of new technologies with lower costs and operating
workloads compared to an on-premises solution.

Pricing
The following considerations will drive a substantial portion of the costs for this solution.
Azure virtual machine costs will increase linearly as additional instances are provisioned. Virtual machines that
are deallocated will only incur storage costs, and not compute costs. These deallocated machines can then be
reallocated when demand is high.
IoT Hub costs are driven by the number of IoT units provisioned as well as the service tier chosen, which
determines the number of messages per day per unit allowed.
Stream Analytics is priced by the number of streaming units required to process the data streamed into the service.

Related resources
To see an implementation of a similar architecture, read the Komatsu customer story.
Guidance for big data architectures is available in the Azure Data Architecture Guide.
IoT using Cosmos DB
11/2/2020 • 6 minutes to read

Scale instantly and elastically to accommodate diverse and unpredictable IoT workloads without sacrificing
ingestion or query performance.
Azure Cosmos DB is Microsoft’s globally distributed, multi-model database. Azure Cosmos DB was built from the
ground up with global distribution and horizontal scale at its core. It offers turnkey global distribution across any
number of Azure regions by transparently scaling and replicating your data wherever your users are. You can
elastically scale throughput and storage worldwide, and pay only for the throughput and storage you need.
Cosmos DB is ideally suited for IoT solutions. Cosmos DB can ingest device telemetry data at high rates and can
serve indexed queries back with low latency and high availability.
Cosmos DB is a multi-model database with wire protocol–compatible API endpoints for Cassandra, MongoDB, SQL,
Gremlin, Etcd, and Table along with built-in support for Jupyter Notebook files.

Architecture

Data Flow
1. Events generated from IoT devices are sent to the analyze and transform layer through Azure IoT Hub as a
stream of messages. Azure IoT Hub stores streams of data in partitions for a configurable amount of time.
2. Azure Databricks, running Apache Spark Streaming, picks up the messages in real time from IoT Hub, processes
the data based on the business logic, and sends the data to the serving layer for storage. Spark Streaming can
provide real-time analytics such as calculating moving averages and min and max values over time periods.
3. Device messages are stored in Cosmos DB as JSON documents; this is considered the hot data store (a minimal write sketch follows this list). Different JSON schemas representing different device vendors can be stored in Cosmos DB or converted to a canonical JSON schema.
4. The storage layer consists of:
Azure Blob Storage - IoT Hub message routing can save the raw IoT device messages to Azure Blob
storage, allowing blob storage to act as an inexpensive, long-term cold data store.
Azure SQL Database - Utilize Azure SQL for storing your transactional and relational data (for example,
billing data, user roles).
Azure Synapse Analytics (previously Azure SQL Data Warehouse) - The solution's data warehouse. Populate it by using Azure Data Factory with aggregated data from Cosmos DB and Azure SQL.
5. Microsoft Power BI can be used by your users to analyze warehoused data.
6. Web, mobile and other applications can be built on the storage layer. For example, you can expose APIs based
on the storage layer data for third-party uses.
7. Use Cosmos DB Change Feed to execute an Azure Function each time a device message is added or updated in
Cosmos DB.
8. Some device messages (for example, a fault code) may require an action to be performed on the device. Using
the Azure IoT Hub Service API, the Azure Function can connect to Azure IoT Hub and perform an action on the
device (for example, reboot) using either:
Device Twins
Cloud to Device messages
Direct Methods
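The sketch below illustrates the document shape and upsert described in step 3. In this architecture the hot-path write comes from Databricks through the Spark connector; the example instead uses the azure-cosmos Python SDK so the idea stays self-contained. The account URI, key, and field names are assumptions.

from datetime import datetime, timezone
from azure.cosmos import CosmosClient

# Placeholder account details.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<account-key>")
container = client.get_database_client("iot").get_container_client("telemetry")

message = {
    "id": "sensor-17-2020-11-02T10:15:00Z",        # unique document id
    "deviceId": "sensor-17",                        # also used as the partition key
    "temperatureC": 21.7,
    "receivedAt": datetime.now(timezone.utc).isoformat(),
}

container.upsert_item(message)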

Components
This architecture uses the following Azure components:
Azure IoT Hub acts as the cloud gateway, ingesting device telemetry at-scale. IoT Hub also supports bi-
directional communication back to devices, allowing actions to be sent from the cloud or Azure IoT Edge to the
device. Azure IoT Edge can be used to run applications at the edge, such as machine learning models.
Azure Databricks with Apache Spark Streaming is located in the transformation and analytics layer. Databricks
uses the azure-eventhubs-spark_2.11:2.3.6 Maven library to connect to IoT Hub's Event Hub compatible
endpoint. Apache Spark Streaming is a scalable fault-tolerant streaming processing system that natively
supports both batch and streaming workloads.
Azure Cosmos DB is a globally distributed, multi-model database.
Consistency Levels - Cosmos DB supports five consistency levels (Strong, Bounded Staleness, Session,
Consistent Prefix, Eventual), allowing you to trade off read consistency against availability, latency, and throughput.
TTL - Azure Cosmos DB provides the ability to delete items automatically from a container after a certain
time period. This allows Cosmos DB to act as a hot data store for recent data, with long-term data stored
in Azure Blob cold storage.
Change Feed - Outputs a sorted list of documents that were changed in the order in which they were
modified. You can create small reactive Azure Functions that will be automatically triggered on each new
event in your Azure Cosmos container's change feed. Depending on the contents of the JSON document,
the Azure Function can connect to Azure IoT Hub Service API and execute an action on the device using
Device Twin, Cloud to Device messaging, or Direct Methods.
Request Unit (RU) - The measure of throughput in Azure Cosmos DB. RUs are compute units for both
performance and cost. With RUs, you can dynamically scale up and down while maintaining availability,
optimizing for cost, performance, and availability at the same time.
Partitioning - The partition key determines how Cosmos DB routes data across partitions, and it needs to make sense in the context of your specific scenario. The IoT Device ID is generally the “natural” partition key for IoT applications.
Azure SQL Database is the relational database for transactional and other non-IoT data.
Azure Synapse Analytics is the data warehouse and reporting platform for enterprise data warehousing and big data analytics, containing aggregated data from Azure SQL and Cosmos DB.
Power BI is a suite of business analytics tools to analyze data and share insights. Power BI can query a semantic
model stored in Azure Analysis Services, or it can query Azure Synapse directly.
Azure App Services can be used to build web and mobile applications. Azure API App can be used to expose
data to third parties, based on the data stored in the Serving Layer.
Azure Functions can be used to translate IoT message payloads (for example, from binary to JSON) or trigger
actions when connected to Cosmos DB Change Feed. Azure Functions is an event-driven serverless compute
platform. Build and debug locally without additional setup, deploy and operate at scale in the cloud, and
integrate services using triggers and bindings.
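To make steps 7 and 8 of the data flow concrete, here's a hedged sketch of an Azure Function (Python) triggered by the Cosmos DB change feed that invokes a direct method on the originating device through the IoT Hub service SDK. The Cosmos DB trigger binding itself lives in function.json (not shown), and the fault-code field, method name, and connection-string setting are assumptions.

import os
import azure.functions as func
from azure.iot.hub import IoTHubRegistryManager
from azure.iot.hub.models import CloudToDeviceMethod

# Service connection string for the IoT hub, assumed to be stored as an app setting.
IOTHUB_CONNECTION_STRING = os.environ["IOTHUB_SERVICE_CONNECTION_STRING"]

def main(documents: func.DocumentList) -> None:
    registry_manager = IoTHubRegistryManager(IOTHUB_CONNECTION_STRING)
    for doc in documents:
        # "faultCode" and "reboot" are hypothetical; substitute your own message schema.
        try:
            fault_code = doc["faultCode"]
        except KeyError:
            continue
        if fault_code:
            method = CloudToDeviceMethod(method_name="reboot", payload={})
            registry_manager.invoke_device_method(doc["deviceId"], method)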

Alternatives
In the transformation and analytics layer, we could use HDInsight Storm, HDInsight Spark or Azure Stream
Analytics to perform streaming analytics, and Azure Functions to perform message payload transformations.
In the service storage layer, Azure Data Explorer and Time Series Insights could be used for storing IoT
messages. Both of these services also provide rich analytics capabilities.

Considerations
Cosmos DB has a 20-GB limit (in the past it was 10GB) for a single logical partition. For most IoT solutions, this
size is sufficient. If not, we recommend either:
Setting the partition key to an artificial field and assigning a composite value (for example, Device ID +
Current Month and Year). This will ensure an extremely high cardinality of values.
Tier old Cosmos DB data out to cold storage (for example, Azure Blob Storage) using a combination of
TTL to automatically prune data from Cosmos DB and change feed to replicate data to cold storage.
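As a sketch of both mitigations, assuming the azure-cosmos Python SDK and hypothetical account and field names, you could create the container with a default TTL and write a synthetic partition key that combines the device ID with the current month:

from datetime import datetime, timezone
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<account-key>")
database = client.create_database_if_not_exists("iot")

# Documents expire after 30 days; change feed can replicate them to Blob storage before then.
container = database.create_container_if_not_exists(
    id="telemetry",
    partition_key=PartitionKey(path="/partitionKey"),
    default_ttl=30 * 24 * 60 * 60,
)

message = {"id": "sensor-17-001", "deviceId": "sensor-17", "temperatureC": 21.7}
# Synthetic key: device ID plus current year and month keeps each logical partition under the size limit.
message["partitionKey"] = f"{message['deviceId']}-{datetime.now(timezone.utc):%Y-%m}"
container.upsert_item(message)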

Next steps
Review the following articles on IoT and Cosmos DB.
Azure IoT reference architecture
IoT solutions and Cosmos DB
Change feed in Azure Cosmos DB
Time to Live (TTL) in Azure Cosmos DB
Consistency levels in Azure Cosmos DB
Request Units in Azure Cosmos DB
Partitioning and horizontal scaling in Azure Cosmos DB
Predictive Maintenance for Industrial IoT
11/2/2020 • 5 minutes to read

This example scenario demonstrates how end manufacturers can connect assets to the cloud using OPC UA (Open
Platform Communications Unified Architecture) and the Industrial Components. This enables the use of Predictive Maintenance to improve the efficiency of your machines while reducing costs by optimizing production. Mitigate disruptions by applying advanced analytics and machine learning to your production to
anticipate outages. Ensure production uptime with rich insights and automatic alerts triggered by manufacturing
data. OPC UA is a platform-independent and service-oriented interoperability standard for a secure and reliable
data exchange. OPC UA is used by various industrial systems and devices such as industry PCs, PLCs, and sensors.
It's a standard that is driven by the OPC Foundation.

Architecture

Data Flow
The data flows through the solution as follows:
1. Industrial devices that can natively communicate OPC UA can directly connect to IoT Edge. IoT Edge is the
compute power that sits on your on-premises network. It’s the runtime environment of the Industrial Modules
(OPC Publisher, OPC Twin, and Discovery Module). Modules are containers that run Azure services, 3rd party
services, or your own code. The OPC Publisher module connects to OPC UA servers and publishes OPC UA
telemetry data to Azure IoT Hub. OPC Twin creates a digital twin of an OPC UA server in the cloud and provides
OPC UA browse/read/write/method call capabilities via a cloud-based REST (Representational State Transfer)
interface. The Discovery module provides discovery services on the edge, which include OPC UA server
discovery.
2. Industrial devices that can’t communicate through OPC UA need a 3rd party PLC adapter to connect to IoT Edge.
Adapters are obtainable as modules in the Azure Marketplace.
3. The 3rd party PLC adapters enable connectivity between the devices and IoT Edge.
4. For analytical capabilities closer to where the data originates, modules such as Machine Learning on Edge
or Functions are obtainable from the Azure Marketplace, allowing low latency and operation in a disconnected state.
5. Azure IoT Hub virtually connects the devices to the cloud for further data processing. It enables security-
enhanced bidirectional communication between IoT applications and devices.
6. The Industrial Services are made up of several microservices exposing a REST API. All Industrial Services are
deployed to an Azure Kubernetes Service cluster. They implement business logic and functionality for discovery,
registration, remote control, and post-processing telemetry of industrial devices. The REST APIs can be used in
any programming language and framework that can call an HTTP endpoint. There are three predominant use
cases in which the data provided by the Industrial Services is used.
7. Azure Event Hubs receives the data and makes it available for downstream transformation and storage. It provides a distributed stream processing platform with low latency and seamless integration.
8. After Event Hubs delivers the data, Azure Data Lake stores it for further analysis (a Databricks sketch follows this list). Azure Data Lake is a massively scalable data lake with enterprise-grade security and auditing that supports running batch, streaming, and interactive analytic programs with ease. Azure Data Lake solves many of the productivity and scalability challenges that prevent you from maximizing the value of your data assets.
9. Azure Databricks provides the latest versions of Apache Spark as an Azure service offering and allows you to
seamlessly integrate with open source libraries. It provides a one-click setup, streamlined workflows, and an
interactive workspace that enables collaboration.
10. Use Power BI to explore your data with visual reports, and collaborate on, publish, and share those reports with others. Power BI integrates with other tools, including Microsoft Excel, so you can get up to speed quickly and work seamlessly with your existing solutions.
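As a minimal sketch for steps 8 through 10, assuming a Databricks notebook (where the spark session is predefined) that already has access to the Data Lake Storage account, and a hypothetical folder layout and JSON field names for the landed OPC UA telemetry, a notebook could aggregate the samples before they surface in Power BI:

from pyspark.sql import functions as F

# Hypothetical container, account, and folder layout for the landed telemetry.
raw = spark.read.json("abfss://telemetry@<storageaccount>.dfs.core.windows.net/opcua/2020/11/*")

hourly = (
    raw.withColumn("hour", F.date_trunc("hour", F.to_timestamp("sourceTimestamp")))
       .groupBy("nodeId", "hour")
       .agg(F.avg("value").alias("avgValue"), F.max("value").alias("maxValue"))
)

# Persist as a table that Power BI (or Azure Synapse) can query.
hourly.write.mode("overwrite").saveAsTable("opcua_hourly_aggregates")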

Components
Data is loaded from these different data sources using several Azure components:
IoT Edge Azure IoT Edge moves cloud analytics and custom business logic to devices so that your organization
can focus on business insights instead of data management. Scale out your IoT solution by packaging your
business logic into standard containers, then you can deploy those containers to any of your devices and
monitor it all from the cloud.
Industrial Modules – The Azure Industrial IoT platform includes modules that run inside Azure IoT Edge to
connect your shop floor. The OPC Publisher module connects to OPC UA servers and publishes OPC UA
telemetry data from these servers to Azure IoT Hub. OPC Twin provides discovery, registration, and remote
control of industrial devices through REST APIs. The Discovery Module provides discovery services on the edge,
which include OPC UA server discovery.
Azure IoT Hub Azure IoT Hub is a managed service, hosted in the cloud, that acts as a central message hub for
bi-directional communication between your IoT application and the devices it manages. You can use Azure IoT
Hub to build IoT solutions with reliable and secure communications between millions of IoT devices and a
cloud-hosted solution backend. You can connect virtually any device to IoT Hub.
Industrial Services on Azure Kubernetes The platform is made up of several cloud components that divide into
microservices providing REST APIs, and agent services that provide processing and daemon-like functionality.
Azure Event Hubs is a big data streaming platform and event ingestion service. It can receive and process
millions of events per second. Data sent to an event hub can be transformed and stored by using any real-time
analytics provider or batching/storage adapters.
Azure Data Lake makes Azure Storage the foundation for building enterprise data lakes on Azure. Designed
from the start to service multiple petabytes of information while sustaining hundreds of gigabits of throughput,
it allows you to easily manage massive amounts of data.
Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services
platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click
setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists,
data engineers, and business analysts.
Power BI is a suite of business analytics tools to analyze data and share insights. Power BI can query a semantic
model stored in Analysis Services, or it can query Azure Synapse directly.
Data Factory orchestrates the transformation of staged data into a common structure in Azure Synapse. Using
Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can ingest data
from disparate data stores. You can build complex ETL processes that transform data visually with data flows or
by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database.

Next Steps
For a detailed view of the Industrial Modules and Services, see the Azure Industrial IoT Platform architecture. A detailed view of all the individual Microservices and Agent processes is shown here.
You can find more information on how to get started with the Azure Industrial IoT Platform in the Industrial IoT
GitHub repository.
IoT Connected Platform for COVID-19 protection
11/2/2020 • 2 minutes to read

Smart group spaces can help people get back to work and play As Soon And Safely As Possible (ASASAP) after
COVID-19 shutdowns. Insight's Connected Platform for Detection and Prevention rapidly deploys and manages a
flexible, scalable intelligent edge and cloud ecosystem that helps detect and prevent COVID-19 infection in large
group spaces.
This article describes a Connected Platform solution to help provide COVID-19 detection and prevention at a theme
park.
Thermal cameras and contactless thermometers take temperatures of people entering the park.
Portable, interactive virus testing centers provide rapid, discreet onsite virus testing for people who fail
temperature checks.
Anomaly detecting smart cones monitor safe social distancing in groups and lines.
Smart, connected hand sanitizer dispensers monitor usage and supply levels.
Speakers and interactive bots deliver automated alerts, reminders, and instructions to employees and guests.
Data driven messaging and reporting let stakeholders monitor events and overall trends.

Potential use cases


Large, high-usage spaces with controlled entry and access.
Facilities like offices, factories, theaters, stadiums, malls, transportation centers, and tourist attractions.

Architecture

1. Thermal cameras and other sensors provide temperature and visual data through various network protocols like
Bluetooth and WiFi to the Internet of Things (IoT) Edge gateway.
2. The IoT Edge gateway preprocesses data and can respond quickly using onboard resources.
3. In the cloud, Azure IoT Hub communicates with and controls the IoT Edge network, and streams data to Azure
resources.
4. Azure Stream Analytics and Azure Databricks process data and send it to database and blob storage services.
5. Processed, stored data feeds into Docker containerized microservices apps in Azure Kubernetes Service (AKS).
6. The microservices apps trigger alerting and messaging services like email and bots.
7. Azure API Management incorporates internal and external APIs when deploying to endpoints like web apps,
mobile apps, Azure maps, and Power BI.
8. Azure components and deployed apps can share Azure services like Azure Security, Azure Active Directory,
Azure Key Vault, and Azure Monitor.
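The following sketch, offered as one possible shape for steps 1 through 3, shows an Azure IoT Edge module (Python, azure-iot-device SDK) that inspects temperature readings arriving from an upstream camera-gateway module and only forwards potential fever events to the cloud. The input/output names, message fields, and threshold are assumptions; the routes themselves would be defined in the deployment manifest.

import json
import time
from azure.iot.device import IoTHubModuleClient, Message

FEVER_THRESHOLD_C = 38.0   # hypothetical alerting threshold

client = IoTHubModuleClient.create_from_edge_environment()
client.connect()

def handle_message(message):
    reading = json.loads(message.data)
    if reading.get("temperatureC", 0.0) >= FEVER_THRESHOLD_C:
        alert = Message(json.dumps({
            "cameraId": reading.get("cameraId"),
            "temperatureC": reading["temperatureC"],
            "kind": "fever-alert",
        }))
        # "alerts" is a hypothetical output route defined in the deployment manifest.
        client.send_message_to_output(alert, "alerts")

client.on_message_received = handle_message

while True:   # keep the module alive; the SDK invokes the handler on incoming messages
    time.sleep(10)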

Components
Azure IoT Edge intelligent devices recognize and respond to sensor input by using onboard processing. These
devices can respond rapidly, or even offline. Intelligent Edge devices limit costs by preprocessing and sending
only necessary data to the cloud.
Azure IoT Hub connects virtually any IoT device with Azure cloud services. IoT Hub enables highly secure and
reliable bi-directional communication, management, and provisioning for IoT Edge devices.
Azure Stream Analytics (ASA) provides real-time serverless stream processing with built-in machine learning
(ML) models to perform anomaly detection directly in streaming jobs.
Azure Data Lake Storage is a data lake storage solution for big data analytics, combining Azure Blob Storage
capabilities with a high-performance file system.
Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics service that can read and
analyze data lake data.
Azure Cognitive Services are artificial intelligence (AI) services and cognitive APIs that help build intelligent
apps. For example, Computer Vision helps count and monitor people density and movements. Speech to Text,
Text to Speech, and Language Understanding help provide verbal responses and interactions.
Azure Kubernetes Service (AKS) is a managed, serverless Kubernetes platform for microservices apps.
Kubernetes is open-source orchestration software for deploying, managing, and scaling containerized apps.
Azure API Management deploys Azure, third-party, and external APIs side by side to optimize traffic flow,
provide unified control and visibility, and ensure security and compliance.
Microsoft Power BI visualizations enable well-informed and data-driven reporting and decision making.

Next steps
For more information, please contact [email protected], and see Connected Platform for
Detection and Prevention.
For more information about the Insight Connected Platform, see Connected Platform.

Related resources
Anomaly detection in Azure Stream Analytics
Microservices architecture on Azure Kubernetes Service (AKS)
Contactless interfaces with Azure IoT Edge
11/2/2020 • 2 minutes to read

Contactless business is the new normal. The world has become more aware of, and endangered by, surfaces that
many people touch every day. Contactless interfaces reduce or eliminate physical touchpoints like traffic light
buttons, touch screens, door handles, and elevator controls by creating touch-free experiences that are both safe
and pleasing for users.
Avanade and the Microsoft COVID-19 task force partnered to develop contactless interfaces using the Azure
intelligent edge platform. This solution combines intelligent and perceptive Internet of Things (IoT) edge devices
with the storage, computing, artificial intelligence (AI), and machine learning (ML) capabilities of the Azure cloud.
IoT Edge devices can quickly recognize and respond to speech, image, gesture, or multi-modal input by using
onboard processing. Azure IoT Hub in the cloud controls the devices and connects them to Azure resources. Azure
Cognitive Services and Machine Learning continually retrain and update models to improve interface accuracy and
performance.

Potential use cases


Turn public touchpoints into touch-free interfaces for uses like building and room access, elevator controls, retail
and vending machine sales, and transit and traffic signaling.
Use voice or other contactless controls with typical mobile endpoints like phones.

Architecture

1. Endpoints like microphones, cameras, and touchscreens collect data.


2. Onboard voice processing units on the IoT Edge devices apply cognitive skillsets and updated machine learning
models to local data.
3. The IoT hub in the Azure cloud controls and communicates with the edge devices, receiving data and sending
updated models.
4. Azure Storage stores the uploaded data.
5. Azure Machine Learning uses the data to retrain its AI models.
6. IoT Hub pushes updated machine learning models to the edge devices.
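As an illustrative sketch of step 2, assuming the Cognitive Services Speech SDK for Python and placeholder key, region, and command phrases, a device-side loop could turn a spoken command into a contactless action:

import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials for a Cognitive Services Speech resource.
speech_config = speechsdk.SpeechConfig(subscription="<speech-key>", region="<region>")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)   # uses the default microphone

result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    command = result.text.lower()
    # "open the door" is a hypothetical phrase mapped to a local actuator.
    if "open the door" in command:
        print("Triggering door actuator")
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("Speech not recognized; ask the user to repeat")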
Components
Azure IoT Edge service deploys cloud workloads to run on IoT Edge devices via standard containers. Modules
can run AI, other Azure and third-party services, or your own business logic. IoT Edge intelligent devices can
respond quickly and offline, and limit costs by preprocessing and sending only necessary data to the cloud.
Azure IoT Hub provides a cloud-hosted backend to connect virtually any IoT device with Azure cloud services.
IoT Hub enables highly secure and reliable bi-directional communication, management, and provisioning for IoT
Edge devices.
Azure Storage provides flexible, scalable, secure storage in the Azure cloud. The current solution uses block
blobs to store unstructured data, page blobs to read and write random small data segments, and file storage for
file shares.
Azure Cognitive Services are a family of AI services and cognitive APIs that help build intelligent apps. For
example, voice control can use the Speech to Text and Speaker Recognition services. Extending the solution to
image or facial recognition could use Computer Vision, Custom Vision, and Facial Recognition.
Machine Learning (ML) uses algorithms to improve machine predictions or decisions automatically through
experience. Machine learning algorithms build and continually train mathematical models. Azure Machine
Learning lets you build, train, deploy, track, and manage ML models at cloud scale.

Next steps
For more information, please contact [email protected].
Voice control is the basic function in this solution. The platform can potentially extend to multiple sense
recognition, like faces and gestures.
The solution can support global deployment by using Cognitive Services Speech Translation APIs.
COVID-19 Safe Solutions at the IoT Edge
11/2/2020 • 3 minutes to read

As workplaces and other group spaces reopen after the COVID-19 shutdown, people need to follow health and
safety guidelines for safe social distancing, mask and PPE use, and occupancy limits. Bosch COVID-19 Safe
Solution combines existing closed-circuit TV (CCTV) infrastructure with the Azure intelligent edge and other Azure
and Microsoft services to help organizations monitor, comply with, and improve these health and safety practices.
This article showcases a COVID-19 Safe Solution that is being implemented in workplace facilities across Europe, North America, and Latin America, in industries as diverse as steel manufacturing, building construction, and automotive assembly. Goals were to:
Ensure a safe work environment when resuming manufacturing after the COVID-19 lockdown.
Monitor and enforce compliance with face mask policy, social distancing, and occupancy limits on factory
premises.
Increase visibility and control through efficient reporting and interactive dashboards.
Deliver alerts and notifications so health and safety stakeholders could address safety violations and concerns.
Improve compliance, reduce violations, and enable well-informed safety decisions over time.

Potential use cases


Spaces with existing CCTV infrastructure.
Stores, restaurants, offices, factories and warehouses, public transportation, hospitals, schools, and
entertainment and recreation spaces.
Organizations with multiple locations, to enable widespread, systemic data analysis and actions.

Architecture

1. CCTVs send video data to Internet of Things (IoT) Edge servers. Edge computing handles device registration,
provisioning, and data ingestion.
2. The Bosch Algorithm Engineering and Model Training Environment uses custom vision analytics to continually
retrain machine learning (ML) models, and directly updates edge servers.
3. Edge servers send data to onboard and cloud stream analytics and blob storage. IoT Edge intelligent devices
limit costs by preprocessing and sending only necessary data to the cloud.
4. Stream analytics perform data enrichment and validation on both edge and cloud data.
5. Service bus device-to-cloud and cloud-to-device messaging send data and telemetry to and from the cloud.
6. An interactive visual dashboard app provides near real-time compliance monitoring and alerts.
7. The Azure cloud provides application logs, monitoring, security, application gateway, and API management for
the app.
8. Redis, MongoDB, and blob storage store cloud data for Power BI analytics and visualizations via a custom
connector.
9. The app sends notifications and alerts to stakeholders via Microsoft Teams.
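As a small illustration of step 9, assuming a Microsoft Teams incoming-webhook URL has been configured for the health and safety channel (the URL, card fields, and alert wording here are hypothetical), the dashboard app could post an alert card like this:

import requests

# Hypothetical incoming-webhook URL created for the health and safety Teams channel.
TEAMS_WEBHOOK_URL = "https://<tenant>.webhook.office.com/webhookb2/<id>"

def send_safety_alert(site: str, violation: str, count: int) -> None:
    card = {
        "@type": "MessageCard",
        "@context": "https://schema.org/extensions",
        "summary": "Safety violation detected",
        "themeColor": "D70000",
        "title": f"{violation} detected at {site}",
        "text": f"{count} occurrence(s) in the last five minutes. See the dashboard for details.",
    }
    response = requests.post(TEAMS_WEBHOOK_URL, json=card, timeout=10)
    response.raise_for_status()

send_safety_alert("Assembly line 3", "Missing face mask", 4)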

Components
Azure IoT Edge servers with onboard storage, computing, artificial intelligence (AI), and machine learning (ML)
capabilities can quickly recognize and respond to sensor input.
Bosch video analytics use Custom Vision skills and Azure Machine Learning to continually improve monitoring,
detection, and real-time alert triggering.
Azure Digital Twins IoT service creates comprehensive models of physical environments in a spatial intelligence
graph. Rather than simply tracking individual devices, Digital Twins can virtually replicate the physical world by
modeling the relationships between people, places, and devices.
Azure Stream Analytics (ASA) provides real-time serverless stream processing that can run the same queries in
the cloud and on the edge. ASA on IoT Edge can filter or aggregate data that needs to be sent to the cloud for
further processing or storage.
Azure Storage provides flexible, scalable, secure Blob storage for unstructured data in the Azure cloud or on the
IoT Edge. The current solution also uses Redis and MongoDB data storage.
Azure Service Bus messaging through Azure IoT Hub connects devices to Azure cloud resources, and can use
queries to filter data to be sent to the cloud.
Azure Kubernetes Service (AKS) is a managed service for developing, deploying, and managing containerized
applications. In this solution, AKS manages an interactive visual dashboard app that tracks and analyzes safety
violations.
By integrating with the Azure cloud, the solution can use services like Azure Monitor, Azure Security Center, and
Azure Active Directory.
Integration with Microsoft Teams allows automated notifications of relevant stakeholders like HR and Security.
Microsoft Power BI visualizations enable well-informed and data-driven reporting and decision making.

Next steps
For more information:
Contact [email protected]
See https://www.bosch-india-software.com

Related resources
What is Azure Digital Twins?
Azure Stream Analytics on IoT Edge
Store data at the edge with Azure Blob Storage on IoT Edge
Smart and secure lighting and disinfection
11/2/2020 • 5 minutes to read

Smart, connected Internet of Things (IoT) devices can make life healthier and safer. For example, research
increasingly confirms the importance of high-quality light and natural light cycles for promoting human alertness,
health, well-being, and productivity. BrainLit's patented BioCentric Lighting™ (BCL) system is a dynamic, self-
learning, IoT-based system that creates high-quality natural lighting for indoor environments.
The BCL system can also deliver radiant energy in the non-visible ultraviolet (UV) spectrum to kill viruses. All UV
radiation has antimicrobial capabilities. The shortest-wavelength band, UVC radiation at 200-300 nm, causes DNA
inactivation and prevents viral replication. UVC radiation disinfection is an important weapon in the fight against
COVID. BrainLit's new UVEN concept combines BCL light control with UVC disinfection to promote health and well-
being and help kill viruses.
Other UVC virus-killing technology that uses plug-in devices, robots, drones, and wands can't provide overall space
coverage or performance and effectiveness data. Operating these devices safely and efficiently disrupts business
and personnel, resulting in loss of productivity and lower space utilization. UVEN disinfection operates in
unoccupied spaces during non-use times, providing safe, comprehensive, autonomous microbe deactivation
without business disruption.
Because IoT devices work directly on the physical environment and may use and collect sensitive data, device safety
and security are paramount. UVEN double fail-safe features ensure occupants receive beneficial light only. The
Azure Sphere standalone microprocessor unit (MCU)-based platform securely runs IoT apps and connects directly
to the cloud for complete Azure-based security and the latest OS and app updates.
The system can continually integrate new research developments and public health updates via its cloud
connection, ensuring an up-to-date and scientifically based lighting and disinfection system.
UVEN combined BCL and UVC technology offers:
A smart, safe, and secure integrated system solution for lighting and disinfection, with optimal usage of floor
areas.
Long-term solutions that can adapt lighting and disinfection doses and recipes to changing circumstances.
Overall better health and improved immune systems for occupants through optimized lighting and a virus-free
environment.

Potential use cases


UVEN cells can fit any layout of a specific area, like a reception counter, or split between several areas, like a
number of washrooms.
The modular system can scale to fit any space, from individual rooms and homes to large campuses and public
areas.
Azure Sphere-based infrastructure can easily integrate with existing IoT devices or building management
systems.

Architecture
The system uses radar proximity sensors to detect when people are present. Whenever a space is unoccupied,
the panels emit UV radiation to heavily reduce or eliminate viruses and bacteria.
One UVEN lighting control system (LCS) can control both BCL light and UVC radiation emissions for up to 20
luminaires. The LCS connects to radar sensors, electrical power, and to the internet or other LCSs for updates
and real-time calibration.
Several cells can connect in a cluster. Only one of the clustered LCSs needs to connect to the internet. An LCS or
cluster can operate standalone, or in an overall UVEN BCL network system with other clusters.
A UVEN BCL Net is completely decentralized and offers unlimited scalability, control, and flexibility.

Components
The UVEN system consists of light-emitting diode (LED) luminaires that are controlled by Azure Sphere-powered
lighting control systems.
UVEN components
The basic UVEN ceiling panel consists of dynamic, adaptable LEDs that integrate BCL with disinfection in the same
luminaire. As well as offering a large range of visible and UV emissions, the LEDs provide exceptional color and
detail rendering, and the ability to dim smoothly down to 0.1%. LEDs are low-cost, long-lasting, and
environmentally sustainable compared to traditional incandescent and fluorescent indoor light sources.
When people are present, the ceiling panels emit BCL light only. In unoccupied spaces, the panels emit UV radiation
to heavily reduce or eliminate viruses and bacteria. Separate drivers optimized to operate in their respective
regimes maintain full BCL and UV functionality simultaneously.
Radar motion sensors, alarms, and manual overrides provide double fail-safe insurance so users receive beneficial
light only. Flexible, scalable, updatable LCSs based on Azure Sphere technology control luminaires, sensors, and
safety and alarm mechanisms.
UV radiation can damage surfaces and equipment. UVEN advanced sensors and controllers ensure that the
luminaires generate the minimum doses of UVC radiation to ensure disinfection. The system can continually
integrate new research developments and public health guidelines via its cloud connection, providing the best
scientifically based lighting and disinfection.
UVEN administrative support tools include an admin panel to maintain information about users and devices, and a
dashboard to provide an overall view of user activity and exposure.
Azure Sphere
Azure Sphere is a secure application platform for IoT connected devices with built-in cloud communication and
security. The platform consists of an Azure Sphere-certified secured MCU chip that runs a custom high-level Linux-
based microcontroller operating system (OS). The Azure Sphere OS provides a platform for IoT application
development, including both high-level and real-time capable applications. The OS connects to the cloud-based
Azure Sphere Security Service (AS3) for continuous, renewable security.
AS3 establishes a secure connection between a device and the internet or cloud, ensures secure boot, and enables
maintenance, updates, and control for Azure Sphere-certified chips. AS3 authenticates device identity, ensures the
integrity and trust of the system software, and certifies that the device is running a trusted code base. The service
also provides a secure channel to automatically download and install OS and customer application updates to
deployed devices.
The MediaTek MT3620 is the first Azure Sphere-certified chip. The chip includes an ARM Cortex-A7 500-MHz
processor, two ARM Cortex-M4F 200-MHz I/O subsystems, Wi-Fi capability, and the Microsoft Pluton security
subsystem with a dedicated ARM Cortex-M4F core. The Linux-based Azure Sphere OS lets developers write
applications that can run on either the A7 core with access to external communications, or as real-time capable
apps on one of the M4 processors. Developers can distribute applications to Azure Sphere devices through the
same secure mechanism as the Azure Sphere OS updates.
Azure Sphere greenfield implementations involve designing and building new IoT devices with Azure Sphere-
certified chips. Brownfield implementation allows existing IoT devices to securely connect to the internet through
an Azure Sphere Guardian module. The guardian module contains an Azure Sphere certified chip, the Azure Sphere
OS, and AS3. The module connects to the IoT device through an existing peripheral, and then securely connects to
the internet. The IoT device itself isn't exposed to the internet.

Next steps
For more information, please contact [email protected].
For more information about the BCL system, see the BrainLit website.
For more information about Azure Sphere, see the Azure Sphere documentation.
For Azure Sphere code samples, see Azure Sphere Samples.
Light, power, and internet for emerging markets
11/2/2020 • 4 minutes to read

Globally, over 1 billion people lack access to electricity, and over 3 billion people have no internet access. Veriown is
an energy provider working with Azure Internet-of-Things (IoT) cloud services to provide life-changing, low-cost
electricity and connectivity to remote and rural customers. Veriown's IoT devices combine solar power with internet
communications to deliver monumental improvements to users' quality of life.
Veriown has run large-scale network and telecommunications infrastructures for years, but has never before
provided such a wide scope of services at such low cost. Veriown's Connect devices combine IoT hardware and
mobile applications with Azure cloud capabilities to act as energy and internet hubs for homes and small
businesses.
The Connect IoT device uses a rooftop solar panel and charged battery to deliver clean electricity, light, and cellular
internet connectivity for pennies a day, with high reliability and minimal downtime. An integrated SIM card and
tablet provide individualized access to online content and services like telemedicine, education, news and weather,
entertainment, and commerce.
Azure supports two major workstreams in Veriown's IoT solution:
Real-time IoT device telemetry detects transient or long-running anomalies, and the system can respond in
real time with chatbots and retrained device actions. For example, in low-power conditions a customer's
device can automatically reduce power usage for background or inactive features, so the customer
continues to get a good experience with the services they're actively using.
Post-processing data analytics evaluate usage and incidents to determine algorithms for future needs and
preventive maintenance. For example, Veriown could send customers parts that are predicted to fail soon, or
could improve the responses of an artificial intelligence (AI) chatbot.
Since bandwidth is limited and expensive in emerging markets, analyzing usage patterns and incidents can help
content and service owners target customers with only the content and services they currently need.

Potential use cases


A solar-powered Connect device in a customer's home or business can provide:
LED light to replace kerosene lanterns
USB device charging
Telemedicine support
Online education
Commerce and banking services
News and weather
Entertainment programming
Emergency and support communications
The basic Connect device provides light, electricity, internet connectivity, and customer support. Customers can
purchase additional services and content as one-offs or subscriptions.

Architecture
1. Field sales and service agents use a mobile platform to interact with the cloud application via Azure
Application Gateway. End users use a built-in device or mobile interface to access and control their devices
and interact with content.
2. Application Gateway uses messaging protocols to interact with users and operators.
3. The cloud app consists of containerized microservices that provide functions and interfaces like identity and
access management, device upgrades, notifications, and commerce services.
The app uses Azure services and resources like Azure Blob Storage for unstructured data storage, Azure
Cosmos DB for large structured databases, and Azure Media Services to store and deliver entertainment
content.
The Connect devices also send streaming telemetry and user data to the cloud via Azure IoT Hub. In the business
intelligence part of the process:
1. IoT Hub and Azure Event Hubs receive the streaming data and route events.
2. Azure Databricks extracts, transforms, and loads (ETLs) the event data.
3. Azure Synapse, a SQL big-data warehouse, performs analytics and stores the transformed data.
4. The analyzed data populates Power BI reports for system evaluation and future planning.
Analytics and machine learning data flow
The system includes the following data analysis and control loop:
1. The Connect devices stream telemetry and user behavior to IoT Hub.
2. IoT Hub sends the data to Azure Databricks and to Azure Machine Learning (ML).
3. Databricks sends some events, like alarms, directly to Customer Support for intervention.
4. Synapse Analytics performs analytics on the ETL data.
5. The analyzed data populates Power BI reports.
6. Azure ML combines current data with stored external data, like historical weather and forecasts, and uses the
results to retrain the power management ML models.
7. IoT Hub sends the retrained models to the Connect devices, which adjust their behavior and schedules
accordingly.
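One way step 7 could look in practice, offered as a sketch rather than Veriown's actual implementation, is a desired-property update on the device twin that tells a Connect device which power-management model version and thresholds to use. This uses the azure-iot-hub Python service SDK; the connection string, device ID, and property names are assumptions.

from azure.iot.hub import IoTHubRegistryManager
from azure.iot.hub.models import Twin, TwinProperties

# Service-side connection string for the IoT hub (placeholder).
registry_manager = IoTHubRegistryManager(
    "HostName=<hub>.azure-devices.net;SharedAccessKeyName=service;SharedAccessKey=<key>"
)

device_id = "connect-device-0042"        # hypothetical device
twin = registry_manager.get_twin(device_id)

# Push the retrained model version and a low-battery threshold as desired properties.
patch = Twin(
    properties=TwinProperties(
        desired={"powerManagement": {"modelVersion": "2020.11.02", "lowBatteryThresholdPct": 20}}
    )
)
registry_manager.update_twin(device_id, patch, twin.etag)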

Components
Azure Application Gateway manages and load balances traffic to and from cloud web apps.
Azure Kubernetes Service (AKS) hosts and simplifies Kubernetes orchestration of Docker containerized apps.
Azure Container Registry (ACR) is a managed, private registry service that supports AKS applications at scale.
Azure IoT Hub is a central cloud message hub for bi-directional communications between IoT applications and
devices.
Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics service for big data pipelines.
Azure Synapse Analytics, formerly SQL Data Warehouse, is an analytics service that brings together enterprise
data warehousing and big data analytics.
Power BI is a collection of software services, apps, and connectors that turn data into coherent, immersive,
interactive visualizations and reports.
Azure Machine Learning is a cloud-based ML environment that uses existing data to forecast future behaviors,
outcomes, and trends.
See also
Azure IoT documentation
A solar-powered device will bring online entertainment, education to villages
Veriown website
Predictive maintenance with the intelligent IoT Edge
11/2/2020 • 6 minutes to read

The Internet-of-things (IoT) Edge brings data processing and storage close to the data source, enabling fast,
consistent responses with reduced dependency on cloud connectivity and resources. Edge computing can
incorporate artificial intelligence (AI) and machine learning (ML) models to create intelligent edge devices and
networks, which can integrate with the cloud for further processing and security.
This article describes a collaboration between the Microsoft Commercial Software Engineering (CSE) team and a
major railway company to create an intelligent cloud and intelligent edge train maintenance and safety solution.
The railway company wants to improve railroad safety and efficiency by proactively identifying defective
components, predictively scheduling maintenance and repair, and continuously improving their findings and
predictions. The pilot project for the ML on Edge solution is a train wheel health analysis system.
Over 4,000 trackside detectors continuously monitor and stream wheel data from all the company's trains. The
detectors measure heat and force of equipment on the tracks, listen for invisible wheel bearing defects or wheel
cracks, and identify missing or misplaced parts. The ML on Edge system processes and acts on this continuous
streaming detector data in near-real time to identify at-risk equipment, determine repair urgency, generate alerts,
and send data to the Azure cloud for storage. The IoT Edge modules run on server class hardware in trackside
bungalows, allowing for future parallel deployment of other workloads.
Bringing ML and business logic closer to the data sources lets devices react faster to local changes and critical
events. Devices can operate reliably offline or when connectivity is limited. The Edge network can determine which
data to send to the cloud, or prioritize urgent and important data first.
The wheel health analysis system provides early identification of potential equipment failure, helping prevent
catastrophic failures that could lead to train derailment. The company can use stored data to spot trends and
inform prescriptive maintenance and overhaul schedules.

Use cases
IoT Edge implementations are most relevant when large amounts of data captured in real time need action or
decisions with little or no latency. The example system had to maintain 99.999% uptime, process data from up to
24 trains and 35 million readings per day, and guarantee one-hour delivery of alerts and notifications.

Architecture
1. An image file server (NAS) in a bungalow serves processed and categorized train wheel images. Three pictures
of each wheel create a stitched image.
2. The polling module alerts the Edge device that new images are available for processing.
3. A third-party ML model processes the images and identifies wheel areas that need more inspection.
4. The alert handler uploads all images into Azure Blob Storage, starting with images that have potential defects,
and returns the image blob URIs.
5. IoT Edge Hub associates the image URIs with image metadata, and uploads the metadata and alerts to Azure IoT
Hub.
6. IoT Hub sends the metadata via Event Hubs and Azure Functions to an Azure Cosmos DB database.
7. The Cosmos DB database holds the image metadata and points to the location of images in Azure Blob Storage.
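As a simplified sketch of step 4, assuming a storage connection string, container name, and blob naming scheme (all placeholders), the alert handler's upload of a stitched wheel image could look like this with the azure-storage-blob SDK:

from azure.storage.blob import BlobServiceClient, ContentSettings

# Placeholder storage account connection string and container.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("wheel-images")

blob_name = "train-12/car-03/wheel-07/2020-11-02T10-15-00Z.jpg"   # hypothetical naming scheme
with open("/mnt/nas/stitched/wheel-07.jpg", "rb") as image:
    container.upload_blob(
        name=blob_name,
        data=image,
        overwrite=True,
        content_settings=ContentSettings(content_type="image/jpeg"),
    )

# The blob URI is what gets attached to the image metadata sent to IoT Hub.
image_uri = f"{container.url}/{blob_name}"
print(image_uri)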
Components
The deployed solution requires an Azure subscription with permission to add service principals and the ability to
create Azure resources. For more information, see Container registries and service principals.
Azure IoT Edge is made up of three components:
IoT Edge modules are containers that can run Azure, third-party, or custom components. The current
example deploys the IoT Edge modules in trackside bungalows, using server-class hardware with
customized industrial automation cards and graphics processing units (GPUs) for performance.
The IoT Edge runtime, consisting of the IoT Agent and IoT Edge Hub, runs on the IoT Edge devices to
manage and coordinate the deployed modules.
The Azure IoT Hub interface enables secure bi-directional cloud communication, management, and
monitoring of IoT Edge modules.
IoT Edge ML modules support Azure Machine Learning, third-party ML models, or custom code. The current
solution uses a third-party open-source ML model called Cogniac to score train wheel data and recognize
potential defects. The ML software uses historical samples of high- and low-confidence failure images to retrain
its ML model.
Azure Blob Storage is Microsoft's object storage solution for the cloud. Blob storage is optimized for storing
massive amounts of unstructured data like the image data in this example.
Azure Cosmos DB is a fully-managed, NoSQL database service with low response times and high availability
and scalability.
An Azure Pipelines workflow builds, tests, deploys, and archives the IoT Edge solution through built-in Azure IoT
Edge tasks.
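As a rough illustration of steps 6 and 7 in the architecture above, the following Azure Functions (Python) sketch takes image metadata arriving from the Event Hub-compatible endpoint and writes it to Cosmos DB through an output binding. The binding names and document shape are assumptions; the actual trigger and output binding configuration lives in the function's function.json and isn't shown here.

```python
import json
import azure.functions as func

def main(event: func.EventHubEvent, doc: func.Out[func.Document]):
    # Parse the metadata message forwarded by IoT Hub (step 6).
    metadata = json.loads(event.get_body().decode("utf-8"))
    # The document keeps the blob URI so downstream apps can locate the stitched image (step 7).
    doc.set(func.Document.from_dict(metadata))
```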

Considerations
The team identified several design considerations:
The system requires 99% uptime and on-premises message delivery within 24 hours. The Quality of Service
(QoS) for the last mile of connectivity between bungalow and Azure determines the QoS of data from the edge.
Local internet service providers (ISPs) govern the last mile of connectivity, and may not support the required
QoS for notifications or bulk data uploading.
This system doesn't interface with the wheel cameras and backing data stores, so it has no control over, and can't
raise alerts about, camera system or image server failures.
The railway company only owns the inferencing system, and relies on a third-party vendor for ML model
generation. The black-box nature of the ML module poses some risk of dependency. Long-term solution
maintenance requires understanding how the third party governs and shares assets. The system may be able to
use placeholder ML modules for future engagements when ML assets aren't available.
Security and monitoring are considerations for IoT Edge systems. For this engagement, the company's existing
third-party enterprise solution covered system monitoring. The physical security of trackside bungalows and
network security were already in place, and connections from the IoT Edge to the cloud are secure by default.
This solution doesn't replace existing manual inspection requirements determined by company and federal
regulatory authorities.
The Edge architecture is currently split into multiple modules, but can be condensed into a single module,
depending on solution performance requirements or development team structure.
This solution builds on the following previous CSE customer engagements in the manufacturing, oil and gas,
and natural resource management industries:
CloudEvents
Claim Check Patterns
Command and Query Responsibility Segregation (CQRS) Pattern

Deployment
The railway company hosts the continuous integration/continuous deployment (CI/CD) system on-premises. The
following diagram shows the DevOps architecture for deployment:
1. In the first CI pipeline, a code push into the Git repository triggers the build of the IoT Edge module and
registers the module image in Azure Container Registry.
2. When the CI pipeline completes, it triggers the CD pipeline, which generates the deployment manifest and
deploys the module to IoT Edge devices.
The deployment has three environments: Dev, QA, and Production. Module promotion from Dev to QA and from
QA to Production supports both automatic and manual gated checks.
Building and deploying the solution also uses:
Azure CLI
Docker CE or Moby to build and deploy the container modules
For development, Visual Studio or Visual Studio Code with the Docker, Azure IoT, and relevant language
extensions

Next steps
The future of computing: intelligent cloud and intelligent edge
Azure IoT Edge documentation
Build IoT Edge modules
End-to-end solution using Azure Machine Learning and IoT Edge
Continuous integration and continuous deployment to Azure IoT Edge
Deploy Azure IoT Edge modules from the Azure portal

Related resources
GitHub code projects
NVIDIA Container Toolkit
Azure IoT Edge and RabbitMQ
Cookiecutter Template for Azure IoT Edge Python Module
Streaming at Scale
ServiceBusExplorer
DASH for Azure Storage
Solution learning resources
Docker Compatibility Matrix
Jenkins Azure IoT Edge plugin
Event Sourcing
React to Blob storage events
Azure Blob storage bindings for Azure Functions
Serverless Streaming At Scale with Cosmos DB
Ingestion and processing of real-time automotive IoT
data

This example scenario builds a real-time data ingestion and processing pipeline that ingests and processes messages
from IoT devices (sensors in general) into a big data analytics platform in Azure. Vehicle telematics ingestion and
processing platforms are key to creating connected car solutions. This specific scenario is motivated by car
telematics ingestion and processing systems. However, the design patterns are relevant for many other industries
that use sensors to manage and monitor complex systems, such as smart buildings, communications,
manufacturing, retail, and healthcare.
This example demonstrates a real-time data ingestion and processing pipeline for messages from IoT devices
installed in vehicles. These IoT devices and sensors generate thousands to millions of messages (or events).
By capturing and analyzing these messages, we can derive valuable insights and take appropriate actions. For
example, with cars equipped with telematics devices, if we can capture the device (IoT) messages in real time, we
can monitor the live location of vehicles, plan optimized routes, provide assistance to drivers, and support
telematics-related industries such as auto insurance.
For this example demonstration, imagine a car manufacturing company that wants to create a real-time system to
ingest and process messages from telematics devices. The company's goals include:
Ingest and store data in real time from vehicle sensors and devices.
Analyze the messages to understand vehicle location, and other information emitted through different types of
sensors (such as engine-related sensors and environment-related sensors).
Store the data after analysis for other downstream processing to provide actionable insights (for example, in
accident scenarios, insurance agencies may want to know what happened during an accident).

Relevant use cases


Other relevant use cases include:
Vehicle maintenance reminders and alerting.
Location-based services for the vehicle passengers (that is, SOS).
Autonomous (self-driving) vehicles.

Architecture
In this real-time big data processing pipeline, the data flows through the solution as follows:
1. Events generated from the IoT data sources are sent to the stream ingestion layer through Azure IoT Hub as a
stream of messages. Azure IoT Hub stores streams of data in partitions for a configurable amount of time.
2. Azure Databricks picks up the messages in real time from IoT Hub, processes the data based on the business
logic, and sends the data to the serving layer for storage.
3. Downstream storage services, like Azure Cosmos DB, Azure SQL Data Warehouse, or Azure SQL Database, store
the IoT data, plus any transactional data, and act as a data source for the presentation and action layer.
4. Business analysts can use Microsoft Power BI to analyze warehoused data.
5. Web, mobile and other applications can be built on the serving layer as well. For example, we can expose APIs
based on the serving layer data for third-party uses.
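To make step 2 more concrete, the following PySpark sketch (for a Databricks notebook, where spark and sc are predefined) reads the vehicle telemetry stream from IoT Hub's Event Hub-compatible endpoint using the azure-eventhubs-spark connector, parses the JSON payload, and writes it to a Delta table in the serving layer. The connection string, message schema, and storage paths are placeholders, and the connector library must be attached to the cluster.

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

# Connection string for IoT Hub's Event Hub-compatible endpoint (placeholder values).
conn = "Endpoint=sb://<iothub-namespace>.servicebus.windows.net/;SharedAccessKeyName=<name>;SharedAccessKey=<key>;EntityPath=<iothub-name>"
eh_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn)
}

# Hypothetical telemetry schema for this example.
schema = (StructType()
          .add("vehicleId", StringType())
          .add("lat", DoubleType())
          .add("lon", DoubleType())
          .add("engineTempC", DoubleType()))

raw = spark.readStream.format("eventhubs").options(**eh_conf).load()
telemetry = raw.select(from_json(col("body").cast("string"), schema).alias("t")).select("t.*")

# Write the parsed stream to a Delta table in the serving layer (placeholder paths).
(telemetry.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/vehicle-telemetry")
    .start("/mnt/serving/vehicle-telemetry"))
```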
Components
IoT device-generated events (data or messages) are ingested, processed, and then stored for further analysis,
presentation, and action, using the following Azure components:
Azure IoT Hub is in the ingestion layer. IoT Hub also supports bi-directional communication back to devices,
allowing actions to be sent from the cloud or Azure IoT Edge to the device.
Azure Databricks is located in the transformation and analytics layer. Databricks uses the azure-eventhubs-
spark_2.11:2.3.6 Maven library to connect to IoT Hub's Event Hub compatible endpoint.
Azure Cosmos DB, Azure SQL Database, and Azure Synapse Analytics are in the Serving storage layer, where
Azure Databricks can write the data via data connectors.
Azure Synapse is a distributed system for storing and analyzing large datasets. Its use of massive parallel
processing (MPP) makes it suitable for running high-performance analytics.
Power BI is a suite of business analytics tools to analyze data and share insights. Power BI can query a semantic
model stored in Analysis Services, or it can query Azure Synapse directly.
Azure Active Directory (Azure AD) authenticates users when they connect to Azure Databricks. If we built a
cube in Analysis Services based on the Azure Synapse data model, we could use Azure AD to connect
to the Analysis Services server through Power BI. Data Factory can also use Azure AD to authenticate to Azure
Synapse via a service principal or Managed Service Identity (MSI).
Azure App Service can be used to build web and mobile applications. Azure API Apps can be used to expose data
to third parties, based on the data stored in the serving layer.

Alternatives
A more generalized big data pipeline could be implemented using other Azure components.
In the stream ingestion layer, we could use Event Hubs or Apache Kafka on HDInsight to ingest data. It is
important to note that these services only offer uni-directional communication, provide no device
management or edge capabilities, and are recommended for non-IoT scenarios such as streaming
data pipelines. Refer to Connecting IoT Devices to Azure: IoT Hub and Event Hubs for a more detailed
comparison.
In the transformation and analytics layer, we could use HDInsight Storm, HDInsight Spark or Azure Stream
Analytics to perform streaming analytics, and Azure Functions to perform payload transformations.
In the serving storage layer, we could use Analysis Services to provide a semantic model for your data. It can also
increase system performance when analyzing your data. You can build the model based on Azure Synapse data.
Azure Data Explorer and Time Series Insights could be used for storing and analyzing time series data.

Considerations
The technologies in this architecture were chosen based on the scale needed to process events, the SLA of the
services, and the cost and ease of management of the components.
Azure IoT Hub comes with a 99.9% SLA and has integrated disaster recovery capabilities.
Azure Databricks is optimized from the ground up for performance and cost-efficiency in the cloud. The
Databricks Runtime adds several key capabilities to Apache Spark workloads that can increase performance and
reduce costs by as much as 10-100x when running on Azure, including:
Azure Databricks integrates deeply with Azure databases and stores: Azure Synapse, Azure Cosmos DB, Azure
Data Lake Storage, and Azure Blob Storage
Autoscaling and autotermination for Spark clusters to automatically minimize costs.
Performance optimizations including caching, indexing, and advanced query optimization, which can
improve performance by as much as 10-100x over traditional Apache Spark deployments in cloud or on-
premises environments.
Integration with Azure Active Directory enables you to run complete Azure-based solutions using Azure
Databricks.
Role-based access in Azure Databricks enables fine-grained user permissions for notebooks, clusters,
jobs, and data.
Comes with Enterprise-grade SLAs.
Azure Cosmos DB is Microsoft's globally distributed, multi-model database. Azure Cosmos DB was built from
the ground up with global distribution and horizontal scale at its core. It offers turnkey global distribution across
any number of Azure regions by transparently scaling and replicating your data wherever your users are. You
can elastically scale throughput and storage worldwide, and pay only for the throughput and storage you need.
The massively parallel processing architecture of Azure Synapse provides scalability and high performance.
Azure Synapse has guaranteed SLAs and recommended practices for achieving high availability.
When analysis activity is low, the company can scale Azure Synapse on demand, reducing or even pausing
compute to lower costs.
The Azure Synapse security model provides connection security, authentication, and authorization via Azure AD
or SQL Server authentication, and encryption.

Pricing
Review Azure Databricks pricing, Azure IoT Hub pricing, and a pricing sample for a data warehousing scenario via
the Azure pricing calculator. Adjust the values to see how your requirements affect your costs.
Azure IoT Hub is a fully managed cloud service that makes it easy, fast, and cost-effective to ingest massive
amounts of IoT data.
Azure Databricks offers two distinct workloads on several VM Instances tailored for your data analytics
workflow — the Data Engineering workload makes it easy for data engineers to build and execute jobs, and the
Data Analytics workload makes it easy for data scientists to explore, visualize, manipulate, and share data and
insights interactively.
Azure Cosmos DB guarantees single-digit-millisecond latencies at the 99th percentile anywhere in the world,
offers multiple well-defined consistency models to fine-tune performance, and guarantees high availability with
multi-homing capabilities — all backed by industry leading comprehensive service level agreements (SLAs).
Azure Synapse allows you to scale your compute and storage levels independently. Compute resources are
charged per hour, and you can scale or pause these resources on demand. Storage resources are billed per
terabyte, so your costs will increase as you ingest more data.
Analysis Services is available in developer, basic, and standard tiers. Instances are priced based on query
processing units (QPUs) and available memory. To keep your costs lower, minimize the number of queries you
run, how much data they process, and how often they run.
Power BI has different product options for different requirements. Power BI Embedded provides an Azure-based
option for embedding Power BI functionality inside your applications. A Power BI Embedded instance is included
in the pricing sample above.

Next steps
Review the Azure IoT reference architecture that shows a recommended architecture for IoT applications on
Azure using PaaS (platform-as-a-service) components.
Review the Real-time analytics reference architecture that includes big data pipeline flow.
Review the Advanced analytics on big data reference architecture to get a peek on how different Azure
components can help build a big data pipeline.
Create a safe building

The world is cautiously reopening in a physical business environment where COVID-19 remains a factor. To help
keep people healthy, Cognizant adapted their OneFacility solution to make workspaces safer for everyone who
enters them. The Cognizant Safe Buildings solution provides a strategic approach to create safe buildings with
sustainably healthy working environments.
The solution monitors critical insights so you can act and respond to risks. It also builds confidence that you're
focused on keeping people safe. The solution reacts to measurements of human body temperature, physical
distancing, hand sanitization compliance, and air quality.
Using Azure IoT Hub and other Azure services, the Cognizant Safe Buildings solution brings together people,
regulations, analytics, and technology to transform smart buildings into safe buildings. Through the use of different
devices, the solution collects biometric and environmental data. When the system detects any deviation from
health and safety protocols in a building, the solution activates. Safe Buildings layers safety controls to protect, monitor,
and respond with real-time alerts.

Potential use cases


Safe Buildings adapts to stores, offices, factories, warehouses, recreation spaces, hotels, and restaurants. It's
adaptable to any interior space where people congregate.

Architecture

1. Different devices collect data: thermal imaging cameras, LIDAR sensors, occupancy sensors, smart watches
with hygiene apps, IR sensors, and air quality and temperature sensors. That data flows into the building's
IoT Edge server.
2. The server feeds the data to the Azure IoT Hub, Azure Data Factory, and Azure Blob storage.
3. The Azure IoT Hub sends data through an Azure Kubernetes Service (AKS).
4. AKS routes the data so Azure Stream Analytics and Azure Databricks can analyze and enrich it.
5. Safe Buildings sends the processed data to various data stores:
Data Lake - Azure Data Lake Storage
Time Insight - Azure Cosmos DB and Azure Database for PostgreSQL
Aggregate/Access data - Azure SQL Database
Reference Data - Azure Cache for Redis
6. From there, different services and APIs consume the data:
Data Service - These services consume the data, process it, and pass it on to downstream services
that notify users of any abnormalities that Safe Buildings detected.
Platform Service - These services analyze device, asset, spatial, and user metrics and pass the
information to a web app instance of Power BI. Power BI displays the statistics in a meaningful and
useful way.
Training - Azure Databricks and Azure Machine Learning take the data and update deep learning
models. Safe Buildings uses Azure Container Registry to distribute the models to the IoT Edge
server and the Data Service.
Command and Control - AKS passes the data from the Platform Service to Azure Functions.
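As a small illustration of the Time Insight store in step 5, the following Python sketch upserts a processed sensor reading into Cosmos DB with the azure-cosmos SDK. The account, database, container, partition key, and document fields are placeholders for this example only.

```python
# pip install azure-cosmos
from azure.cosmos import CosmosClient

# Placeholder account endpoint and key.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<account-key>")
# Assumes a container partitioned on /deviceId.
container = client.get_database_client("safebuildings").get_container_client("sensor-readings")

reading = {
    "id": "thermal-cam-12|2020-11-02T10:15:00Z",
    "deviceId": "thermal-cam-12",
    "buildingId": "hq-01",
    "bodyTempC": 37.9,
    "alert": True,  # flagged because the reading exceeds a configured threshold
}
container.upsert_item(reading)
```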

Components
Azure IoT Edge allows you to deploy your cloud workloads to run on Internet of Things (IoT) edge devices
via standard containers. Workloads like artificial intelligence, Azure and third-party services, or your own
business logic.
Azure IoT Hub enables highly secure and reliable communication between your IoT application and the
devices it manages.
Azure Kubernetes Service lets you deploy and manage containerized applications more easily with a fully managed
Kubernetes service.
Azure Stream Analytics is an easy-to-use, real-time analytics service designed for mission-critical workloads.
Azure Databricks unlocks insights from all your data and lets you build artificial intelligence (AI) solutions. Set up
your Apache Spark environment in minutes, autoscale, and collaborate on shared projects in an interactive
workspace.
Azure Data Lake Storage - A highly scalable and cost-effective data lake solution for big data analytics.
Azure Cosmos DB - A fully managed NoSQL database service for modern app development. Azure Cosmos
DB has guaranteed single-digit millisecond response times. It also guarantees 99.999-percent availability
backed by SLAs, automatic and instant scalability, and open-source APIs for MongoDB and Cassandra.
Azure Database for PostgreSQL - Focus on application innovation, not database management, with fully
managed and intelligent Azure Database for PostgreSQL.
Azure SQL Database - The intelligent, scalable, relational database service built for the cloud. It’s evergreen
and always up to date, with AI-powered and automated features that optimize performance and durability
for you.
Azure Cache for Redis - Fully managed, open source–compatible in-memory data store to power fast,
scalable applications.
Azure Machine Learning empowers developers and data scientists to build, train, and deploy machine
learning models faster. Continuous retraining, updating, and improving data collection and analysis makes
the models better over time.
Azure Monitor collects, analyzes, and acts on telemetry data from your Azure and on-premises
environments.
Power BI enables team members to discover insights hidden in your data.
Azure Functions lets you build and debug locally without additional setup, deploy and operate at scale in the
cloud, and integrate services using triggers and bindings.
In addition, Safe Buildings has access to these other Azure services: Azure Active Directory, Azure API
Management, Azure Blob storage, Azure Container Registry, Azure Data Factory, Azure Event Hubs, Azure
Key Vault, Azure Logic Apps, Azure Notification Hubs, Azure VMs, and Azure Web Apps.

Next steps
For more information, please contact [email protected].
Visit the Microsoft commercial marketplace for info on Cognizant's OneFacility.
Secure your IoT SaaS app with the Microsoft identity
platform

IoT device manufacturers are creating custom apps and services to better serve their customers’ device
management needs. Developers building IoT apps and services want to provide secure, frictionless access to cloud-
connected devices and the telemetry data sent by these devices to the cloud.
In this article we explain how a developer can take advantage of Azure Active Directory and the Microsoft identity
platform to create a secure SaaS app. We'll cover the basic how-tos and benefits of integrating your app with Azure
AD, as well as two common customer scenarios IoT developers may encounter.
By integrating with the Microsoft identity platform and configuring the app to be multi-tenant, developers can:
Ensure isolation of customer resources and data.
Enable access control by enforcing organizational policies and conditions based on time, location, and more.
Support auditing and reporting of access.
Allow users to sign in with existing work and social accounts, which in turn allows a single sign-on experience so
that users sign in once and are able to access other corporate resources and apps without signing in again.
To learn more about the topics covered in this article, check out the Microsoft identity platform homepage. The
quickstart guides there will help you get set up quickly to integrate your app with our platform.


Example scenario
Contoso is a manufacturer of IoT-ready chillers and pumps, which they sell to geographically distributed customers.
An in-house developer at Contoso is tasked with creating an app that customers of Contoso will use to manage the
chillers and pumps that will be connected to the cloud via Azure IoT. These devices will send telemetry data to Azure,
which is then used for predictive maintenance of the devices. The IoT app will be sold along with the industrial
chillers and pumps to two different types of customers, represented by the following examples:
Fabrikam, a large enterprise customer who has their own Azure AD tenant and would like their employees to
sign into the app using their Azure AD work accounts. The Fabrikam IT department will manage access and
apply organizational polices for access to the app as well as the IoT devices/resources and data the IoT app
manages.
Woodgrove, a customer who does not have an Azure AD tenant and would like users to log in using existing
email accounts such as Gmail or Microsoft accounts.

Building a multi-tenant app


Contoso's developer is building a SaaS app and they want the employees at the companies that purchase the app to
be able to sign in with existing credentials and only access data that belongs to their own company. This can be
done by configuring the app to be multi-tenant in Azure AD.
Multi-tenant apps are configured to accept sign-ins from multiple Azure AD tenants. Users in any Azure AD tenant
will be able to sign in to the application after consenting to use their account with the application. For most IoT
developers, the end users of the app will belong to a company with an Azure AD account. However, in the case that
their company does not have an Azure AD tenant, the B2B capabilities described in the Woodgrove scenario below
will allow users to sign in using existing email accounts such as Gmail or Microsoft accounts.
For step by step instructions on making an app multi-tenant, check out How to: Sign in any Azure Active Directory
user using the multi-tenant application pattern or this example multi-tenant SaaS application available on GitHub.
The basic process and benefits of integrating with Azure AD and making your IoT app multi-tenant are included
here.
Prerequisites
To get started, the developer of the app must belong to an Azure AD tenant. If they don't have an existing tenant,
they can create a new Azure AD tenant. Additionally, the developer needs to have permission to register an app in
their Azure AD tenant. You can check if you have sufficient permissions in the Azure portal.
Register the app
The first step is for the developer to register the app inside their own Azure AD tenant. This establishes an identity
for the app, and allows the developer to specify the permissions it will need to perform its tasks in any given tenant.
This is also the point when the developer specifies that it is a multi-tenant app.
When the app is registered, two objects are created in the tenant:
Application object - lives in the tenant where the app was registered (the "home" tenant) and serves as the
template for common and default properties.
Service principal object - defines the access policy and permissions for the application in the current tenant.
A service principal will be created in each tenant that uses this application.
Direct user sign-ins to the common endpoint
For multi-tenant apps, sign-in requests for Azure AD accounts should be sent to the common endpoint:
https://login.microsoftonline.com/common. When the Microsoft identity platform receives a request on the
common endpoint, it signs the user in and identifies which tenant the user belongs to. It sends this information
back to the app, which can then determine the resources this user should have access to based on tenant
membership. In this way, the developer can maintain separation of proprietary information between companies.

NOTE
Guest users' sign-in attempts must be directed to a tenant-specific endpoint (i.e. the developer's tenant)
https://login.microsoftonline.com/{TenantId_or_Name} . This is required when users are signing in with social accounts,
and we provide an example of this in the Woodgrove Bank scenario below.
1. User logs in to app
2. Azure AD determines the user's tenant and issues token with tenant ID
3. App accepts the token and user gets access to tenant-specific resources (note that the app developer must
enforce the separation of resources by tenant ID)
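The following Python sketch, using the MSAL library, shows one way this sign-in flow could look for a web app: the user is sent to the common endpoint, and after the redirect the app reads the tid claim from the ID token to scope the user to their own tenant's resources. The client ID, secret, redirect URI, and scopes are placeholders, and error handling is omitted.

```python
# pip install msal
import msal

AUTHORITY = "https://login.microsoftonline.com/common"   # multi-tenant sign-in endpoint
CLIENT_ID = "<app-registration-client-id>"               # placeholder
CLIENT_SECRET = "<client-secret>"                        # placeholder
SCOPES = ["User.Read"]
REDIRECT_URI = "https://localhost/callback"               # placeholder

app = msal.ConfidentialClientApplication(
    CLIENT_ID, authority=AUTHORITY, client_credential=CLIENT_SECRET)

# Step 1: redirect the user's browser to this URL to sign in on the common endpoint.
auth_url = app.get_authorization_request_url(SCOPES, redirect_uri=REDIRECT_URI)

def handle_redirect(auth_code):
    # Steps 2-3: exchange the returned authorization code for tokens, then read the
    # tenant ID (tid) claim so the app can scope the user to their own tenant's resources.
    result = app.acquire_token_by_authorization_code(
        auth_code, scopes=SCOPES, redirect_uri=REDIRECT_URI)
    claims = result.get("id_token_claims", {})
    return claims.get("tid")
```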
Obtain user (or admin) consent
When a user (or admin) logs in to a multi-tenant application for the first time, the application must ask for consent
to be registered in the user's directory and access directory data it needs to run. For example, if an application
needs to read calendar information about a user from Microsoft 365, that user is required to consent to this access
before the app can do so.
Developers can use the Azure AD consent framework to obtain consent via interactive prompts as part of the first
time log in experience. The prompts ensure that users and administrators know exactly what data the app is
requesting access to, and they can be confident that it isn't behaving with malicious intent. The consent framework
is built on OAuth 2.0 and its various flows, such as authorization code grant and client credentials grant, using
public or confidential clients. An example of the consent process is available in the Azure Active Directory consent
framework document, as well as this example multi-tenant application.
Once the user consents, a service principal defining the access policy and permissions for this app is created in the
user's tenant, and the sign-in will continue.
1. App sends sign-in request to common for the user
2. User responds to sign-in prompt and provides any required consent to app
3. Azure AD creates service principal (SP) in customer's tenant
In addition, the developer can decide based on the requirements of the app if it needs consent from the user (to
access data related to them) or an admin (to allow access to the entire organization's data), although some tenants
may require admin approval for any new application. This is covered in more depth in How to: Sign in any Azure
Active Directory user using the multi-tenant application pattern.
Implement authorization with Azure AD App Roles
In addition to authentication, the developer can enforce authorization within the app. Authorization is used to
ensure once a user has access to an app, they are given proper privileges to access data and resources.
Authorization rules can be enforced with Role-Based Access Control (RBAC) and Role Claims. When using RBAC, a
tenant admin grants permissions to use or view resources based on roles, and not to individual users or groups.
The developer can add app roles to an app and declare them in the app's manifest. When a user signs into the app,
Azure AD emits a claim for each role that the user has been granted, individually and from their group
membership. The app can use these role claims with the tenant ID claim to do an access check. This will determine
whether a user is allowed to perform an action on a specific resource associated with an Azure AD tenant.
An app does not need any extra Active Directory permissions, other than reading the user's profile, to enable app
roles. Role assignment managers within each tenant can then assign these pre-defined app roles to users and
manage who has access to which resource following their organization's needs.
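A minimal sketch of such an access check might combine the tenant ID claim with the role claims, along these lines (the claim names follow Azure AD token conventions; the role name and resource model are hypothetical):

```python
def is_authorized(token_claims, required_role, resource_tenant_id):
    """Hypothetical access check: the caller must come from the tenant that owns the
    resource (tid claim) and hold the required app role (roles claim)."""
    if token_claims.get("tid") != resource_tenant_id:
        return False  # resource belongs to a different customer tenant
    return required_role in token_claims.get("roles", [])

# Example: only users granted the hypothetical "Device.Manager" app role in their tenant
# may manage a device record.
# allowed = is_authorized(claims, "Device.Manager", device_record["tenantId"])
```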
Conditional access to IoT resources
If customers have additional requirements to restrict access to IoT resources based on specific conditions, they can
use Azure AD Conditional Access in their own tenants. Common signals that Conditional Access can take into
account include:
User or group membership
IP location information
Devices marked as compliant
Specific applications
Get the app to customers
Finally, the developer may want to provide an e-commerce site for customers to purchase licenses for the app. This
e-commerce site is separate from the IoT app itself. A sample workload describing how to build an e-commerce
front-end is available on the Azure Reference Architecture Center.
To increase the discoverability and adoption of your IoT app, the developer should publish the app to the Azure AD
App gallery. The Azure AD application gallery is in the Azure Marketplace app store, where all application
connectors are published for single sign-on and user provisioning. IT administrators can add connectors from the
app gallery, and then configure and use the connectors for single sign-on and provisioning for additional
convenience for the end user of the app.

Example app usage scenarios


In the first part of this document we explained the basic configurations necessary for a developer to set up a multi-
tenant app. Now, let's explore a couple customer scenarios to explain how the authentication process will work for
the IoT app. The following two customers, Fabrikam, Inc. and Woodgrove Bank, will sign in to our IoT developer's
app in different ways. We'll go through each scenario and discuss how Azure AD helped the developer meet the
customers' needs.
Scenario 1: Customer with an existing Azure AD tenant
Fabrikam, Inc. is a large enterprise customer of the developer and has its own Azure AD tenant. Fabrikam, Inc. would
like its employees to sign into the IoT app using their existing Azure AD work accounts. In addition, the company's
IT departments will manage access and apply organizational policies for access to the app, the IoT devices, and data
the IoT app manages. Fabrikam, Inc.'s global admin will review the permissions required by any app before allowing
it in the tenant.
The developer's app supports these requirements as follows:
Signing in Fabrikam, Inc.'s users with their existing work accounts
Since the app has been registered as a multi-tenant app, Fabrikam, Inc. users can sign into the app with their Azure
AD work credentials. The app must send the sign-in request with these credentials to the common endpoint,
https://login.microsoftonline.com/common. The Microsoft identity platform will discover which tenant the user is
from and send a sign-in response token that contains information on which tenant the user belongs to. The
developer can use this information to determine which resources the user can access.
Admin consent for the app
Fabrikam, Inc.'s tenant administrators can set the enterprise app settings so that administrator approval is required
before a user in the tenant can register or provide consent to a new app. This allows the administrators to control
which apps are being run in their tenant. If a user is blocked from registering or signing in, they'll see a generic
error message that says they're unauthorized to access the app and they should ask their admin for help.
In addition, the admins can set up the tenant to require admin consent for certain permissions (e.g. the ability to
write data to your Azure AD tenant) regardless of whether regular users can register or consent to new apps.
TIP
The new admin consent workflow gives tenant admins a secure way to grant access to applications that require admin
approval while allowing them to disable their end-user’s ability to consent. When a user tries to access an app but is unable
to provide consent, they can use the dialog box with the error message to send a request for admin approval. Benefits of the
admin consent workflow include the following:
Helps admins manage access to organizational data and regulate the enterprise apps within their organization.
Gives administrators increased visibility into what app users need access to.
The new workflow gives all end-users the ability to request access to apps that they were previously unable to access because
of a lack of permissions. The request is sent via email to admins who have been designated as reviewers. A reviewer acts on
the request, and the user is notified of the action.

Scenario 2: Customer without Azure AD tenant


Most customers for IoT app developers will likely be larger companies that have their own Azure AD tenants.
However, in the case that there are customers who want to use the app and do not have Azure AD, they can still use
the app by taking advantage of features available with Azure AD B2B as described in this scenario.
Woodgrove Bank is another company using the developer's app. Woodgrove Bank does not have an Azure AD
tenant and would like its employees to log in with existing email accounts such as Gmail or Microsoft accounts. As
with other customers, isolation of Woodgrove Bank's users and resources from other customers must be
maintained. In addition, the developer would like to ensure all access to the app can be audited and reported on.
To enable this, the developer and IT Admins can use Azure AD B2B (business-to-business) collaboration, adding
Woodgrove Bank's users as guests in the app's home tenant. Azure AD B2B will fulfill the business requirements as
follows.
Adding guest users to a tenant
The developer can enable guests to log in with existing email accounts using an invitation and redemption process
that involves sending either a direct link or an email to the guest. The process can be customized using Azure AD
business-to-business APIs.
The developer will need to decide which tenant they want to keep the guest user accounts in. These guest users will
show up in the directory. If this will be a common scenario for the app, the developer and company
administrators may decide it is best to register the app in a new tenant, separate from the company's corporate
tenant, so that the guest user accounts are tracked there and not in the corporate tenant.
Authenticating guest users
To enable guest sign-ins, the developer must ensure that guest users' sign-in attempts are directed to the tenant-
specific endpoint (i.e. the developer's tenant) https://login.microsoftonline.com/{TenantId_or_Name}. Note that this
differs from the endpoint used for users with existing Azure AD accounts, so it may require the developer to
implement a separate sign-in page specifically for guests.
There are several options for authenticating guests who wish to use non-Azure AD accounts with your app:
Direct federation. Direct federation works with identity systems that support the SAML or WS-Fed standards.
Google federation. Gmail users are authenticated using Google federation.
One Time Passcode (OTP) authentication. Provides a temporary code to authenticate users who cannot be
authenticated through other means.
Isolation
The developer can create a security group in their home tenant that will act as the isolation boundary for
Woodgrove Bank's users and their IoT devices. The developer or IT Admin in the developer's tenant can choose to
designate a Woodgrove Bank employee as the owner of the security group so that they can control security group
membership. The developer's app will factor in guest users' group memberships when making access control
decisions.
In addition, the Woodgrove security group owner can set group-level access policies that will control how the
group members can access the app in order to enforce company policies.

NOTE
Depending on the business logic required by your app, security group ownership and membership can be managed by either
an account manager who is an employee of the developer's company or by a Woodgrove guest user who is assigned as
owner of the group. To learn more about Azure AD security group functionality, see Manage app and resource access using
Azure Active Directory groups.

Runtime

1. The user signs into the app using the targeted tenant endpoint.
2. Azure AD issues a token with tenant ID and group membership claims.
3. The app accepts the token and user gets access to security group-specific resources.

Conclusion
This has been an overview of how to use Azure AD to secure your application, focusing on common IoT app
scenarios. The links below provide more information on the topics discussed in this document.


Other resources
Microsoft identity platform best practices and recommendations
Manage identity in multi-tenant applications
Azure AD developer guidance
Permissions and consent
Signing up and signing in with Azure AD
Self-service group management
Restrict your app to a set of users
Microsoft identity platform (v2.0) Authentication flows and app scenarios
Controlling IoT devices using a Voice Assistant

Solution Idea

Create seamless conversational interfaces with all of your internet-accessible devices, from your connected
television or fridge to devices in a connected power plant. By combining Azure Speech Service, Language
Understanding Service (LUIS) and Azure Bot Framework, developers can create natural, human-like conversational
interfaces to control smart devices using Azure IoT Hub.

Architecture

Data Flow
1. Using voice, the user asks the voice assistant app to turn on the exterior house lights.
2. Using the Speech SDK, the app connects to Direct Line Speech. If keywords are confirmed by Keyword
Verification, the speech is transcribed to text and sent to the Bot Service.
3. The Bot Service connects to Language Understanding service (LUIS). LUIS allows an application to understand
what a person wants in their own words. The intent of the user's request (example: TurnOnLight) is returned to
the Bot Service.
4. The request is relayed to the device.
If the device is connected to Azure IoT Hub, Bot Service connects to Azure IoT Hub Service API and sends
the command to the device using either a Direct Method, an update to the device twin's Desired Property,
or a Cloud to Device message.
If the device is connected to a third party IoT cloud, Bot Service connects to the third-party service API
and sends a command to the device.
5. The Bot returns the results of the command to the user by generating a response that includes the text to speak.
6. The response is turned into audio using the Text-to-speech service and passed back to the voice assistant app by
Direct Line Speech.
7. Application Insights gathers runtime telemetry to help developers understand bot performance and usage.
8. Azure App Service hosts the Bot Service application.
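To illustrate the direct-method branch of step 4, the following Python sketch uses the azure-iot-hub service SDK to invoke a TurnOnLight direct method on the target device. The method name, payload, and connection details are assumptions for this example rather than part of the reference solution.

```python
# pip install azure-iot-hub
from azure.iot.hub import IoTHubRegistryManager
from azure.iot.hub.models import CloudToDeviceMethod

def turn_on_light(service_conn_str, device_id):
    """Invoke a direct method on the target device (step 4 of the data flow)."""
    registry = IoTHubRegistryManager.from_connection_string(service_conn_str)
    method = CloudToDeviceMethod(
        method_name="TurnOnLight",
        payload={"zone": "exterior"},      # hypothetical payload understood by the device
        response_timeout_in_seconds=30)
    response = registry.invoke_device_method(device_id, method)
    return response.status, response.payload
```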

Components
1. Voice assistants documentation
2. Tutorial: Voice-enable your bot using the Speech SDK
3. What is Direct Line Speech
4. Azure Bot Service
5. Speech to Text
6. Text to Speech
7. Custom Keywords
8. Language Understanding Service (LUIS)
9. Azure IoT Hub
IoT telemetry analytics with Azure Data Explorer

This architecture pattern demonstrates near real-time analytics over fast flowing, high volume streaming data from
IoT devices, sensors, connected buildings and vehicles, and so on. It focuses on integration of Azure Data Explorer
with other IoT services to cater to both operational and analytical workloads using Cosmos DB and Azure Data
Explorer.
This architecture is already being used by Microsoft customers for IoT device telemetry analytics. For example,
Bosch uses this to combine real-time road conditions with weather data for safer autonomous driving.

Data flow
1. Ingest a wide variety of fast-flowing streaming data such as logs, business events, and user activities from
various sources such as Azure Event Hub, Azure IoT Hub, or Kafka.
2. Process data in real-time using Azure Functions or Azure Stream Analytics.
3. Store streamed JSON messages in Cosmos DB to serve a real-time operational application.
4. Ingest data into Azure Data Explorer with low-latency and high throughput using its connectors for Azure Event
Hub, Azure IoT Hub, Kafka, and so on. Alternatively, ingest data through Azure Storage (Blob or ADLS Gen2),
which uses Azure Event Grid and triggers the ingestion pipeline to Azure Data Explorer. You can also
continuously export data to Azure Storage in compressed, partitioned parquet format and seamlessly query
that data as detailed in the continuous data export overview.
5. Azure Data Explorer is a big data analytics store for serving near real-time analytics applications and
dashboards. As this pattern serves both operational and analytical use cases, data can either be routed to Azure
Data Explorer and Cosmos DB in parallel, or from Cosmos DB to Azure Data Explorer.
Integrate Cosmos DB with Azure Data Explorer using change feed. Every transaction in Cosmos DB can
trigger an Azure Function via this change feed. Data can then be streamed to Event Hub and ingested
into Azure Data Explorer.
From the Azure Function, invoke Azure Digital Twins (ADT) APIs. ADT is used for storing digital models of
physical environments. This service streams data to Event Hub, which then gets ingested to Azure Data
Explorer.
6. Gain insights from data stored in Azure Data Explorer by any of the following methods:
Build a custom analytics app that invokes APIs exposed by ADT and Azure Data Explorer to blend the data
from both sources.
Build near real-time analytics dashboards using Azure Data Explorer dashboards, Power BI, or Grafana.
Create and schedule alerts and notifications using the Azure Data Explorer connector for Azure Logic
Apps.
Analyze data using Azure Data Explorer Web UI, Kusto.Explorer, or Jupyter notebooks.
7. Azure Data Explorer provides native advanced analytics capabilities for time series analysis, pattern recognition,
anomaly detection and forecasting, and machine learning. Azure Data Explorer is also well integrated with ML
services such as Databricks and Azure Machine Learning. This integration allows you to build models using
other tools and services and export ML models to Azure Data Explorer for scoring data.
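As a small example of step 6, the following Python sketch uses the azure-kusto-data SDK to run a KQL query against the Azure Data Explorer database that holds the telemetry. The cluster URI, database, table, and column names are placeholders for illustration.

```python
# pip install azure-kusto-data
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

cluster = "https://<adx-cluster>.<region>.kusto.windows.net"   # placeholder cluster URI
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(cluster)
client = KustoClient(kcsb)

# Hypothetical table and column names; the KQL summarizes per-device telemetry over the last hour.
query = """
Telemetry
| where Timestamp > ago(1h)
| summarize avgTemp = avg(Temperature), events = count() by DeviceId, bin(Timestamp, 5m)
| order by Timestamp asc
"""
response = client.execute("iot-analytics-db", query)
for row in response.primary_results[0]:
    print(row["DeviceId"], row["avgTemp"], row["events"])
```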

Components
Azure Event Hub: Fully managed, real-time data ingestion service that’s simple, trusted, and scalable.
Azure IoT Hub: Managed service to enable bi-directional communication between IoT devices and Azure.
Kafka on HDInsight: Easy, cost-effective, enterprise-grade service for open source analytics with Apache Kafka.
Azure Cosmos DB: Fully managed fast NoSQL database service for modern app development with open APIs
for any scale.
Azure Data Explorer: Fast, fully managed and highly scalable data analytics service for real-time analysis on
large volumes of data streaming from applications, websites, IoT devices, and more.
Azure Data Explorer Dashboards: Natively export Kusto queries that were explored in the Web UI to optimized
dashboards.
Azure Digital Twins: Create next-generation IoT solutions that model the real world.

Next steps
For more information, see Azure Data Explorer documentation.
Optimizing curbside pickup with Azure IoT

Contoso is a European retailer operating mid-sized supermarkets. They have grown through the years and are now
one of the largest retailers, with more than 1000 stores located in both cities and suburbs.

Challenges faced
What are the business challenges?
With the recent COVID-19 pandemic, customers have been making fewer physical trips to the supermarkets as
many are concerned with health risks and are practicing safe distancing. As a result, Contoso has also seen an
increase in the usage of buy online, pick up in store (BOPIS), which is also known as curbside pickup.
In order to ensure that their customers get the freshest quality produce, Contoso starts packing temperature-
controlled items when their customers arrive. Due to the rise in usage and time taken to pack fresh produce, there
is a longer than usual waiting time for curbside pickup. These longer waiting times have increased customer
dissatisfaction, and surveys have shown that customers are choosing other retailers.
Some questions to ask to help understand the challenges are:
What is the estimated impact caused by these issues?
What are current operations like?

Business outcomes
What are the desired business outcomes?
With 59% of consumers polled saying they're likely to continue choosing curbside pickup even after the
pandemic, Contoso wants to improve the efficiency of their curbside operations. For curbside pickups, Contoso
needs information on how far away each customer is and when the customer arrives at the parking lot.
What should the solution do for the business?
The solution should provide alerts to the store associates and trigger a work order to start packing the fresh
produce for customers that are near the outlet. This decreases customer waiting time, improving a key aspect of
the curbside pickup experience.

Requirements
What are key technical and non-technical requirements?
When accessing data about customers' movements, Contoso must protect their customers' personal privacy. This
is an essential prerequisite under regulatory requirements.
There are cases where a vehicle's license plate details do not match the database, or that information doesn't exist.
To mitigate such data gaps, Contoso would like to use GPS to help identify the location of their customers.

Patterns to address challenges


How can technology like Internet of Things (IoT) help solve your business challenges?
The following table provides a summary of common use cases and corresponding IoT solutions. Each use case is
an example of how an IoT process pattern can be applied to real-world scenarios.

Use case: Obtain license plate details and cross-reference them with the respective customer order when the
vehicle turns into the parking lot, to alert the store associate immediately to start packing and preparing the order.
Solution: Use video analytics to detect license plates when the vehicle turns into the parking lot. This information
is reconciled with the order management system, and a task is sent to Teams, which alerts and schedules a store
associate to start packing.
Pattern: Monitor and manage loop - The video analytics is part of the monitoring layer, and the insights are
actioned by the store associates as part of the manage pattern.

Use case: Notify the customer that the store has started packing and will deliver the order soon.
Solution: When the store associate starts the task, the system cross-checks the customer's location against the
geofence rule. A notification is sent to the customer to let them know that their delivery is on the way when the
system detects that the customer is within the boundaries of the geofence.
Pattern: Monitor and manage loop - The geofence rule monitors the location of the customer and sends an alert
when the customer is in the vicinity (manage).

Reference architecture

1. Video feed is obtained as cars come into the parking area. The IP camera hosting a Real-Time Streaming
Protocol (RTSP) server sends the feed to the Live Video Analytics (LVA) module. The LVA module processes the
frame rate and sends the image to Azure Cognitive Services running on the gateway. Azure Cognitive
Services extracts only the license plate details and sends it to the cloud application.
2. Azure IoT Central is used because it is a fully managed application platform. It allows Contoso to extend it
easily and focus on features that directly impact the business.
3. License plate details are queued in Event Hubs, which routes them to Azure Storage for long-term storage and
allows other services to use them.
4. The license plate details are sent to the curbside pickup application via Azure Functions. The serverless
design allows Contoso to lower their infrastructure management and cost.
5. The license plate details are cross-referenced with the order management system. The order management
system is usually integrated with the Customer Relationship Management system to provide details of the
license plate. When a match is confirmed, the order packing task is sent to Microsoft Teams. A store
associate will be notified to start preparing the order immediately and deliver it to the customer upon
arrival.
6. The curbside pickup application also uses Azure Maps geofence triggers, based on rules, to start the order
packing. The triggers can be used to reconcile the license plate notification (described in the previous point),
or they can act as a standalone function in the absence of license plate information.
7. Once the store associate begins on the task, the application will send a notification to the customer to
inform them of the order progress. Customers have mentioned that this is appreciated as it completes the
communication loop with the customer and presents accountability in the process.
8. Order pickup details are written back into storage so that Contoso can understand the time taken to work on
the order.
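A rough sketch of step 4 might look like the following Azure Functions (Python) Event Hubs trigger, which forwards each license plate event to the curbside pickup application's API. The API endpoint, payload shape, and binding configuration (kept in function.json and not shown) are assumptions for this example.

```python
import json
import logging
import urllib.request

import azure.functions as func

PICKUP_API = "https://curbside.contoso.example/api/arrivals"   # hypothetical endpoint

def main(event: func.EventHubEvent):
    # Each event carries the license plate details extracted at the edge (step 1).
    plate = json.loads(event.get_body().decode("utf-8"))
    logging.info("License plate event: %s", plate)

    # Forward the event to the curbside pickup application API (step 4).
    req = urllib.request.Request(
        PICKUP_API,
        data=json.dumps(plate).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST")
    with urllib.request.urlopen(req) as resp:
        logging.info("Pickup API responded with status %s", resp.status)
```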

Architecture considerations
Privacy
Azure IoT Edge is selected as it's a runtime that allows the use and orchestration of container-based modules. The
solution uses Azure Cognitive Services running on the edge device and only the license plate details are sent to the
application. This process ensures that no facial images or privacy data are sent to storage.
Geofence
The application uses Azure Maps, which allows Contoso to create geofence rules. These rules provide another data
point to confirm a customer's location. The geofence triggers provide additional accuracy on the estimated time of
arrival.
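For intuition only, the following Python sketch shows the kind of proximity check a geofence rule performs; the production solution uses Azure Maps geofence rules rather than hand-rolled code like this.

```python
import math

def within_geofence(lat, lon, center_lat, center_lon, radius_m):
    """Illustrative haversine-based proximity check around a store's pickup area."""
    r = 6371000.0  # approximate Earth radius in meters
    phi1, phi2 = math.radians(lat), math.radians(center_lat)
    dphi = math.radians(center_lat - lat)
    dlmb = math.radians(center_lon - lon)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    distance = 2 * r * math.asin(math.sqrt(a))
    return distance <= radius_m
```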

Components
Azure IoT Edge runs analytics and applications on-premises to ensure low latency, lower bandwidth usage, and
privacy (that is, only the license plate details are extracted).
Live Video Analytics on IoT Edge offers the capability to capture, record, and analyze live video. This allows
Contoso to focus on building features for the business rather than managing the video processing pipeline.
Azure Cognitive Services allows the AI model to run on the edge to adhere to privacy regulations and lower
the bandwidth required. Contoso can utilize the service without spending engineering efforts for creating
and training the model.
Azure IoT Central is a fully managed application platform that reduces the burden and cost of developing,
managing, and maintaining enterprise-grade IoT solutions.
Event Hubs is used to queue the events sent to the curbside pickup application. Event Hubs decouples event
production from consumption in a distributed application.
Azure Storage is used to store raw data for analysis. Contoso has decided to use this service as they are
storing the objects in flat namespace.
Azure Functions is a serverless service used to process the events received. Without having to maintain the
infrastructure, Contoso can write single function programs to send data from Event Hubs to the curbside
pickup application API. It is also used to read the change feed in Cosmos DB and write data into Azure Storage.
Azure Cosmos DB provides a low-latency database with guaranteed availability and automatic scalability.
Contoso requires low-latency reads and writes for a seamless user experience. The curbside pickup application
stores the data as NoSQL documents because of the variety in order SKUs. As the application is available
throughout Europe, Contoso wants a turnkey database that provides multi-master reads and writes.
Azure Kubernetes Service. The curbside pickup application is based on a microservice design. It is built using
a cluster of containers and managed as a hosted Kubernetes service. The Kubernetes masters are managed by
Azure, which handles critical tasks like health monitoring and maintenance.
Azure Maps provides geofencing as a service, which is required to gauge customer proximity. Contoso also uses
Azure Maps for indoor mapping.
Azure Notification Hubs is a massively scalable mobile push notification engine for quickly sending millions
of notifications to iOS, Android, and Windows devices. Using this service, Contoso can easily broadcast notifications to
their customers.
For more detailed discussions, see the IoT reference architecture document to understand and explore the various
implementation choices available.

Next steps
For more details on implementations, take a look at some of these retail partner solutions.
Oneview Commerce
Ombori
Azure Industrial IoT Analytics Guidance

This article series shows a recommended architecture for an Industrial IoT (IIoT) analytics solution on Azure using
PaaS (Platform as a service) components. Industrial IoT or IIoT is the application of Internet of Things in the
manufacturing industry.
An IIoT analytics solution can be used to build a variety of applications that provide:
Asset monitoring
Process dashboards
Overall equipment effectiveness (OEE)
Predictive maintenance
Forecasting
Such an IIoT analytics solution relies on real-time and historical data from industrial devices and control systems
located in discrete and process manufacturing facilities. These include PLCs (Programmable Logic Controller),
industrial equipment, SCADA (Supervisory Control and Data Acquisition) systems, MES (Manufacturing Execution
System), and Process Historians. The architecture covered by this series, includes guidance for connecting to all
these systems.
A modern IIoT analytics solution goes beyond moving existing industrial processes and tools to the cloud. It
involves transforming your operations and processes, embracing PaaS services, and leveraging the power of
machine learning and the intelligent edge to optimize industrial processes.
The following list shows some typical personas who would use the solution and how they would use it:
Plant Manager - responsible for the entire operations, production, and administrative tasks of the
manufacturing plant.
Production Manager - responsible for production of a certain number of components.
Process Engineer - responsible for designing, implementing, controlling, and optimizing industrial processes.
Operations Manager - responsible for overall efficiency of operation in terms of cost reduction, process time,
process improvement, and so on.
Data Scientist – responsible for building and training predictive Machine Learning models using historical
industrial telemetry.
The following architecture diagram shows the core subsystems that form an IIoT analytics solution.
NOTE
This architecture represents an ingestion-only pattern. No control commands are sent back to the industrial systems or
devices.

The architecture consists of a number of subsystems and services, and makes use of the Azure Industrial IoT
components. Your own solution may not use all these services or may have additional services. This architecture
also lists alternative service options, where applicable.

IMPORTANT
This architecture includes some services marked as "Preview" or "Public Preview". Preview services are governed by
Supplemental Terms of Use for Microsoft Azure Previews.

Industrial Systems and Devices


In process manufacturing, industrial equipment (for example, flow monitors, pumps, and so on) is often
geographically dispersed and must be monitored remotely. Remote Terminal Units (RTUs) connect remote
equipment to a central SCADA system. RTUs work well in conditions where connectivity is intermittent, and no
reliable continuous power supply exists.
In discrete manufacturing, industrial equipment (for example, factory robots, conveyor systems, and so on) are
connected and controlled by a PLC. One or more PLCs may be connected to a central SCADA system using
protocols such as Modbus or other industrial protocols.
In some cases, the data from SCADA systems is forwarded and centralized in an MES or a historian software (also
known as process historian or operational historian). These historians are often located in IT controlled networks
and have some access to the internet.
Frequently, industrial equipment and SCADA systems are in a closed Process Control Network (PCN), behind one
or more firewalls, with no direct access to the internet. Historians often contain industrial data from multiple
facilities and are located outside of a PCN. So, connecting to a historian is often the path of least resistance, rather
than connecting to a SCADA, MES, or PLC. If a historian is not available, then connecting to an MES or SCADA
system would be the next logical choice.
The connection to the historian, MES, or SCADA system will depend on what protocols are available on that
system. Many systems now include Industry 4.0 standards such as OPC UA. Older systems may only support
legacy protocols such as Modbus, ODBC, or SOAP. If so, you will most likely require protocol translation software
running on an intelligent edge device.

Intelligent Edge
Intelligent edge devices perform some data processing on the device itself or on a field gateway. In most industrial
scenarios, the industrial equipment cannot have additional software installed on it. This is why a field gateway is
required to connect the industrial equipment to the cloud.
Azure IoT Edge
To connect industrial equipment and systems to the cloud, we recommend using Azure IoT Edge as the field
gateway for:
Protocol and identity translation;
Edge processing and analytics; and
Adhering to network security policies (ISA 95, ISA 99).
Azure IoT Edge is a free, open source field gateway software that runs on a variety of supported hardware devices
or a virtual machine.
IoT Edge allows you to run edge workloads as Docker container modules. The modules can be developed in
several languages, with SDKs provided for Python, Node.js, C#, Java and C. Prebuilt Azure IoT Edge modules from
Microsoft and third-party partners are available from the Azure IoT Edge Marketplace.
Real-time industrial data is encrypted and streamed through Azure IoT Edge to Azure IoT Hub using AMQP 1.0 or
MQTT 3.1.1 protocols. IoT Edge can operate in offline or intermittent network conditions providing store and
forward capabilities.
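As a rough illustration of this flow, the following sketch uses the azure-iot-device Python SDK to send one JSON telemetry message from a downstream device through a transparent IoT Edge gateway to IoT Hub. It is not part of the reference implementation; the connection string, gateway hostname, and telemetry fields are placeholders.

```python
# Minimal sketch: send a telemetry message to IoT Hub through a transparent
# IoT Edge gateway using the azure-iot-device SDK (pip install azure-iot-device).
# Connection string, gateway hostname, and payload fields are placeholders.
import json
from azure.iot.device import IoTHubDeviceClient, Message

CONNECTION_STRING = "HostName=<your-hub>.azure-devices.net;DeviceId=<device-id>;SharedAccessKey=<key>"
GATEWAY_HOSTNAME = "myedgegateway.contoso.local"  # hypothetical IoT Edge gateway

# In production, this device must also trust the gateway's root CA certificate.
client = IoTHubDeviceClient.create_from_connection_string(
    CONNECTION_STRING + ";GatewayHostName=" + GATEWAY_HOSTNAME
)
client.connect()

telemetry = {"tag": "STN001_14P103_RUN", "value": 1, "timestamp": "2020-11-02T12:00:00Z"}
msg = Message(json.dumps(telemetry))
msg.content_type = "application/json"
msg.content_encoding = "utf-8"
client.send_message(msg)  # EdgeHub stores and forwards if connectivity is lost

client.disconnect()
```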

There are two system modules provided as part of the IoT Edge runtime.
The EdgeAgent module is responsible for pulling down the container orchestration specification (manifest)
from the cloud, so that it knows which modules to run. Module configuration is provided as part of the module
twin.
The EdgeHub module manages the communication from the device to Azure IoT Hub, as well as the
inter-module communication. Messages are routed from one module to the next using JSON configuration.
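For illustration, the routes section of an IoT Edge deployment manifest might look like the following sketch; the module names and time-to-live value are hypothetical and would be replaced by the modules in your own deployment.

```json
{
  "$edgeHub": {
    "properties.desired": {
      "schemaVersion": "1.0",
      "routes": {
        "opcToFilter": "FROM /messages/modules/opcpublisher/outputs/* INTO BrokeredEndpoint(\"/modules/filtermodule/inputs/input1\")",
        "filterToCloud": "FROM /messages/modules/filtermodule/outputs/* INTO $upstream"
      },
      "storeAndForwardConfiguration": { "timeToLiveSecs": 7200 }
    }
  }
}
```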
Azure IoT Edge automatic deployments can be used to specify a standing configuration for new or existing devices.
This provides a single location for deployment configuration across thousands of Azure IoT Edge devices.
A number of third-party IoT Edge gateway devices are available from the Azure Certified for IoT Device Catalog.
IMPORTANT
Proper hardware sizing of an IoT Edge gateway is important to ensure edge module performance. See the performance
considerations for this architecture.

Gateway Patterns
There are three patterns for connecting your devices to Azure via an IoT Edge field gateway (or virtual machine):
1. Transparent - Devices already have the capability to send messages to IoT Hub using AMQP or MQTT.
Instead of sending the messages directly to the hub, they send them to IoT Edge, which in turn passes them
on to IoT Hub. Each device has an identity and device twin in Azure IoT Hub.
2. Protocol Translation - Also known as an opaque gateway pattern. This pattern is often used to connect
older brownfield equipment (for example, Modbus) to Azure. Modules are deployed to Azure IoT Edge to
perform the protocol conversion. Devices must provide a unique identifier to the gateway.
3. Identity Translation - In this pattern, devices cannot communicate directly to IoT Hub (for example, OPC
UA Pub/Sub, BLE devices). The gateway is smart enough to understand the protocol used by the
downstream devices, provide them with an identity, and translate IoT Hub primitives. Each device has an identity
and device twin in Azure IoT Hub.
Although you can use any of these patterns in your IIoT Analytics Solution, your choice will be driven by which
protocol is installed on your industrial systems. For example, if your SCADA system supports EtherNet/IP, you will
need to use protocol translation software to convert EtherNet/IP to MQTT or AMQP. See the Connecting to
Historians section for additional guidance.
IoT Edge gateways can be provisioned at scale using the Azure IoT Hub Device Provisioning Service (DPS). DPS is a
helper service for IoT Hub that enables zero-touch, just-in-time provisioning to the right IoT hub without requiring
human intervention, enabling customers to provision millions of devices in a secure and scalable manner.
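As a minimal sketch of DPS provisioning with symmetric key attestation (the ID scope, registration ID, and key below are placeholders), a device or gateway can register itself and then connect to the IoT hub it is assigned:

```python
# Minimal sketch: zero-touch provisioning through the IoT Hub Device Provisioning
# Service using symmetric key attestation (pip install azure-iot-device).
# All identifiers and keys are placeholders.
from azure.iot.device import ProvisioningDeviceClient, IoTHubDeviceClient

PROVISIONING_HOST = "global.azure-devices-provisioning.net"
ID_SCOPE = "0ne00XXXXXX"            # hypothetical DPS ID scope
REGISTRATION_ID = "edge-gateway-001"
SYMMETRIC_KEY = "<device-symmetric-key>"

provisioning_client = ProvisioningDeviceClient.create_from_symmetric_key(
    provisioning_host=PROVISIONING_HOST,
    registration_id=REGISTRATION_ID,
    id_scope=ID_SCOPE,
    symmetric_key=SYMMETRIC_KEY,
)
result = provisioning_client.register()

if result.status == "assigned":
    # Connect to the IoT hub that DPS assigned to this device.
    device_client = IoTHubDeviceClient.create_from_symmetric_key(
        symmetric_key=SYMMETRIC_KEY,
        hostname=result.registration_state.assigned_hub,
        device_id=result.registration_state.device_id,
    )
    device_client.connect()
    device_client.disconnect()
```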

OPC UA
OPC UA is the successor to OPC Classic (OPC DA, AE, HDA). The OPC UA standard is maintained by the OPC
Foundation. Microsoft has been a member of the OPC Foundation since 1996 and has supported OPC UA on
Azure since 2016.
Industry and domain-specific Information Models can be created based on the OPC UA Data Model. The
specifications of such Information Models (also called industry standard models since they typically address a
dedicated industry problem) are called Companion Specifications. Using the OPC UA infrastructure to
exchange such industry information models enables interoperability at the semantic level. OPC UA can use a
number of transport protocols including MQTT, AMQP, and UADP.
Microsoft has developed open source Azure Industrial IoT components, based on OPC UA, which implement
identity translation pattern:
OPC Twin consists of microservices and an Azure IoT Edge module to connect the cloud and the factory
network. OPC Twin provides discovery, registration, and synchronous remote control of industrial devices
through REST APIs.
OPC Publisher is an Azure IoT Edge module that connects to existing OPC UA servers and publishes telemetry
data from OPC UA servers in OPC UA PubSub format, in both JSON and binary.
OPC Vault is a microservice that can configure, register, and manage certificate lifecycle for OPC UA server and
client applications in the cloud.
Discovery Services is an Azure IoT Edge module that supports network scanning and OPC UA discovery.
The Microsoft Azure IIoT solution also contains a number of services, REST APIs, deployment scripts, and
configuration tools that you can integrate into your IIoT analytics solution. These are open source and available on
GitHub.

Edge Workloads
The ability to run custom or third-party modules at the edge is important.
If you want to respond to emergencies as quickly as possible, you can run anomaly detection or Machine
Learning modules in tight control loops at the edge.
If you want to reduce bandwidth costs and avoid transferring terabytes of raw data, you can clean and
aggregate the data locally then send only the insights to the cloud for analysis.
If you want to convert legacy industrial protocols, you can develop a custom module or purchase a third-party
module for protocol translation.
If you want to quickly respond to an event on the factory floor, you can use an edge module to detect that event
and another module to respond to it.
Microsoft and our partners have made available on the Azure Marketplace a number of edge modules, which can
be used in your IIoT analytics solution. Protocol and identity translation are the most common edge workloads
used within an IIoT analytics solution. In the future, expect to see other workloads such as closed loop control using
edge ML models.

Connecting to Historians
A common pattern when developing an IIoT analytics solution is to connect to a process historian and stream real-
time data from the historian to Azure IoT Hub. How this is done will depend on which protocols are installed and
accessible (that is, not blocked by firewalls) on the historian.

The options for each protocol available on the historian are as follows:

OPC UA
- Use Azure IoT Edge, along with OPC Publisher, OPC Twin, and OPC Vault, to send OPC UA data over MQTT to
  IoT Hub. OPC Twin also has support for the OPC UA HDA profile, useful for obtaining historical data.
- Use a third-party Azure IoT Edge OPC UA module to send OPC UA data over MQTT to IoT Hub.

OPC DA
- Use third-party software to convert OPC DA to OPC UA and send OPC UA data to IoT Hub over MQTT. Or use
  OPC Publisher, OPC Twin, and OPC Vault to send OPC UA data over MQTT to IoT Hub.

Web Service
- Use a custom Azure IoT Edge HTTP module to poll the web service.
- Use third-party software that supports HTTP to MQTT 3.1.1 or AMQP 1.0.

MQTT 3.1.1 (can publish MQTT messages)
- Connect the historian directly to Azure IoT Hub using MQTT.
- Connect the historian to Azure IoT Edge as a leaf device. See the Transparent gateway pattern.

Other
- Use a custom Azure IoT Edge module.
- Use third-party software to convert to MQTT 3.1.1 or AMQP 1.0.

A number of Microsoft partners have developed protocol and identity translation modules or solutions that are
available on the Azure Marketplace.
Some historian vendors also provide first-class capabilities to send data to Azure.

- OSIsoft PI: PI Integrator for Azure
- Honeywell: Uniformance Cloud Historian

Once real-time data streaming has been established between your historian and Azure IoT Hub, it is important to
export your historian's historical data and import it into your IIoT analytics solution. For guidance on how to
accomplish this, see Historical Data Ingestion.

Cloud Gateway
A cloud gateway provides a cloud hub for devices and field gateways to connect securely to the cloud and send
data. It also provides device management capabilities. For the cloud gateway, we recommend Azure IoT Hub. IoT
Hub is a hosted cloud service that ingests events from devices and IoT Edge gateways. IoT Hub provides secure
connectivity, event ingestion, bidirectional communication, and device management. When IoT Hub is combined
with the Azure Industrial IoT components, you can control your industrial devices using cloud-based REST APIs.
IoT Hub supports the following protocols:
MQTT 3.1.1,
MQTT over WebSockets,
AMQP 1.0,
AMQP over WebSockets, and
HTTPS.
If the industrial device or system supports any of these protocols, it can send data directly to IoT Hub. In most
industrial environments, this is not permissible because of PCN firewalls and network security policies (ISA 95, ISA
99). In such cases, an Azure IoT Edge field gateway can be installed in a DMZ between the PCN and the Internet.

Next steps
To learn the services recommended for this architecture, continue reading the series with Services in an IIoT
analytics solution.
Services in an IIoT analytics solution

Building on the architectural components in the recommended Azure Industrial IoT analytics solution, this article
discusses the subsystems and Azure services that can be used in such a solution. Your solution may not use all
these services or may have additional services.

Time Series Service


A time series service will be required to provide a warm and cold store for the JSON-based industrial time series
data. We recommend using Time Series Insights (or TSI) as the time series service for the following reasons:
Ability to ingest streaming data directly from Azure IoT Hub or Azure Event Hub.
Multi-layered storage solution with warm and cold analytics support. It provides you the option to route
data between warm and cold, for interactive analytics over warm data as well as operational intelligence
over decades of historical data.
A highly interactive warm analytics solution to perform frequent and large numbers of queries over
shorter time spans of data.
A scalable, performant, and cost-optimized time series data lake based on Azure Storage allowing
customers to trend years' worth of time series data.
Asset hierarchy support that describes the domain and metadata associated with the derived and raw
signals from industrial assets and devices. Multiple hierarchies can be defined reflecting different
departments within your company. For instance, a factory hierarchy will look different from an Oil & Gas
hierarchy.
Flexible analytics platform to store historical time series data in your own Azure Storage account, thereby
allowing you to have ownership of your industrial IoT data. Data is stored in open source Apache Parquet
format that enables connectivity and interoperability across a variety of data scenarios including predictive
analytics, machine learning, and other custom computations done using familiar technologies including
Spark, Databricks, and Jupyter.
Rich analytics with enhanced query APIs and a user experience that combines asset-based data insights with
rich, ad hoc data analytics, with support for interpolation, scalar and aggregate functions, categorical
variables, scatter plots, and time-shifting of time series signals for in-depth analysis.
Enterprise grade platform to support the scale, performance, security, and reliability needs of our industrial
IoT customers.
Ability to view and export the data in CSV format for further analysis.
Ability to share analysis across your organization. Current visualizations can be saved and shared.
Alternatively, the current state of each user's analysis is assigned a unique identifier. This identifier is placed
in the URL allowing users to easily share their analysis by placing the URL in emails, documents, etc.
Because TSI pricing is not seat-based, this democratizes the data by allowing anyone with access to the data
to see it.
You can continue to use Time Series Insights with the old pricing model (S1, S2) for warm, unplanned data analysis.
However, we highly encourage you to use the updated offering (PAYG) as it offers many new capabilities as well as
Pay-As-You-Go pricing, cost savings, and flexibility to scale. Alternatively, Azure Data Explorer or Cosmos DB may
be used as a time series database if you plan to develop a custom time series service.
NOTE
When connecting Time Series Insights with IoT Hub or Event Hub, ensure you select an appropriate Time Series ID. We
recommend using a SCADA tag name field or OPC UA node id (for example, nsu=http://msft/boiler;i=#####) if possible, as
these will map to leaf nodes in your Time Series Model.

The data in Time Series Insights is stored in your Azure Blob Storage account (bring your own storage account) in
Parquet file format. It is your data after all!
You can query your data in Time Series Insights using:
Time Series Insights Explorer
Query API
REST API
Power BI
Any of your favorite BI and analytics tools (for example, Spark, Databricks, Azure Notebooks) by accessing the
Parquet files in your Azure blob storage account
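As an example of the last option above, the following sketch reads Time Series Insights Parquet files from your own storage account into a pandas DataFrame. The connection string, container, and blob prefix are placeholders; the actual folder layout depends on your Time Series Insights environment.

```python
# Minimal sketch: read TSI Parquet files from Blob Storage into pandas
# (pip install azure-storage-blob pandas pyarrow). All names are placeholders.
import io
import pandas as pd
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str="<storage-account-connection-string>",
    container_name="<tsi-cold-store-container>",
)

frames = []
for blob in container.list_blobs(name_starts_with="V=1/PT=Time"):  # hypothetical prefix
    if blob.name.endswith(".parquet"):
        data = container.download_blob(blob.name).readall()
        frames.append(pd.read_parquet(io.BytesIO(data)))

df = pd.concat(frames, ignore_index=True)
print(df.head())
```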

Microservices
Your IIoT analytics solution will require a number of microservices to perform functions such as:
Providing HTTP REST APIs to support your web application.
We recommend creating HTTP-triggered Azure Functions to implement your APIs.
Alternatively, you can develop and host your REST APIs using Azure Service Fabric or Azure Kubernetes
Service (AKS).
Providing an HTTP REST API interface to your factory floor OPC UA servers (for example, using Azure
Industrial IoT components consisting of OPC Publisher, OPC Twin and OPC Vault) to provide discovery,
registration, and remote control of industrial devices.
For hosting the Azure Industrial IoT microservices, we recommend using Azure Kubernetes Service
(AKS). See Deploying Azure Industrial IoT Platform to understand the various deployment options.
Performing data transformation such as converting binary payloads to JSON or differing JSON payloads to
a common, canonical format.
We recommend creating Azure Functions connected to IoT Hub to perform payload transformations.
Different industrial equipment vendors will send telemetry in different payload formats (JSON,
binary, and so on) and schemas. When possible, we recommend converting the different equipment
schemas to a common, canonical schema, ideally based on an industry standard.
If the message body is binary, use an Azure Function to convert the incoming messages to JSON and
send the converted messages back to IoT Hub or to Event Hub.
When the message body is binary, IoT Hub message routing cannot be used against the message
body, but can be used against message properties.
The Azure Industrial IoT components include the capability to decode OPC UA binary messages to
JSON.
A Data Ingest Administration service for updating the list of tags monitored by your IIoT analytics solution.
A Historical Data Ingestion service for importing historical data from your SCADA, MES, or historian into
your solution.
Your solution will likely involve additional microservices to satisfy the specific requirements of your IIoT analytics
solution. If your organization is new to building microservices, we recommend implementing custom
microservices using Azure Functions. Azure Functions is an event-driven serverless compute platform that can be
used to develop microservices and solve complex orchestration problems. It allows you to build and debug locally
(in several software languages) without additional setup, deploy and operate at scale in the cloud, and integrate
Azure services using triggers and bindings.
Both stateless and stateful microservices can be developed using Azure Functions. Azure Functions can use
Cosmos DB, Table Storage, Azure SQL, and other databases to store stateful information.
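As an illustration of an HTTP-triggered Azure Function serving such an API, the following sketch uses the Python v2 programming model; the route, the returned fields, and the static payload are hypothetical, and a real implementation would query Time Series Insights, Cosmos DB, or another store.

```python
# Minimal sketch: an HTTP-triggered Azure Function exposing a REST endpoint for
# the web application, using the Python v2 programming model
# (pip install azure-functions). Route and fields are hypothetical.
import json
import azure.functions as func

app = func.FunctionApp()

@app.route(route="assets/{asset_id}", methods=["GET"], auth_level=func.AuthLevel.FUNCTION)
def get_asset(req: func.HttpRequest) -> func.HttpResponse:
    asset_id = req.route_params.get("asset_id")
    # A real solution would look the asset up in a data store; this static
    # payload is for illustration only.
    payload = {"assetId": asset_id, "status": "running", "oee": 0.87}
    return func.HttpResponse(
        json.dumps(payload),
        mimetype="application/json",
        status_code=200,
    )
```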
Alternatively, if your organization has previous experience building container-based microservices, we
recommend also considering Azure Service Fabric or Azure Kubernetes Service (AKS). Refer to Microservices in
Azure for more information.
Regardless of your microservices platform choice, we recommend using Azure API Management to create
consistent and modern API gateways for your microservices. API Management helps abstract, publish, secure, and
version your APIs.
Data Ingest Administration
We recommend developing a Data Ingest Administration service to add/update the list of tags monitored by your
IIoT analytics solution.
SCADA tags are variables mapped to I/O addresses on a PLC or RTU. Tag names vary from organization to
organization but often follow a naming pattern. As an example, tag names for a pump with tag number 14P103
located in STN001 (Station 001) have these statuses:
STN001_14P103_RUN
STN001_14P103_STOP
STN001_14P103_TRIP
As new tags are created in your SCADA system, the IIoT analytics solution must become aware of these tags and
subscribe to them in order to begin collecting data from them. In some cases, the IIoT analytics solution may not
subscribe to certain tags as the data they contain may be irrelevant.
If your SCADA system supports OPC UA, new tags should appear as new NodeIDs in the OPC UA hierarchy. For
example, the above tag names may appear as:
ns=2;s= STN001_14P103_RUN
ns=2;s= STN001_14P103_STOP
ns=2;s= STN001_14P103_TRIP
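To put this in context, OPC Publisher is typically told which nodes to sample through a published-nodes configuration file. A hedged sketch using the node IDs above might look like the following; the endpoint URL and intervals are placeholders, and the exact file schema depends on your OPC Publisher version.

```json
[
  {
    "EndpointUrl": "opc.tcp://scada-server.contoso.local:4840",
    "UseSecurity": true,
    "OpcNodes": [
      { "Id": "ns=2;s=STN001_14P103_RUN",  "OpcSamplingInterval": 1000, "OpcPublishingInterval": 5000 },
      { "Id": "ns=2;s=STN001_14P103_STOP", "OpcSamplingInterval": 1000, "OpcPublishingInterval": 5000 },
      { "Id": "ns=2;s=STN001_14P103_TRIP", "OpcSamplingInterval": 1000, "OpcPublishingInterval": 5000 }
    ]
  }
]
```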
We recommend developing a workflow that informs the administrators of the IIoT analytics solution when new
tags are created, or existing tags are edited in the SCADA system. At the end of the workflow, the OPC Publisher is
updated with the new/updated tags.
To accomplish this, we recommend developing a workflow that involves Power Apps, Logic Apps, and Azure
Functions, as follows:
The SCADA system operator can trigger the Logic Apps workflow using a Power Apps form whenever tags are
created or edited in the SCADA system.
Alternatively, Logic Apps connectors can monitor a table in the SCADA system database for tag changes.
The OPC UA Discovery service can be used to both find OPC UA servers and the tags and methods they
implement.
The Logic Apps workflow includes an approval step where the IIoT analytics solution owners can approve the
new/updated tags.
Once the new/updated tags are approved and a frequency assigned, the Logic App calls an Azure Function.
The Azure function calls the OPC Twin microservice, which directs the OPC Publisher module to subscribe to the
new tags.
A sample can be found here.
If your solution involves third-party software, instead of OPC Publisher, configure the Azure Function to
call an API running on the third-party software either directly or using an IoT Hub Direct Method.
Alternatively, Microsoft Forms and Microsoft Flow can be used in place of Power Apps and Logic Apps.
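To give a flavor of the step where the Azure Function directs an edge module (OPC Publisher, OPC Twin, or third-party software) to act, the following sketch invokes an IoT Hub direct method on an IoT Edge module using the azure-iot-hub service SDK. The device ID, module ID, method name, and payload are hypothetical and depend on the module you are targeting.

```python
# Minimal sketch: invoke a direct method on an IoT Edge module from an Azure
# Function or microservice, using the azure-iot-hub service SDK
# (pip install azure-iot-hub). Device, module, method name, and payload are
# hypothetical.
from azure.iot.hub import IoTHubRegistryManager
from azure.iot.hub.models import CloudToDeviceMethod

IOTHUB_CONNECTION_STRING = "<iot-hub-service-connection-string>"
DEVICE_ID = "edge-gateway-001"
MODULE_ID = "opcpublisher"

method = CloudToDeviceMethod(
    method_name="PublishNodes",          # hypothetical method name
    payload={
        "EndpointUrl": "opc.tcp://scada-server.contoso.local:4840",
        "OpcNodes": [{"Id": "ns=2;s=STN001_14P103_RUN"}],
    },
    response_timeout_in_seconds=30,
)

registry_manager = IoTHubRegistryManager.from_connection_string(IOTHUB_CONNECTION_STRING)
response = registry_manager.invoke_device_module_method(DEVICE_ID, MODULE_ID, method)
print(response.status, response.payload)
```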
Historical Data Ingestion
Years of historical data likely exists in your current SCADA, MES, or historian system. In most cases, you will want
to import your historical data into your IIoT analytics solution.
Loading historical data into your IIoT analytics solution consists of three steps:
1. Export your historical data.
a. Most SCADA, MES, or historian systems have some mechanism that allows you to export your historical
data, often as CSV files. Consult your system's documentation on how best to do this.
b. If there is no export option in your system, consult the system's documentation to determine if an API
exists. Some systems support HTTP REST APIs or OPC Historical Data Access (HDA). If so, build an
application or use a Microsoft partner solution that connects to the API, queries for the historical data,
and saves it to a file in formats such as CSV, Parquet, TSV, and so on.
2. Upload to Azure.
a. If the aggregate size of the exported data is small, you can upload the files to Azure Blob Storage over
the internet using AzCopy.
b. If the aggregate size of the exported data is large (tens or hundreds of TBs), consider using Azure
Import/Export Service or Azure Data Box to ship the files to the Azure region where your IIoT analytics
solution is deployed. Once received, the files will be imported into your Azure Storage account.
3. Import your data.
a. This step involves reading the files in your Azure Storage account, serializing the data as JSON, and
sending data as streaming events into Time Series Insights. We recommend using an Azure Function to
perform this.
b. Time Series Insights only supports IoT Hub and Event Hub as data sources. We recommend using an
Azure Function to send the events to a temporary Event Hub, which is connected to Time Series Insights.
c. Refer to How to shape JSON events and Supported JSON shapes for best practices on shaping your
JSON payload.
d. Make sure to use the same Time Series ID as you do for your streaming data.
e. Once this process is completed, the Event Hub and Azure Function may be deleted. This is an optional
step.
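A minimal sketch of steps 3a and 3b might look like the following, assuming the exported data landed as CSV blobs. The storage and Event Hub names and the CSV column names are placeholders, and the JSON shape should use the same Time Series ID as your streaming data.

```python
# Minimal sketch: read exported CSV files from Blob Storage and replay them as
# JSON events into an Event Hub connected to Time Series Insights
# (pip install azure-storage-blob azure-eventhub). All names are placeholders.
import csv
import io
import json
from azure.storage.blob import ContainerClient
from azure.eventhub import EventHubProducerClient, EventData

container = ContainerClient.from_connection_string(
    conn_str="<storage-connection-string>", container_name="historian-export"
)
producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hub-connection-string>", eventhub_name="tsi-ingest"
)

with producer:
    for blob in container.list_blobs(name_starts_with="export/"):
        text = container.download_blob(blob.name).readall().decode("utf-8")
        batch = producer.create_batch()
        for row in csv.DictReader(io.StringIO(text)):
            # Keep the same Time Series ID (for example, the SCADA tag name)
            # as the streaming data so historical and live data line up.
            event = {"tag": row["tag"], "timestamp": row["timestamp"], "value": float(row["value"])}
            try:
                batch.add(EventData(json.dumps(event)))
            except ValueError:            # batch full: send it and start a new one
                producer.send_batch(batch)
                batch = producer.create_batch()
                batch.add(EventData(json.dumps(event)))
        if len(batch) > 0:
            producer.send_batch(batch)
```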

NOTE
Exporting large volumes of data from your industrial system (for example, SCADA or historian) may place a significant
performance load on that system, which can negatively impact operations. Consider exporting smaller batches of historical
data to minimize performance impacts.

Rules and Calculation Engine


Your IIoT analytics solution may need to perform near real-time (low latency) calculations and complex event
processing (or CEP) over streaming data, before it lands in a database. For example, calculating moving averages
or calculated tags. This is often referred to as a calculations engine. Your solution may also need to trigger actions
(for example, display an alert) based on the streaming data. This is referred to as a rules engine.
We recommend using Time Series Insights for simple calculations, at query time. The Time Series Model
introduced with Time Series Insights supports a number of formulas including: Avg, Min, Max, Sum, Count, First,
and Last. The formulas can be created and applied using the Time Series Insights APIs or Time Series Insights
Explorer user interface.
For example, a Production Manager may want to calculate the average number of widgets produced on a
manufacturing line, over a time interval, to ensure productivity goals are met. In this example, we would
recommend the Production Manager to use the Time Series Insights explorer interface to create and visualize the
calculation. Or if you have developed a custom web application, it can use the Time Series Insights APIs to create
the calculation, and the Azure Time Series Insights JavaScript SDK (or tsiclient) to display the data in your custom
web application.
For more advanced calculations and/or to implement a rules engine, we recommend using Azure Stream Analytics.
Azure Stream Analytics is a real-time analytics and complex event-processing engine, that is designed to analyze
and process high volumes of fast streaming data from multiple sources simultaneously. Patterns and relationships
can be identified in information extracted from a number of input sources including devices, sensors, click streams,
social media feeds, and applications. These patterns can be used to trigger actions and initiate workflows such as
creating alerts, feeding information to a reporting tool, or storing transformed data for later use.
For example, a Process Engineer may want to implement a more complex calculation such as calculating the
standard deviation (SDEV) of the widgets produced across a number of production lines to determine when any
line is more than 2x beyond the mean over a period of time. In this example, we recommend using Stream
Analytics, with a custom web application. The Process Engineer authors the calculations using the custom web
application, which calls the Stream Analytics REST APIs to create and run these calculations (also known as Jobs).
The Job output can be sent to an Event Hub, connected to Time Series Insights, so the result can be visualized in
Time Series Insights explorer.
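For illustration only, the query for such a job might look like the following sketch, written in the Stream Analytics query language; the input, output, column names, and window size are hypothetical.

```sql
-- Hedged sketch of a Stream Analytics job query: compute the average and
-- standard deviation of widget counts per production line over a 15-minute
-- tumbling window, and write the result to an Event Hub output that is
-- connected to Time Series Insights. Names are hypothetical.
SELECT
    LineId,
    AVG(WidgetCount)   AS AvgWidgets,
    STDEV(WidgetCount) AS SdevWidgets,
    System.Timestamp() AS WindowEnd
INTO [eventhub-to-tsi]
FROM [iothub-input] TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY LineId, TumblingWindow(minute, 15)
```

A downstream query step or the web application could then flag windows where a line's output drifts more than two standard deviations from the mean.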
Similarly, for a Rules Engine, a custom web application can be developed that allows users to author alerts and
actions. The web application creates associated Jobs in Azure Stream Analytics using the Stream Analytics REST API.
To trigger actions, a Stream Analytics Job calls an Azure Function output. The Azure Function can call a Logic App or
Power Automate task that sends an Email alert or invokes Azure SignalR to display a message in the web
application.
Azure Stream Analytics supports processing events in CSV, JSON, and Avro data formats while Time Series Insights
supports JSON. If your payload does not meet these requirements, consider using an Azure Function to perform
data transformation prior to sending the data to Stream Analytics or Time Series Insights (using IoT Hub or Event
Hubs).
Azure Stream Analytics also supports reference data, a finite data set that is static or slowly changing in nature,
used to perform a lookup or to augment your data streams. A common scenario is exporting asset metadata from
your Enterprise Asset Management system and joining it with real-time data coming from those industrial devices.
Stream Analytics is also available as a module on the Azure IoT Edge runtime. This is useful for situations where
complex event processing needs to happen at the Edge. As an alternative to Azure Stream Analytics, near real-time
Calculation and Rules Engines may be implemented using Apache Spark Streaming on Azure Databricks.

Notifications
Since the IIoT analytics solution is not a control system, it does not require a complete Alarm Management system.
However, there will be cases where you will want the ability to detect conditions in the streaming data and
generate notifications or trigger workflows. Examples include:
temperature of a heat exchanger exceeding a configured limit, which changes the color of an icon in your web
application,
an error code sent from a pump, which triggers a work order in your ERP system, or
the vibration of a motor exceeding limits, which triggers an email notification to an Operations Manager.
We recommend using Azure Stream Analytics to define and detect conditions in the streaming data (refer to the
rules engine mentioned earlier). For example, a Plant Manager implements an automated workflow that runs
whenever an error code is received from any equipment. In this example, your custom web application can use the
Stream Analytics REST API to provide a user interface for the Plant Manager to create and run a job that monitors
for specific error codes.
For defining an alert (email or SMS) or triggering a workflow, we recommend using Azure Logic Apps. Logic Apps
can be used to build automated, scalable workflows, business processes, and enterprise orchestrations to integrate
your equipment and data across cloud services and on-premises systems.
We recommend connecting Azure Stream Analytics with Azure Logic Apps using Azure Service Bus. In the previous
example, when an error code is detected by Stream Analytics, the job will send the error code to an Azure Service
Bus queue output. A Logic App will be triggered to run whenever a message is received on the queue. This Logic
App will then perform the workflow defined by the Plant Manager, which may involve creating a work order in
Dynamics 365 or SAP, or sending an email to a maintenance technician. Your web application can use the Logic Apps
REST API to provide a user interface for the Plant Manager to author workflows or these can be built using the
Azure portal authoring experience.
To display visual alerts in your web application, we recommend creating an Azure Stream Analytics job to detect
specific events and send those to either:
An Event Hub output - Then connect the Event Hub to Time Series Insights. Use the Azure Time Series
Insights JavaScript SDK (tsiclient) to display the event in your web application.
or,
An Azure Functions output - Then develop an Azure Function that sends the events to your web application
using SignalR.
Operational alarms and events triggered on premises can also be ingested into Azure for reporting and to trigger
work orders, SMS messages, and emails.

Microsoft 365
The IIoT analytics solution can also include Microsoft 365 services to automate tasks and send notifications. The
following are a few examples:
Receive email alerts in Microsoft Outlook or post a message to a Microsoft Teams channel when a condition is
met in Azure Stream Analytics.
Receive notifications as part of an approval workflow triggered by a Power App or Microsoft Forms submission.
Create an item in a SharePoint list when an alert is triggered by a Logic App.
Notify a user or execute a workflow when a new tag is created in a SCADA system.

Machine Learning
Machine learning models can be trained using your historical industrial data, enabling you to add predictive
capabilities to your IIoT application. For example, your Data Scientists may be interested in using the IIoT analytics
solution to build and train models that can predict events on the factory floor or indicate when maintenance
should be conducted on an asset.
For building and training machine learning models, we recommend Azure Machine Learning. Azure Machine
Learning can connect to Time Series Insights data stored in your Azure Storage account. Using the data, you can
create and train forecasting models in Azure Machine Learning. Once a model has been trained, it can be deployed
as a web service on Azure (hosted on Azure Kubernetes Service or Azure Functions, for example) or to an Azure
IoT Edge field gateway.
For those new to machine learning or organizations without Data Scientists, we recommend starting with Azure
Cognitive Services. Azure Cognitive Services are APIs, SDKs, and services available to help you build intelligent
applications without having formal AI or data science skills or knowledge. Azure Cognitive Services enable you to
easily add cognitive features into your IIoT analytics solution. The goal of Azure Cognitive Services is to help you
create applications that can see, hear, speak, understand, and even begin to reason. The catalog of services within
Azure Cognitive Services can be categorized into five main pillars - Vision, Speech, Language, Web Search, and
Decision.

Asset Hierarchy
An asset hierarchy allows you to define hierarchies for classifying your assets, for example, Country > Location >
Facility > Room. They may also contain the relationship between your assets. Many organizations maintain asset
hierarchies within their industrial systems or within an Enterprise Asset Management (EAM) system.
The Time Series Model in Azure Time Series Insights provides asset hierarchy capabilities. Through the use of
Instances, Types and Hierarchies, you can store metadata about your industrial devices, as shown in the image
below.

If possible, we recommend exporting your existing asset hierarchy and importing it into Time Series Insights using
the Time Series Model APIs. We recommend periodically refreshing it as updates are made in your Enterprise Asset
Management system.
In the future, asset models will evolve to become digital twins, combining dynamic asset data (real-time telemetry),
static data (3D models, metadata from Asset Management Systems), and graph-based relationships, allowing the
digital twin to change in real-time along with the physical asset.
Azure Digital Twins is an Azure IoT service that provides the ability to:
Create comprehensive models of physical environments,
Create spatial intelligence graphs to model the relationships and interactions between people, places, and
devices,
Query data from a physical space rather than disparate sensors, and
Build reusable, highly scalable, spatially aware experiences that link streaming data across the physical and
digital world.

Business Process Integration


In some instances, you will want your IIoT analytics solution to perform actions based on insights from your
industrial data. This can include raising alarms, sending email, sending SMS messages, or triggering a workflow in
your line-of-business systems (for example, CRM, ERP, and so on). We recommend using Azure Logic Apps to
integrate your IIoT analytics solution with your line-of-business systems. Azure Logic Apps has a number of
connectors to business systems and Microsoft services such as:
Dynamics 365
SharePoint Online
Office 365 Outlook
Salesforce
SAP
For example, an error code from a pump is detected by an Azure Stream Analytics job. The job sends a message to
Azure Service Bus and triggers a Logic App to run. The Logic App sends an email notification to the Plant Manager
using the Office 365 Outlook connector. It then sends a message to your SAP S/4 HANA system using the SAP
connector, which creates a Service Order in SAP.

User Management
User management involves managing user profiles and controlling what actions a user can perform in your IIoT
analytics solution. For example, what asset data can a user view, or whether the user can create conditions and
alerts. This is frequently referred to as role-based access control (RBAC).
We recommend implementing role-based access control using the Microsoft identity platform along with Azure
Active Directory. In addition, the Azure PaaS services mentioned in this IIoT analytics solution can integrate directly
with Azure Active Directory, thereby ensuring security across your solution.
Your web application and custom microservices can also integrate with the Microsoft identity platform using
libraries such as Microsoft Authentication Library (or MSAL) and protocols such as OAuth 2.0 and OpenID Connect.
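For example, a custom microservice calling downstream APIs such as Microsoft Graph might acquire tokens with MSAL, as in the following sketch (MSAL for Python shown; the tenant ID, client ID, secret, and scope are placeholders):

```python
# Minimal sketch: a confidential client (custom microservice) acquiring an
# access token from Azure Active Directory using MSAL for Python
# (pip install msal). All identifiers and secrets are placeholders.
import msal

TENANT_ID = "<tenant-id>"
CLIENT_ID = "<app-registration-client-id>"
CLIENT_SECRET = "<client-secret>"

app = msal.ConfidentialClientApplication(
    client_id=CLIENT_ID,
    client_credential=CLIENT_SECRET,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
)

# Acquire a token for Microsoft Graph (for example, to manage user profiles).
result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])

if "access_token" in result:
    token = result["access_token"]   # attach as a Bearer token on downstream calls
else:
    print(result.get("error"), result.get("error_description"))
```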
User management also involves operations such as:
creating a new user,
updating a user's profile, such as their location and phone number,
changing a user's password, and
disabling a user's account.
For these operations, we recommend using the Microsoft Graph.
Next steps
Data visualization is the backbone of a well-defined analytics system. Learn about the data visualization techniques
that you can use with the IIoT analytics solution recommended in this series.
Data analysis in Azure Industrial IoT analytics solution

This article shows you how to visualize the data collected by the Azure Industrial IoT analytics solution. You can
easily look for data trends using visual elements and dashboards, and use these trends to analyze the effectiveness
of your solution.

Visualization
There are many options for visualizing your industrial data. Your IIoT analytics solution may use some or all of
these options, depending on the personas using your solution.
For Process Engineers and other personas looking to perform ad-hoc analytics and trend visualizations, we
recommend using Azure Time Series Insights explorer.
For Plant Managers and other personas wanting to develop dashboards, we recommend using Power BI, and
connecting Power BI with your data in Time Series Insights using the Power BI connector. Using Power BI, these
users can also combine external data from your ERP, EAM, or other systems with the data in Time Series
Insights.
For advanced visualizations, such as schematic views and process graphics, we recommend a custom web
application.
For Data Scientists interested in using open source data analysis and visualization tools such as Python, Jupyter
Notebooks, and Matplotlib, we recommend Azure Notebooks.

Data Trends
The Azure Time Series Insights explorer is a web application that provides powerful data trending and visualization
capabilities that make it simple to explore and analyze billions of IIoT events simultaneously.
Time Series Insights Explorer is ideally suited to personas, such as a Process Engineer or Operations Manager, who
want to explore, analyze and visualize the raw data coming from your industrial systems. The insights gained from
exploring the raw data can help build Azure Stream Analytics jobs, which look for conditions in the data or perform
calculations over the data.
The Azure Time Series Insights explorer allows you to seamlessly explore both warm and cold data, or your
historical data, as demonstrated in the following figure.
Azure Time Series Insights explorer has a powerful yet intuitive user interface, as shown below.

Key features of Azure Time Series Insights explorer:


1. Environment panel - Displays all your Azure Time Series Insights environments.
2. Navigation bar - Lets you switch between the Analyze and Model pages.
3. Hierarchy tree and search panel - Lets you select and search for specific data elements to be charted.
4. Time series well - Shows all your currently selected data elements.
5. Chart panel - Displays your current working chart.
6. Timeline - Lets you modify your working time span.
7. App bar - Contains your user management options (such as the current tenant), and allows you to change
theme and language settings.

Dashboards
For some personas, such as a Plant Manager, dashboards containing factory or plant KPIs and visualizations are
more important than viewing the raw data. For such users, we recommend Power BI as the visualization solution.
You can connect Power BI with your data stored in Time Series Insights, providing you with powerful reporting and
dashboard capabilities over your industrial data, and allowing you to share insights and results across your
organization.
By connecting your data to Power BI, you can:
Perform correlations with other data sources supported by Power BI and access a host of different data
visualization options.
Create Power BI dashboards and reports using your Time Series Insight data and share them with your
organization.
Unlock data interoperability scenarios in a simple, easy-to-use manner, and get to insights faster than ever.
Modify Time Series Insights data within Power BI using the powerful Advanced Editor.

Schematic Views
For advanced visualizations, such as schematic views or process graphics, you may require a custom web
application. A custom web application also allows you to provide a single pane of glass user experience and other
advanced capabilities including:
a simplified and integrated authoring experience for Azure Stream Analytics jobs and Logic Apps,
displaying real-time data using process or custom visuals,
displaying KPIs and external data with embedded Power BI dashboards,
displaying visual alerts using SignalR, and
allowing administrators to add/remove users from the solution.
We recommend building a Single Page Application (SPA) using:
JavaScript, HTML5, and CSS3
Time Series Insights JavaScript SDK for displaying process or custom visuals with data from Time Series
Insights
MSAL.js to sign in users and acquire tokens to use with the Microsoft Graph
Azure App Service Web Apps to host the web application
Power BI to embed Power BI dashboards directly in the web app
Azure Maps to render map visualizations
Microsoft Graph SDK for JavaScript to integrate with Microsoft 365

Notebooks
One of the advantages of moving operational data to the cloud is to take advantage of modern big-data tool sets.
One of the most common tools used by Data Scientists for ad-hoc analysis of big data are Jupyter Notebooks.
Jupyter (formerly IPython) is an open-source project that lets you easily combine Markdown text, executable code,
persistent data, graphics, and visualizations onto a single, sharable canvas - the notebook. Production Engineers
should also consider learning Jupyter Notebooks technology to assist in analysis of plant events, finding
correlations, and so on. Jupyter Notebooks provide support for Python 2/3, R, and F# programming languages and
can connect to your Time Series Insights data stored in Azure Storage.

Next steps
Now that you have learned the architecture of an Azure IIoT analytics solution, read the architectural
considerations that improve the resiliency and efficiency of this architecture.
Architectural Considerations in an IIoT Analytics
Solution

The Microsoft Azure Well-Architected Framework describes some key tenets of a good architectural design.
Keeping in line with these tenets, this article describes the considerations in the reference Azure Industrial IoT
analytics solution that improve its performance and resiliency.

Performance Considerations
Azure PaaS Services
All Azure PaaS services have the ability to scale up and/or out. Some services will do this automatically (for
example, IoT Hub, Azure Functions in a Consumption plan).
