I&A Tech Solution Architecture Guidelines

This document provides guidelines for architecting Azure solutions for Unilever. It outlines strategies for data lakes, environment provisioning, naming conventions, and approved Azure components. It also discusses design patterns, information security practices, and cost optimization. The goal is to help Unilever build scalable, secure and cost-effective data solutions on Azure.


I&A Azure Solution Architecture Guidelines

I&A Azure Solution Architecture Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3


Section 1 - Solution Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Section 1.1 - Data Lake Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Unilever Data Lake Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Business Data Lake (BDL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Product Data Stores (PDS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Experiment and Market Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Section 1.2 - Environment Provisioning Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
UDL & BDL Environment Provisioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Product Environment Provisioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Experiment Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Market Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Section 1.3 - Naming Convention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Section 1.4 - File Management Tool - FMT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Section 1.5 - HA (High Availability) & DR (Disaster Recovery) . . . . . . . . . . . . . . . . . . . . . . . . 42
Section 1.6 - Data Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Deletion Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Section 2 - Approved Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Section 2.1 - Azure Databricks (ADB) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Design Standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Databricks Best Practices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Delta handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Databricks Delta to DW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Databricks Cluster Pools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Section 2.2 - Azure Data Factory V2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Azure Data Factory Best Practices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Ingestion Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Anaplan Integration with Azure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
ADF Job Triggers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
ADF Data Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Section 2.3 - Azure BLOB Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Section 2.4 - Azure Data Lake Storage (ADLS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Section 2.5 - Azure Analysis Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Azure Analysis Services - Best Practices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Webhooks for AAS cube refreshes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Section 2.6 - Azure Data Warehouse/Azure Synapse Analytics . . . . . . . . . . . . . . . . . . . . . . . 161
Section 2.7 - Azure Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Section 2.8 - Azure Monitor & Log Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Section 2.9 - Azure App Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Web app connection to Azure SQL DB using MSI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Section 2.10 - Azure Logic App . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Section 2.11 - Microsoft PowerApp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
Section 2.12 - Azure Cache for Redis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Section 2.13 - Power BI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Power BI performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
Livewire Learnings - Power BI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Section 2.14 - Azure Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Section 3 - Design Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
Section 3.1 - Streaming Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Section 3.2 - Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Section 3.3 - Data Distribution Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Section 3.4 - Analytical Product Insights write-back to BDL . . . . . . . . . . . . . . . . . . . . . . . . . . 228
Section 3.5 - Data Preparation/Staging Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
Section 3.6 - Self Service in PDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
Enable MFA on AD Group . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237

Section 3.7 - Global or Country Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Section 3.8 - Job Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
Section 3.9 - Data Integration Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
Section 4 - Information security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Section 4.1 - Environment and data access management . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
UDL & BDL Access Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
Section 4.2 - Encrypting Data-At-Rest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
Section 4.3 - Encrypting Data-in-Transit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
Section 4.4 - Security on SQL Database/DW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
Section 4.5 - Data Lake Audit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
JML Audit Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
Section 5 - Cost Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
Section 5.1 - High Level Estimate - Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
Section 5.2 - Cost Optimization Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
Technical Implementation - Cost Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
Section 6 - New Foundation Design - Azure Foundation 2018 . . . . . . . . . . . . . . . . . . . . . . . . . . 293
Section 6.1 - Express Route – Setup and Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Section 6.2 - I&A Subscription Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
Section 6.3 - Product migration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
Section 7 - New Tool Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
Section 7.1 - Data Share . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
Section 7.2 - Snowflake . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
Section 7.3 - Synapse Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319


I&A Azure Solution Architecture Guidelines


Sections | Link

Solution Architecture | Section 1 - Solution Architecture
Approved Components | Section 2 - Approved Components
Design Patterns | Section 3 - Design Patterns
Information Security | Section 4 - Information security
Cost Management | Section 5 - Cost Management
New Foundation Design - Azure Foundation 2018 | Section 6 - New Foundation Design - Azure Foundation 2018
New Tool Evaluation | Section 7 - New Tool Evaluation

Authors of the document

1. Manasa Sampya
2. Hemalatha B
3. Sumi Nair
4. Indira Kotennavar
5. Vishal Gupta
6. Niranjan Waghulde

Version 9.1, published on 5th August 2020



Section 1 - Solution Architecture


Overview & Background
The Reference Architecture provides a conceptual and logical template for solution development outlining common
services, applications and vocabulary with which to define the scope of developments agreed with the TDA.

This section is broken down into the following solution areas:

Logical Architecture
The Logical Reference Architecture defines the applications approved to support the conceptual services.

Physical Architecture
The Physical Reference Architecture defines the products available to support them.

The Old Foundation Design was the landscape and network design created when Unilever's Azure journey started. Since Unilever is constantly improving the networking and security layers, a new foundation setup/design has been created.


The New Foundation Design includes similar components to the Old Foundation Design, but the platform is more secure and the networking has been improved.

Central UDL - ADLS File Path

Trusted Data - \Unilever\UniversalDataLake


Technical Debt - \Unilever\TechDebt


Project Specific Landing Zone -\Unilever\PSLZ

BDL - ADLS File Path

Business Data Lake - \Unilever\BusinessDataLake


Section 1.1 - Data Lake Strategy


Overview

The Azure platform hosted for Information & Analytics covers four main layers:

Universal Data Lake (UDL)
Business Data Lake (BDL)
Product Data Store (PDS)
Experimental Area

UDL: Universal Data lake

UDL is the single shared capability that will act as the backbone for all analytics work in Unilever
All master and transactional data scattered across business systems
Data from true source systems
Data in its native form (its source form), with time series kept beyond the duration that the source systems maintain

BDL: Business Data Lake (BDL)

Business function-specific data lakes that slice and dice data and perform calculations, summarisation and aggregation (all the operations needed to curate data that is unique to a business function)
Aligned to Unilever functional boundaries such as CD, CMI, Finance, Supply Chain, R&D, HR, etc.
Sharable KPIs and sharable business logic

Product

While the business data lakes provide data curated to specific business areas, cross-business decisions (such as CCBTs or markets) require cross-business processed data to deliver specific KPIs or answer granular questions that drive insights. Products deliver insights via dashboards or data science models built on that data to assist decision making.

Experiment

A time-bound environment provisioned to prove business use cases, which can be analytical or data science. An experiment can take data from the UDL and BDL by following the right governance process.

The next sections cover the use cases of each layer.

Universal Data Lake

As data, both internal and external, grows exponentially, the traditional methods of storing vast volumes of data in a database or data warehouse are no longer sustainable or cost effective.
The data lake is a concept that emerged on the back of big data, in which various types of data are stored in their native form within one system. As with a (water) lake, a data lake is hydrated (populated) via many data streams: master data, enterprise transactional data, social feeds, and 3rd party data that an enterprise wants to co-mingle with internal data, irrespective of the shape (structured and unstructured) and volume of the data. The most significant advantage is that there is no constraint on how big a data lake can be or how many varieties of data can be hydrated into it. Combining data lake storage with new high-performing cloud processing capabilities makes the data lake a desirable and cost-effective foundation for vast enterprises like Unilever to build complex data, information and analytics solutions.

The Unilever data lake strategy is built on a layered approach to ensure we maintain enterprise scale while giving businesses the agility to define analytics solutions leveraging the data lake concept. It comprises three layers: the Trusted Zone, TechDebt and the Project Specific Landing Zone (PSLZ).

The UDL is hosted in Azure Data Lake Store (ADLS). As of today, the UDL, BDL, TechDebt and PSLZ share the same ADLS instance and are organised and managed using different folders.

Changes planned for the future are:

The UDL will be hosted in one ADLS consisting of the Trusted and Tech Debt folders. The PSLZ (project specific landing zone) will exist as part of each PDS / product: although all three layers are logically part of the UDL, the PSLZ will physically sit as part of the respective PDS.

TechDebt Folder (Untrusted Sources)

Datasets which are not from true data sources, or are not in raw form (they may be aggregated), are considered TechDebt, as they are in non-compliance with the Data Lake Strategy and will need to be retrofitted to become compliant (i.e. the source changed to a trusted source, and the data stored following the UDL/BDL strategy).

Technical debt will also be identified, organised, managed and secured by folders.

Project Specific Landing Zone(PSLZ)

Datasets which are local and specific to the product and do not need to be shared across products will reside within a 'Project Specific Landing Zone' folder. Project specific data will be identified, organised, managed and secured by folders.

Business Data Lakes (BDL)

Every business function will have a Business Data Lake – CD, R&D, Finance, Supply Chain, HR, MDM and Marketing.

Data will be read from the UDL, transformed and landed into the BDL using Azure Databricks (ADB); this process
will be orchestrated using Azure Data Factory(ADF).

Business functional specific data lakes:

Are used to create shareable KPIs, calculations, summarisations and aggregations of data sets within that function
Preserve data in its curated form along with any required master data
BDLs will be shared across products that are building analytics capabilities
I&A will own (or own via proxy) the definitions for curated data within the BDL

As each function owns its respective BDL, every data set / KPI ingested into the BDL will be governed, managed and catalogued by the respective BDL team. Data history, retention and granularity will be defined by BDL SMEs based on the global product requirements.

Product Data Store(PDS)

The business data lakes, while providing data that is curated to specific business areas, will not be sufficient to make decisions that are cross-business, such as CCBTs or markets, where cross-business processed data is required to deliver specific KPIs or answer granular questions to drive insights. To facilitate granular and meaningful insights, it is required to develop analytics products where we will deliver insights via dashboards to assist decision making, respond to natural queries to augment decision making, and offer on-demand trusted insights via push data for use in automated decision making. For optimal and timely delivery of insights, the strategy envisions the concept of Product specific data stores (PDS). These PDSs will pull (and push where needed) necessary data from all the associated BDLs or the UDL. We also envision that a multi-geo rollout of a product may have geo-specific PDSs above and beyond a core PDS. The only constraint on the PDS is that the data in one PDS is not shareable with another Product.

In future, if we identify any product specific functionality and data that we can share across other products, we will promote that functionality along with the data to the respective BDLs so that we avoid duplication.
We will preserve the data processed in this layer in its processed form along with any required master data. I&A will build, support and act as custodian of these PDSs and the corresponding products with agility to ensure we deliver new product capabilities at the pace required for business decision making. I&A will own (or own via proxy) the definitions for the data elements within PDS and build the business glossary and data catalogue as the BDLs take shape.

Experimentation Area

The experimentation environment is created for feasibility checks or quick business use case piloting. The environment is time bound, with a validity of 1-3 months and the required TDA and business approvals.

Types of environments:

Analytics Experimentation Environment (Business Reporting)
The analytics experimentation environment should only use the approved Azure PaaS tools.
Data Science Experimentation Environment
The data science experimentation environment can be PaaS or IaaS based on the business use case.
IaaS: the only approved IaaS components are Data Science VMs. No other IaaS components are approved or provided as part of experimentation.
PaaS: approved PaaS components.


Unilever Data Lake Strategy


Overview

The UDL and BDL will be hosted in Azure Data Lake Store (ADLS). The UDL is the single shared capability that will act as the backbone for all analytics work in Unilever.

UDL Data Layers

The Unilever data lake strategy is built on a layered approach to ensure we maintain enterprise scale while giving businesses the agility to define analytics solutions leveraging the data lake concept. It comprises the Trusted Zone, TechDebt and the Project Specific Landing Zone (PSLZ).

The UDL is hosted in Azure Data Lake Store (ADLS). As per the current approach, there is only one ADLS instance, which is used for all layers of the Universal Data Lake; UDL, TechDebt and PSLZ are separated, organised and managed using folders. This design is being reviewed to simplify the PSLZ layer and manage it within the respective PDS: though the PSLZ is logically part of the UDL, for better management it will be hosted as part of each PDS layer.

UDL (TRUSTED ZONE)

The Trusted zone consists of the unfiltered/full data from trusted/true sources in its most granular format. Data from true sources such as ECC will be hosted in the Trusted zone. Some external data, such as Nielsen or retailer data, will also be hosted in the Trusted zone.

Data in the Trusted zone should follow the guidelines below:

Data should be full / not filtered
Data should be in its most granular format
The source should be identified as the true source / first source
A clear SLA / availability of data should be agreed with the source
Data cataloguing and a DMR should be created with all columns and details of the data set listed
Restricted data should be encrypted at all layers
Data ingestion should be automated
Data ingestion should be supported only as delta data and not full loads

TECHDEBT FOLDER (UNTRUSTED SOURCES)

Datasets which are from untrusted data sources, or are not in raw form (i.e. aggregated / filtered), are considered TechDebt, as they are in non-compliance with the Data Lake Strategy and will need to be retrofitted to become compliant (i.e. the source changed to a trusted source, and the data stored following the UDL/BDL strategy). Example sources are Teradata, Merlin etc.

Technical debt will also be identified, organised, managed and secured by folders.

PROJECT SPECIFIC LANDING ZONE (PSLZ)

Datasets which are local and specific to the product and do not need to be shared across products will only reside within a 'Project Specific Landing Zone' folder, which resides at the same level as the UDL, TechDebt and BDL and on the same ADLS instance.

Project specific data will be identified, organised, managed and secured by folders.

Any data planned for the PSLZ must obtain the approvals below:

Approval and justification from the Data Owner/Data Expertise to scope it to PSLZ
Reason why the data cannot be added into UDL Trusted/Tech Debt.


Clear commitment from the project team to retrofit to Trusted / Tech Debt layer whenever the data is
available in these layers
ICAA / Security assessment of data set to classify the data into different categories (restricted, confidential,
internal, personal, personal sensitive)

Data in the PSLZ is not sharable in any case. If similar data is required for another project, then that project needs to bring the data again into its own respective PSLZ project folder.

Folder Structure & Organisation

It is imperative that a logical, organised folder structure is in place within the Data Lake, to ensure it remains a Data Lake and does not become a Data Swamp. The organisation of the UDL follows an industry best practice framework, providing a logical, easy to follow structure for both developers and support staff alike. The levels are described below:

1. Top Level – Total Company
2. UDL and BDL
3. Internal and External data sources (note: manual files appear in both). There are also folders for Audit and Logs at this level
4. Data Source (i.e. the system the data is coming from). This can be global, regional or local. The historical file(s) will be held here
5. Data Object Type (i.e. table, extractor, open hub etc.)
6. Data Object Name
7. Country / Global
8. Landed, Processed, Historical files
9. Year
10. Month
11. Day

This structure has been chosen because it is immune to organisational changes. The only time the structure would need to be amended is if a source is changed or added.
It will be the single point of ingestion for all internal and 3rd party data (structured and unstructured) and will act as the single shared capability backbone for all analytics work in Unilever.

This layer will hold a complete set of data (i.e. no rows filtered, no columns removed), so there should be no need to ever go back to the source system to provide the same or similar data. This layer will hold at least as much data as is required by the functions, and as storage is inexpensive, more can be kept if deemed beneficial.

The standard way Spark handles date partitions is to prefix the date related folders with the date partition it relates
to, for example ‘year=2018’, ‘month=01’, ‘day=01’. As Spark is the primary data processing engine which Unilever
will use, particularly for data ingestion, this prefixing has influenced the naming conventions for the date related
folders in UDL/BDL.

Architecture Standards:

Do not use spaces, underscores or special characters in the folder names
Capitalise each word rather than using a space or special character. For example: InternalFiles, ManualFiles etc.
Year, Month and Day folder names are numerical. For example: yyyy=2017, yyyy=2018 etc.; mm=01, mm=02 ... mm=12 for January, February etc.; dd=01, dd=02 ... dd=31 for the day of the month

A worked example is depicted below:


Folder names for dates must be exactly as shown above (i.e. 'yyyy=' rather than 'year='). This is because a partition name cannot be the same as a column name (it causes issues in Spark processing), and the data set might contain a column called 'Year'.
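As an illustration of the folder levels and the date-partition naming above, here is a minimal sketch (not taken from the original document) that composes a UDL 'Landed' path; the source system, object and dataset names are hypothetical, and forward slashes are used as in ADLS paths.

```python
from datetime import date

# Minimal sketch only: composes an example UDL 'Landed' path following the folder
# levels and the yyyy=/mm=/dd= partition naming described above. The source system,
# object type, object name and country below are hypothetical.

UDL_ROOT = "/Unilever/UniversalDataLake"   # trusted-zone root

def udl_landed_path(source_system: str, object_type: str, object_name: str,
                    country: str, run_date: date) -> str:
    return "/".join([
        UDL_ROOT,                    # levels 1-2: total company and UDL
        "InternalFiles",             # level 3: internal vs external data sources
        source_system,               # level 4: source system (global/regional/local)
        object_type,                 # level 5: data object type (table, extractor, ...)
        object_name,                 # level 6: data object name
        country,                     # level 7: country / global
        "Landed",                    # level 8: landed / processed / historical
        f"yyyy={run_date.year}",     # level 9: year ('yyyy=' rather than 'year=')
        f"mm={run_date.month:02d}",  # level 10: month
        f"dd={run_date.day:02d}",    # level 11: day
    ])

print(udl_landed_path("EccGlobal", "Table", "SalesOrders", "Global", date(2018, 1, 1)))
# /Unilever/UniversalDataLake/InternalFiles/EccGlobal/Table/SalesOrders/Global/Landed/yyyy=2018/mm=01/dd=01
```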

Data Ingestion Guidelines

Data must be ingested from the true Source


All Data Sourcing should be automated – with no manual steps between the source and data extraction
Manual files will be accepted if the source is an external Data Provider and these should be loaded via the
File Management Tool
KPIs can be ingested from strategic data sources (e.g. Inventory Snapshot SAP, Finance Connect) if agreed with the data owner as an official source
The data copied in must be at the lowest level of granularity and be a copy of the source
No filtering of the data being pulled in; all columns must be included
The volume of data will in most cases determine whether the ingestion pulls deltas (typically transactions) as opposed to snapshots (typically MRD); the other determining factor is whether the source system can provide deltas (see the ingestion sketch after this list)
The primary key and data types as defined in the Data Catalogue must match the source system
Product Data Stores will pull the data from the BDL (where it is aligned and shareable) or from the UDL.
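To make the delta-versus-snapshot and partition-naming points concrete, a minimal PySpark sketch is shown below. It is illustrative only and assumes it runs on Azure Databricks; the staging path, target path and date values are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

# Minimal sketch only: appends one day's delta extract into a UDL 'Landed' folder,
# partitioned with the yyyy=/mm=/dd= folder naming. Paths and values are hypothetical;
# on Azure Databricks a SparkSession is already available as `spark`.

spark = SparkSession.builder.getOrCreate()

delta_df = spark.read.parquet("/mnt/staging/SalesOrders/2018-01-01")  # hypothetical daily drop

landed = (
    delta_df
    .withColumn("yyyy", F.lit("2018"))  # partition columns named yyyy/mm/dd rather than
    .withColumn("mm", F.lit("01"))      # year/month/day, so they cannot clash with a
    .withColumn("dd", F.lit("01"))      # business column such as 'Year' in the data set
)

(landed.write
       .mode("append")                  # deltas are appended; no full reload of history
       .partitionBy("yyyy", "mm", "dd")
       .parquet("/mnt/udl/Unilever/UniversalDataLake/InternalFiles/EccGlobal/"
                "Table/SalesOrders/Global/Landed"))
```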


Business Data Lake (BDL)


Overview

The BDLs will be hosted in ADLS and there will be one per function – CD, CMI, R&D, Finance, Supply Chain and Marketing (Master Data and others TBC). Data will be read from the UDL, transformed and landed into the BDL using Spark processing; this process will be orchestrated using Azure Data Factory. Business function-specific data lakes:

Are used to create shareable KPIs, calculations, summarisations and aggregations of data sets within that function
Preserve data in its curated form along with any required master data
Will be shared across products that are building analytics capabilities
I&A Technology (or the IT functional platforms within ETS) will develop, support and act as custodian of these
I&A will own (or own via proxy) the definitions for curated data within the BDL
Extensive data cataloguing should be maintained for all KPIs published in the BDL
BDLs should also catalogue the consumers of the data, i.e. the PDSs consuming it

As each function will have its own BDL, each function will determine the retention policy for data within its BDL. This retention will be supported/challenged by I&A, as proxy for the function, to ensure each has enough data to meet its requirements.
These retention policies will be published so product owners can request a longer retention policy if needed by their current or future products.

FOLDER STRUCTURE\ORGANISATION OF BDL

The folder structure for the Business Data Lake follows the same principles as for the UDL, but will have fewer levels.

The structure of the BDL will be:

1. Top Level – Total Company
2. UDL and BDL
3. Functional
4. Process
5. Sub-process

Though this can be tailored to meet the needs of the function if appropriate.

A worked example is depicted below:
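A minimal sketch of such a path, with hypothetical function, process and sub-process names (forward slashes as used in ADLS paths):

```python
# Minimal sketch only: an example BDL folder path following the five levels above.
# The function, process and sub-process names are hypothetical.
BDL_ROOT = "/Unilever/BusinessDataLake"

bdl_path = "/".join([
    BDL_ROOT,          # levels 1-2: total company and BDL
    "SupplyChain",     # level 3: functional
    "Logistics",       # level 4: process
    "OnTimeDelivery",  # level 5: sub-process
])
print(bdl_path)  # /Unilever/BusinessDataLake/SupplyChain/Logistics/OnTimeDelivery
```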


APPROVED SOFTWARE TOOLS AND COMPONENTS

Data Persistence:

Azure Data Lake Storage (ADLS) is where data is physically stored (persisted) within the BDL.

Data Movement and Orchestration:

Azure Data Factory v2 (ADF) will be the primary tool used for the orchestration of data between UDL and BDL.

Data Curation:


Azure Databricks (ADB) is the recommended processing service to transform and enrich the data from the UDL,
applying the business logic required for writing the data back to the BDL.

BDL ARCHITECTURE

This is sample architecture based on CD BDL.


Product Data Stores (PDS)


Overview

Business data lakes provide data that is curated to specific business areas. BDLs alone are not sufficient to make decisions that are cross-business, such as CCBTs or markets, where cross-business processed data is required to deliver specific KPIs or answer granular questions to drive insights.

Product specific data stores (PDS) will be developed to facilitate granular and meaningful insights: they deliver insights via dashboards to assist decision making and respond to natural queries to augment decision making. These PDSs will pull (and push where needed) necessary data from all the associated BDLs.

1. To avoid duplication of data, PDSs are NOT allowed to share data from one PDS to another PDS.
2. Data science specific PDSs are allowed to write data back to the BDL so that it can be used by other PDSs.

STANDARD E2E ARCHITECTURE FOR PDS


Experiment and Market Environments


Experiment Environment

The experiment environment is a time-bound environment provided for quick piloting of a solution. There are two types of experiment environments:

Data Science Experiment Environment: an experiment environment provided for piloting data science use cases. Two types of environments are provided here.
IaaS Environment: Azure Data Science VMs are provided here, hosted in a private network. This is mainly provided to data scientists who are comfortable piloting in a desktop-like environment. IaaS VMs cannot be productionized; once the pilot is complete and the use case can be industrialized, it has to be moved into a PaaS environment.
PaaS Environment: the PaaS environment consists of only PaaS tools for experimentation. Azure Databricks and the ML service are the two approved data science tools provided in this environment. Going with a PaaS environment is suggested, to avoid the migration effort during industrialization (which is higher for an IaaS environment).
Analytics Experiment Environment: the analytics experiment environment is mainly used to pilot reporting / BI applications. This environment consists of only approved PaaS tools.

Approved components: Refer Section 2 - Approved Components for the information on Approved PaaS tools.

Market Environment

The market environment is provided for market use cases where delivery of the product is managed by the respective markets and not by the I&A delivery team. The market development environment is a quick environment provided in order to start development activities while the required process is being sorted out. Every market use case should be aligned with the respective I&A functional team to check for duplication of demand and to align on the delivery process.

The market environment should follow the same process as any product environment. A Dev environment will be provisioned to start with, and the project can bring its own data to pilot and test. The industrialization process for a market environment is similar to that of any product environment. All data consumed by a market environment has to flow through one of the layers of the UDL. The project should follow all the architecture standards and complete the UDL process in order to get the next set of environments for industrialization.

Approved components: any of the approved components listed in Section 2 - Approved Components can be used in a market environment. The environment owner needs to make sure the tools are used as per their approved purpose. Any deviation from the standards needs to be corrected by the project team before industrialization.


Section 1.2 - Environment Provisioning Strategy


Types of Environments

Most applications will operate using the following environments.

1. Development Environment: the development environment is for build activities. Members of the Developer and DevOps security groups can build and deploy components and code in this environment.
2. QA Environment: the QA environment is a locked-down environment, mainly used to test the application. Deployment to the QA environment is done using CI/CD pipelines built as part of Azure DevOps. Deployment and integration testing can be carried out in the QA environment.
3. UAT Environment (optional): the UAT environment is for the business users/SMEs to validate the functionality and requirements. This environment is also used for performance testing. It is also a locked-down environment and the only way to deploy code is through CI/CD release pipelines.
4. PPD (Pre-Prod) Environment: the primary purpose of the PPD environment is to provide an area where hotfixes can be created and tested by members of the DevOps group. A hotfix only applies to an application that is live in its production environment. Using the PPD environment, the DevOps team can support, fix and test fixes for issues from the production system.
5. Production Environment: access to this environment is restricted. All releases must be performed from the production CI/CD release pipeline against an approved RFC. The release team will perform the release and make sure all pull requests are complete and in order.


UDL & BDL Environment Provisioning


Universal Data lake (UDL)

UDL is the single shared capability that will act as the backbone for all analytics work in Unilever. All master and transactional data scattered across the 20+ business systems (true sources) is stored here. Data is stored in its native form (its source form), with time series kept beyond the duration that the sources will maintain.

APPROVED COMPONENTS FOR UDL

Data Lake Storage (Gen1/Gen2) for storage
Azure Data Factory (V2) for ingestion and scheduling
Used for integration with different source systems for data ingestion into the UDL
Databricks as a processing engine for data quality checks

Note: other than the above three Azure PaaS components, no other components are approved in the UDL.

Business data lake(BDL)

BDLs are Unilever business function-specific data lakes that slice and dice data and perform calculations, summarization and aggregation (all the operations needed to curate data that is unique to a business function). They are aligned to Unilever functional boundaries such as CD, CMI, Finance, Supply Chain, R&D, HR, etc.

APPROVED COMPONENTS FOR BDL

Data Lake Storage (Gen1/Gen2) for storage
Azure Data Factory (V2) for ingestion, scheduling and orchestration
Databricks as a processing engine for data quality checks and to build aggregations, KPIs and business logic for the BDLs

Note: other than the above three Azure PaaS components, no other components are approved in the BDL.


Product Environment Provisioning


Overview

While the business data lakes provide data that is curated to specific business areas (as defined above), they will not be sufficient to make decisions that are cross-business, such as CCBTs or markets, where cross-business processed data is required to deliver specific KPIs or answer granular questions to drive insights.

To facilitate granular and meaningful insights, ETS will develop analytics products where we will deliver insights via dashboards to assist decision making, respond to natural queries to augment decision making, and offer on-demand trusted insights via push data for use in automated decision making.

Provisioning Process

Analytics products built on the I&A Azure platform should be aligned with the respective business function I&A director. Once the alignment is done, the product needs to follow the process below for provisioning of the environment.

The Technical Design Authority (TDA), consisting of Azure architects (solution architect, cloud architect, EC, security, UDL, landscape), will review the architecture presented by the solution architect and sign off on the architecture for provisioning.

All I&A products MUST follow I&A architecture process.

Product Industrialization

Please refer to the following flowchart for product industrialization process.


Credits for flowchart: [email protected]

Reach out to the concerned architect for your business function if you have any question.


Experiment Environment
Requirements for requesting an Experimentation Environment

The experiment environment is a time-bound environment provided for quick piloting of a solution, created for feasibility checks or quick business use case piloting. It is valid for 1-3 months, with approvals from work level 3.

TYPES OF EXPERIMENT ENVIRONMENT:

There are two types of experiment environments:

Analytics Experimentation Environment (Business Reporting): the analytics experimentation environment is mainly for BI use cases and consists of only PaaS experiment environments.
Data Science Experimentation Environment: the data science experiment environment can be PaaS or IaaS based on the business use case.
IaaS: the only approved IaaS components are Data Science VMs. No other IaaS components are approved or provided as part of experimentation.
PaaS: any approved PaaS tools.

PROCESS FOR EXPERIMENT ENVIRONMENT PROVISIONING:

SECURITY SELF SIGN-OFF FROM PROJECT

Self sign-off from the experimentation team is mandatory to make sure all the rules listed by security are followed. Every individual working in the experiment environment has to go through the security document and sign off on following it.

Additional Rules from TDA for Self Sign Off:

PaaS: the experimentation project should not create any more SQL components apart from what is provisioned by Landscape.
PaaS: the project should not share SQL service account credentials with anyone other than those approved by the responsible Unilever PM.
PaaS: the firewall should be "On" for all resources and no IP whitelisting is allowed.
IaaS: it is the user's responsibility to make sure no new tools that are against Unilever policies are installed on the IaaS VMs.
Self sign-off holds good only for data sets classified as Confidential and below. Restricted or Sensitive data set usage in the experimentation space has to go through an ISA (Information Security Assessment).
Restricted and Sensitive data should be encrypted in transit and at rest, in all layers of the architecture.

APPROVED PAAS COMPONENTS FOR EXPERIMENT:


HLE – HIGH LEVEL ESTIMATE

All cost on Azure is based on usage (pay as you go), i.e. the number of hours the environment is used. Find the cost guide here.

DATA ACCESS:

Work with the I&A Data Expertise team to identify the data in the UDL and BDL and request approval.

UDL Access : Reach out to UDL Dev Ops team to get access on UDL data.
BDL Access : Reach out to respective BDL owner for any data in BDL.

EXPERIMENTATION REQUEST TEMPLATE

The project team or user has to fill in the Experimentation Environment Request Template and share it with the TDA team to approve the experimentation environment provisioning.


Market Environment
Overview

The market development environment is created for markets to build products quickly, with or without involving the I&A standard delivery teams/SI partners. A market environment should be aligned with the respective business function director. The industrialization process for a market environment is similar to that of any product environment.

Market environments are provided to markets for their development activities only when there is a clear industrialization path defined for the use case; otherwise the team needs to go with an experiment environment.

Market environments need to follow the same process as any development environment. The only difference is that the delivery is not managed by I&A delivery teams.

Market teams will be provided with the standards/guidelines below, which need to be taken care of:
Environment Provisioning process
Standard architecture
Standard Components and approved usage
Architecture Guidelines for components
Data Access and approvals
Security guidelines to be followed
Environment guidelines
Industrialization process
Markets are fully responsible for taking care of all the guidelines mentioned. In case the standards are not met, industrialization of the environment is not allowed.

Only the approved and listed components are allowed for usage.

Access controls are similar to those followed today for project environments.

No support will be provided if the team is not using the components as per the approved usage.

ENVIRONMENT PROVISIONING PROCESS

The business team needs to work with the I&A team to align on the use case and to identify the solution categorization as a market-built solution.

New Environment
Dev Environment – TDA approval mandatory (TDA meetings are conducted only on Tuesdays)
Project BOSCARD and alignment with the I&A delivery director (if not run by I&A Delivery)
Functional requirement – use case details
Scoping Initiation Mail – For all identified data with Data Architect and Data SME
ICAA process initiation Mail
Self Sign off by Project on Security Standards
TDA approval from Architect: with Architecture Artifacts - SLA < 5 Days (TDA review
happens every Tuesday )
TDA approved architecture will be shared with project with all components and
connections defined. (Jira Entry into TDA project)
Additional Environment – TDA approval Mandatory
Approved ICAA
Approval from Data SME on Scoping, DC and DMR for UDL and BDL
Exceptional approval and retrofit alignment for any PSLZ data.
Gate 1 & Gate 2 Checklist completion.
TDA approval from Architect: with Architecture Artifacts - SLA < 5 Days (Once all the above
is provided) - TDA review happens every Tuesday.


TDA approval for further environments – With Jira Entry into TDA project
Existing Environment:
Additional components for an existing environment – offline approval from the Architect if the component is from the approved list – SLA < 2 days
If the component is from the approved list, the Architect can approve offline.
If the component is new and not from the approved list, then the architecture has to go through TDA approval.

GATE PROCESS:

STANDARD ARCHITECTURE

Please refer to https://ulatlprod01.atlassian.net/wiki/x/LID2bg for the approved solution architecture.

Here is the Market based architecture template for Market Led project environment.

STANDARD COMPONENTS AND APPROVED USAGE

Please refer to Section 2 - Approved Components for details about approved components for PDS

ARCHITECTURE GUIDELINES FOR COMPONENTS

Please refer to Section 2 - Approved Components and right practices for each component

DATA ACCESS AND APPROVALS

Project requires data from UDL and BDL


UDL Access: Reach out to UDL Dev Ops team. UDL Dev Ops Contact ([email protected]) .
BDL Access: Reach out to respective BDL Owner.
Project requires data which is not in the UDL or BDL: if the project requires any data set apart from the data existing in the UDL and BDL, the team needs to work on bringing the data into the UDL, i.e. the strategic environment. For development purposes the project can bring its own data, but industrialization is allowed only through the UDL. Integration patterns with source systems for bringing any new data have to be aligned with the UDL team and the architecture team.

SECURITY GUIDELINES TO BE FOLLOWED

Market project team needs to work with security to derive data classification for the data sets brought directly
into project environment. (ICAA)
Restricted and sensitive data needs to be encrypted at all layers (at rest and transit). PII data needs to go
through Data protection Impact Assessment (DPIA) with security.
All components hosted in Azure must have the firewall ON. Access is allowed only for Unilever IDs through MFA-enabled AD groups.
Market Project team is not allowed to share any data from this environment with any other application or
external system.
Market Project team should not provide any access or credentials of the environment to any user who is not
supposed to view the data.
It is the project's responsibility to make sure data is accessed by the right users. Make sure the team has gone through the security guidelines listed in the link below.
In case of any confusion about the security guidelines, the project team should contact either the security team or the solution architecture team.

Read the document Security Guidelines and sign off on the security.

ENVIRONMENT GUIDELINES

Only Dev environment will be provided to the project.


Cost of the environment is the responsibility of the project team, including managing the cost within optimal limits. The project team will be able to pause and resume components as per usage. The cost of the environment will be charged back periodically to the cost center provided.
On completion of the build, the market project team needs to request the higher environments through the architecture team. A minimum of 4 environments is provided as per the strategy (Dev, QA, PPD and Prod); the PPD environment is optional and the project team can opt out.
Azure DevOps must be used for CI/CD or deployment to higher environments and as the code repository. The project team needs to build the respective scripts to automate the deployment.
The DevOps home page will have the user guide for the environment.
No support will be provided on how to build the code or the components.

INDUSTRIALIZATION PROCESS

Industrialization of the solution must be done only in the production environment. The development environment provided will be only for build activities.
Industrialization or deployment to higher environments must be done using the Azure DevOps component of Azure.
The market project team is completely responsible for the below:
Automating the deployment using Azure DevOps scripts
Managing deployment through Azure DevOps
Managing the service
Managing DevOps / support in the industrialized system
No admin privileges will be provided in higher environments, hence no manual updates of parameters or configuration will be allowed. Any release/deployment must be handled through automated scripts in Azure DevOps.


Section 1.3 - Naming Convention


AD Groups Naming convention
Applications Naming convention
Platform Naming Conventions
Azure Data Factory
Generic naming conventions for ADF
Linked services and datasets
Pipelines & Activities
Pipeline Naming
Activities Naming

AD Groups Naming convention

AD entity | Standard | Example
Data Reader Group | SEC-ES-{platform}-{env}-{ITSG}-data-reader | SEC-ES-fs-d-54321-data-reader
Storage Reader Group | SEC-ES-{platform}-{env}-{ITSG}-stg-reader | SEC-ES-fs-d-54321-stg-reader
ADLS Trusted Data Reader | SEC-ES-{platform}-{env}-{ITSG}-{entity}-reader | SEC-ES-fs-d-54321-entity-reader
Developer User Group | SEC-ES-{platform}-{env}-{ITSG}-developer | SEC-ES-fs-d-54321-developer
Tester User Group | SEC-ES-{platform}-{env}-{ITSG}-tester | SEC-ES-fs-d-54321-tester
Landscape User Group | SEC-ES-{platform}-{env}-{ITSG}-landscape | SEC-ES-fs-d-54321-landscape
Support User Group | SEC-ES-{platform}-{env}-{ITSG}-support | SEC-ES-fs-d-54321-support
Read Only User Group | SEC-ES-{platform}-{env}-{ITSG}-readonly | SEC-ES-fs-d-54321-readonly
Service Principal Name (SPN) | svc-b-{platform}-{env}-{ITSG}-ecosystem-AADprincipal | svc-b-fs-d-54321-ecosystem-AADprincipal
Build service principal | b{regionCode}-{platform}-{env}-{ITSG}-ecosystem-deployment-app | bieno-fs-d-54321-ecosystem-deployment-app
SPN URI | http://b{regionCode}-{platform}-{env}-{ITSG}-ecosystemdeploymentap | bieno-fs-d-54321-ecosystemdeploymentap
Resource Group (app) (1-64) | b{regionCode}-{platform}-{env}-{ITSG}-app-rg | bieno-fs-d-54321-app-rg
Resource Group (data) | b{regionCode}-{platform}-{env}-{ITSG}-data-rg | bieno-fs-d-54321-data-rg
Resource Group (storage) | b{regionCode}-{platform}-{env}-{ITSG}-stg-rg | bieno-fs-d-54321-stg-rg
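A small helper can compose these AD group and service principal names consistently. The sketch below is illustrative only; the platform code, environment letter and ITSG number used in the examples are the sample values from the table above.

```python
# Minimal sketch only: composes AD group and deployment service principal names
# following the patterns in the table above. The inputs used in the examples are
# the sample values from the table (platform 'fs', environment 'd', ITSG '54321').

def ad_group(platform: str, env: str, itsg: str, suffix: str) -> str:
    return f"SEC-ES-{platform}-{env}-{itsg}-{suffix}"

def build_service_principal(region_code: str, platform: str, env: str, itsg: str) -> str:
    return f"b{region_code}-{platform}-{env}-{itsg}-ecosystem-deployment-app"

print(ad_group("fs", "d", "54321", "data-reader"))        # SEC-ES-fs-d-54321-data-reader
print(ad_group("fs", "d", "54321", "developer"))          # SEC-ES-fs-d-54321-developer
print(build_service_principal("ieno", "fs", "d", "54321"))
# bieno-fs-d-54321-ecosystem-deployment-app
```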


Applications Naming convention

Application | Identifier standard | Identifier example
SQL DB | b{regionCode}-{platform}-{env}-{ITSG}-unilevercom-sqldb-{NN} | bieno-fs-d-54321-unilevercom-sqldb-01
SQL DW | b{regionCode}-{platform}-{env}-{ITSG}-unilevercom-sqldw-{NN} | bieno-fs-d-54321-unilevercom-sqldw-01
HDInsight | b{regionCode}-{platform}-{env}-{ITSG}-unilevercom-hdi-{NN} | bieno-fs-d-54321-unilevercom-hdi-01
DocumentDB | b{regionCode}-{platform}-{env}-{ITSG}-unilevercom-documentDB-{NN} | bieno-fs-d-54321-unilevercom-documentDB-01
ADF (3-63) | b{regionCode}-{platform}-{env}-{ITSG}-adf-{NN} | bieno-fs-d-54321-adf-01
Data Factory Gateway | b{regionCode}-{platform}-{env}-{ITSG}-dfgw-{NN} | bieno-fs-d-54321-dfgw-01
ADLA | b{regionCode}-{platform}-{env}-{ITSG}-appadla-{NN} | bieno-fs-d-54321-appadla-01
Search | b{regionCode}-{platform}-{env}-{ITSG}-appsearch-{NN} | bieno-fs-d-54321-appsearch-01
Batch Account | b{regionCode}-{platform}-{env}-{ITSG}-appban-{NN} | bieno-fs-d-54321-appban-01
Batch Pool | b{regionCode}-{platform}-{env}-{ITSG}-appbap-{NN} | bieno-fs-d-54321-appbap-01
Event Hub (50) | b{regionCode}-{platform}-{env}-{ITSG}-eventhub-{NN} | bieno-fs-d-54321-eventhub-01
Function | b{regionCode}-{platform}-{env}-{ITSG}-function-{NN} | bieno-fs-d-54321-function-01
ADLS | b{regionCode}-{platform}-{env}-{ITSG}-stgadls | bienod54321stgadls
KeyVault | b{regionCode}-{platform}-{env}-{ITSG}-appunileverk-{NN} | bieno-fs-d-54321-appunileverk-01
ML | b{regionCode}-{platform}-{env}-{ITSG}-unilevercom-ml-{NN} | bieno-fs-d-54321-unilevercom-ml-01
WebApp | b{regionCode}-{platform}-{env}-{ITSG}-unilevercom-webapp-{NN} | bieno-fs-d-54321-unilevercom-webapp-01
Storage Account (3-24) | b{regionCode}<type><class><nnn>unilevercomstg | bienoaa123unilevercomstg
Eco System VSTS Team Project | b{regionCode}-{platform}-{env}-{ITSG}-vstsp-{NN} | bieno-fs-d-54321-vstsp-01
Eco App VSTS Team Project | https://b{regionCode}-{platform}-{env}-{ITSG}-unilevercom-vsts.visualstudio.com/{regionCode}-{env}-{ITSG}-vstsp/ | bnlwe-fs-p-54321-unilevercom-vsts.visualstudio.com/{regionCode}-p-{ITSG}-vstsp/
LogicApp | b{regionCode}-{platform}-{env}-{ITSG}-unilevercom-logicapp-{NN} | bieno-fs-d-54321-unilevercom-logicapp-01

Maximum length of identifiers = 128 characters, except where shown.
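The resource identifiers above follow the same b{regionCode}-{platform}-{env}-{ITSG} pattern with a per-type suffix and length limit; a minimal, illustrative helper is sketched below (the inputs are the sample values from the table, and the length limits are the ones noted alongside each resource type).

```python
# Minimal sketch only: composes an Azure resource identifier from the pattern above
# and checks it against a per-type length limit (128 by default, e.g. 63 for ADF).

def resource_name(region_code: str, platform: str, env: str, itsg: str,
                  suffix: str, nn: int = 1, max_len: int = 128) -> str:
    name = f"b{region_code}-{platform}-{env}-{itsg}-{suffix}-{nn:02d}"
    if len(name) > max_len:
        raise ValueError(f"{name!r} exceeds the {max_len}-character limit")
    return name

print(resource_name("ieno", "fs", "d", "54321", "adf", 1, max_len=63))
# bieno-fs-d-54321-adf-01
print(resource_name("ieno", "fs", "d", "54321", "unilevercom-sqldb", 1))
# bieno-fs-d-54321-unilevercom-sqldb-01
```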

Platform Naming Conventions

Short name | Platform
FS | Cloud
CE | Consumer Engagement
CS | Consumer Services and Operations
CP | Commerce / Consumer Performance
DM | Leveredge Distributor Management
FR | Frontier
CS | CD-SAP and O2
ES | eScience
PL | PLM
GS | Global Supply Chain
GF | Global Finance
SM | Solution Manager and GRC
SR | Core ERP - Sirius
CO | Core ERP - Cordillera
UK | Core ERP - U2K2
FU | Core ERP - Fusion
WP | Workplace
HR | HR
CF | Corporate Functions
DA | Core Data Ecosystem
MD | Master Data
OV | OneView Information Landscape
CA | Central Analytics
CL | Collaboration
ME | Messaging, ID and Access Management
ST | Support Tools / ICM
NT | Integration
SE | Security
DV | Devices
PR | Infrastructure (On Premise) & IT for IT
NE | Networks
SA | SAP

Azure Data Factory


Generic naming conventions for ADF

There are a few standard naming conventions which apply to all elements in Azure Data Factory.

Names must be unique within a data factory. Names are case-insensitive.

Object names must start with a letter or a number, and can contain only letters, numbers, and the dash (-) character.
The maximum number of characters for a name is 260.
The following characters are not allowed (a simple validation sketch follows this list):

. + ? / < > * % & : \
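A minimal sketch of how the length, start-character and disallowed-character rules above could be checked is shown below. Note that the linked service, dataset and pipeline prefixes later in this section use underscores, so only the explicitly disallowed characters are rejected here; this is an illustration, not the exact validation ADF performs.

```python
# Minimal sketch only: checks an ADF object name against the rules above
# (starts with a letter or number, at most 260 characters, none of the
# characters listed as not allowed).

FORBIDDEN = set(".+?/<>*%&:\\")

def is_valid_adf_name(name: str) -> bool:
    if not 1 <= len(name) <= 260:
        return False
    if not name[0].isalnum():
        return False
    return not any(ch in FORBIDDEN for ch in name)

print(is_valid_adf_name("LS_ADLS_UniversalDataLake"))  # True
print(is_valid_adf_name("DS_ASQL_Orders?"))            # False: '?' is not allowed
```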

Linked services and datasets

A linked service connects data from a source to a destination (sink), so linked service names are similar to dataset names.

Linked Service Naming:

LS_[<Type of Service Abbreviation>]_[<Service Name>]

Dataset Naming:

DS_[<Type of Data Set>]_[<Dataset Name>]

The table below lists the prefixes for linked services and datasets per store type.

Type | Linked Service | Store Prefix | Linked Service Prefix | Dataset Prefix | Example
Azure | Azure Blob storage | ABLB_ | LS_ABLB_ | DS_ABLB_ | LS_ABLB_{Dataset}
Azure | Azure Data Lake Store | ADLS_ | LS_ADLS_ | DS_ADLS_ | LS_ADLS_{Dataset}
Azure | Azure SQL Database | ASQL_ | LS_ASQL_ | DS_ASQL_ | LS_ASQL_{Dataset}
Azure | Azure SQL Data Warehouse | ASDW_ | LS_ASDW_ | DS_ASDW_ | LS_ASDW_{Dataset}
Azure | Azure Table storage | ATBL_ | LS_ATBL_ | DS_ATBL_ | LS_ATBL_{Dataset}
Azure | Azure DocumentDB | ADOC_ | LS_ADOC_ | DS_ADOC_ | LS_ADOC_{Dataset}
Azure | Azure Search Index | ASER_ | LS_ASER_ | DS_ASER_ | LS_ASER_{Dataset}
Azure | Key Vault | KV_ | LS_KV_ | DS_KV_ | LS_KV_{Dataset}
Databases | SQL Server* | MSQL_ | LS_SQL_ | DS_SQL_ | LS_SQL_{Dataset}
Databases | Oracle* | ORAC_ | LS_ORAC_ | DS_ORAC_ | LS_ORAC_{Dataset}
Databases | MySQL* | MYSQ_ | LS_MYSQ_ | DS_MYSQ_ | LS_MYSQ_{Dataset}
Databases | DB2* | DB2_ | LS_DB2_ | DS_DB2_ | LS_DB2_{Dataset}
Databases | Teradata* | TDAT_ | LS_TDAT_ | DS_TDAT_ | LS_TDAT_{Dataset}
Databases | PostgreSQL* | POST_ | LS_POST_ | DS_POST_ | LS_POST_{Dataset}
Databases | Sybase* | SYBA_ | LS_SYBA_ | DS_SYBA_ | LS_SYBA_{Dataset}
Databases | Cassandra* | CASS_ | LS_CASS_ | DS_CASS_ | LS_CASS_{Dataset}
Databases | MongoDB* | MONG_ | LS_MONG_ | DS_MONG_ | LS_MONG_{Dataset}
Databases | Amazon Redshift | ARED_ | LS_ARED_ | DS_ARED_ | LS_ARED_{Dataset}
File | File System* | FILE_ | LS_FILE_ | DS_FILE_ | LS_FILE_{Dataset}
File | HDFS* | HDFS_ | LS_HDFS_ | DS_HDFS_ | LS_HDFS_{Dataset}
File | Amazon S3 | AMS3_ | LS_AMS3_ | DS_AMS3_ | LS_AMS3_{Dataset}
File | FTP | FTP_ | LS_FTP_ | DS_FTP_ | LS_FTP_{Dataset}
Others | Salesforce | SAFC_ | LS_SAFC_ | DS_SAFC_ | LS_SAFC_{Dataset}
Others | Generic ODBC* | ODBC_ | LS_ODBC_ | DS_ODBC_ | LS_ODBC_{Dataset}
Others | Generic OData | ODAT_ | LS_ODAT_ | DS_ODAT_ | LS_ODAT_{Dataset}
Others | Web Table (table from HTML) | WEBT_ | LS_WEBT_ | DS_WEBT_ | LS_WEBT_{Dataset}
Others | GE Historian* | GEHI_ | LS_GEHI_ | DS_GEHI_ | LS_GEHI_{Dataset}

Pipelines & Activities

PIPELINE NAMING

PL_[<Capability>]_[<TypeofLoad>]_[<MeaningName>]

Note:

A pipeline can contain more than one activity.
Capability: the capability / business area acronym for which the ADF pipeline is defined.
Type of Load: D (Daily), W (Weekly), M (Monthly) or Y (Yearly).
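For example, a hypothetical daily pipeline loading sales orders for a Supply Chain (SC) capability could be named PL_SC_D_SalesOrders (name shown for illustration only).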

ACTIVITIES NAMING

AT_[<COMPUTE ENVIRONMENT>]_[SOURCETYPE]_[SOURCE_DATASET/TABLE]_TO_[TARGETTYPE]_[TARGET_DATASET/TABLE]_[OPERATION - MeaningName]

Note:

SOURCETYPE & TARGETTYPE can be FF / LS / DB.
If there is no movement of data but only some operation within the compute environment, then the source part is optional and the "Operation" part should be a meaningful name.
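For example, a hypothetical copy activity moving a database table to a flat file might be named AT_CPY_DB_CustomerMaster_TO_FF_CustomerMaster_DailyExtract (illustrative only).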

Activity Type | Activity Prefix | Compute environment | Example

Data movement Activity (Copy) | AT_COPY_ | CPY | AT_CPY_
Data transformation Activity | AT_ | SP - Stored Procedure | AT_SP_
Data transformation Activity | AT_ | DNET - Script | AT_DNET_
Data transformation Activity | AT_ | HIVE - Hive | AT_HIVE_
Data transformation Activity | AT_ | PIG - Pig | AT_PIG_
Data transformation Activity | AT_ | MAPR - MapReduce | AT_MAPR_
Data transformation Activity | AT_ | HADP - Hadoop Stream | AT_HADP_
Data transformation Activity | AT_ | AML - Azure Machine Learning | AT_AML_
Data transformation Activity | AT_ | ADLA - Azure Data Lake Analytics | AT_ADLA_
Data transformation Activity | AT_ | SPK - Spark on HDInsight | AT_SPK_
For each loop | AT_ | ADF | AT_FEACH_
Look up | AT_ | ADF | AT_LKP_
Execute Pipeline | AT_ | ADF | AT_EXEP_
Filter | AT_ | ADF | AT_FLTR_
Get Meta Data | AT_ | ADF | AT_META_
If condition | AT_ | ADF | AT_IF_
Until Activity | AT_ | ADF | AT_UNTIL_
Wait Activity | AT_ | ADF | AT_WAIT_
Web Activity | AT_ | ADF | AT_WEB_


Section 1.4 - File Management Tool - FMT


Overview

FMT is a file management tool (web app) that is used to upload manual files to a blob storage in Azure.

Pre-requisites

The users should have an AD group created for their project in the format Sec-Azo-FMTTool-<ProjectName>. Please note that each project needs to have a unique AD group.
Only the team members who are part of the AD group (mentioned in point 1) will have access to upload files.
The user should provide the DMR/Schema in the attached format to the Unilever Data Architect Team for approval.
The user needs to share the attached DMR & Data Catalogue (in the approved format only) with the Development/DevOps team for the FMT tool Admin to upload it from the backend.
Data files will be validated against the schema/DMR.

INSTRUCTIONS FOR CREATING A DATA CATALOGUE

1. Please select the "Data Catalogue" tab to input a data catalogue entry. Please note that all the fields are mandatory.
2. Please make sure there are no empty rows above the header row (data should start from A1).
3. Please note that for every project there is a unique AD group. Kindly input the same. If you do not have the AD group information for your project, kindly make sure to create an AD group.
4. Please make sure the Project+DataSet+DMR-Entity combination is unique.
5. For all naming conventions (except the AD group), kindly do not use the hyphen "-"; use the underscore "_" instead.

INSTRUCTIONS FOR CREATING A DMR/ SCHEMA

1. For each Data Catalogue entry, please make sure there is a corresponding DMR/schema entry.
2. Please select the "DMR Or Schema" tab to make the DMR entry. Please note that all the fields are mandatory.
3. Please make sure that there are no empty rows above the header row (data should start from A1).
4. The supported datatypes are shown in the table below. Kindly choose from these.

S No | Datatype | Format

1 | numeric | requires scale and precision
2 | decimal | requires scale and precision
3 | char | normal
4 | varchar | normal
5 | nvarchar | normal
6 | date | normal
7 | time | hh:mm:ss

Key Highlights:


The (user) project team needs to create an AD group specific (unique) to that project.
Kindly follow the format "Sec-Azo-FMTTool-<ProjectName>" for the naming convention.
Please make sure that all the members who need access to the tool are added to the created AD group.

Data upload using FMT

The upload page allows the user to upload a file (one at a time) to be validated against a pre-defined schema.

STEPS TO BE FOLLOWED

1. Select the 'Project' for which the file has to be uploaded.
2. Select the corresponding 'Dataset' for the selected project and similarly the 'File Schema Name' for the selected dataset.
3. The 'Schema Information' for the selected Project, Dataset and File Schema Name gets populated automatically. These are read-only fields and should be verified to check that the information displayed relates to the selected project.
4. If the selections made need to be changed, click the Reset button to clear the previous selections and make new selections.
5. Click 'Select File' to select the file from the user's system which needs to be uploaded.
6. Click 'Upload File' and wait for the file to upload.
7. An auto pop-up message confirms that the file has been uploaded successfully to the landed folder. The user can also view the status of the file upload into Blob Storage on the Upload Status page.
8. The user will receive an email notification on his/her Unilever email ID from the FMT Tool regarding the upload status (success/failure) into Blob Storage after a short while.

If the user is unable to see the respective project/dataset/schema information, please ensure that the schema (DMR) and data catalogue have been approved by the Unilever Data Architecture team (TDA). Post approval, please send the DMR and Data Catalogue to the FMT admin team to load them from the backend.

Key Highlights:

The tool accepts only flat files (.txt), Excel files with a single worksheet (.xls and .xlsx) and delimiter-separated files (comma, tab and pipe).


The file size should not exceed 1GB


If the user is unable to see the respective project/dataset/schema information, please refer to the prerequisites.
The user has to wait while the tool prepares the file for upload. The time taken will depend on the size of the file. If the user logs out or closes the browser window during the file preparation process, the upload session will terminate and the user will have to restart from the first step (selecting the project from the Project dropdown) on the next login.
The time taken to upload a file also depends on the speed/congestion/bandwidth of the user's network.
While uploading, the user can navigate to another task, but cannot log out.

Upload Status Page

Once the file has been uploaded successfully, the user can check the status of the uploaded file here.

TYPES OF STATUS

Waiting for Validation – The tool is busy validating other files and the user's file is in the queue.
Validating – The file has been picked up by the tool and is being processed against the selected schema.
Processed – The file has passed all the rules mentioned in the selected schema and has been processed successfully.
Failed – The file has failed one or more rule(s) defined in the selected schema.
Unexpected system error – There is an unexpected system error: the tool is not working properly, the uploaded file was not found in the Azure Blob container for validation, or the schema against which the validation has to be performed is not present.

STEPS TO BE FOLLOWED

1. By default, the user can view the status of the latest uploaded file on this page (last uploaded shows first in
queue). However, the user can also sort and filter the records as per requirement.
2. The user can sort and filter the records by selecting any of the 7 dropdowns to apply the filter based on
which the upload status of the uploaded file will be shown.
3. Click Apply Filter to see the status of the files uploaded based on the selections made in the top section.
4. View Details mentioned in the status table can be clicked to understand the status better

UPLOAD STATUS PAGE- FILTERS AND STATUS

Search Filter Fields

Filter Fields | Description

Project | Lists all the projects that the user is allowed to access
File Schema Name | The File Schema Name related to the project which was selected at the time of upload
Source Details | The source of the data. For example: All region, Cordillera etc.
Type of Data | Master, transaction or reference data
Dataset | The dataset selected at the time of upload
Extraction Frequency | The frequency at which the data is extracted from the source system. For example: weekly, monthly, annually etc.
Upload Status | Lists the statuses of the file upload process, from which the status has to be selected. For example: Waiting for Validation, Validating etc. The list of upload statuses is provided in Types of Status.

These read-only fields are auto-populated once the selections are made in the top section of the Upload Status Page. They help the user get a better understanding of the status of the uploaded file.

Status Table

Status Headings | Description

Uploaded On | Shows the server date and time when the file was uploaded
File Name | The name of the uploaded file
Status | The status of the file: pending validation, validated and uploaded successfully, or failed validation, as shown in Types of Status
File Upload Details | On clicking 'View Details', the description of the upload status
Project | The project for which the file has been uploaded
Dataset | The dataset associated with the uploaded file
Source Details | The source system from which the data in the uploaded file is taken, e.g. Cordillera
Type Of Data | Master, transaction or reference data
Extraction Frequency | The frequency at which the data is extracted from the source system. For example: weekly, monthly, annually etc.
Uploaded By | The email ID of the user who uploaded the file

UPLOAD STATUS PAGE DESCRIPTION


UPLOAD STATUS PAGE- STATUS DESCRIPTION

View Details in the upload status page can be clicked to understand the details of the uploaded file.

Types of Status Descriptions

Waiting for Validation – The tool is busy validating other files and the user's file is in the queue waiting to be picked up for validation.
Validating – The file has been picked up by the tool and is being processed against the selected schema.
Processed – The file has passed all the rules mentioned in the selected schema and has been processed successfully.
Failed – The file has failed validation and errors will be shown with respect to row and column.
Unexpected system error – The file could not be validated due to an unexpected system error: the tool is not working properly, the uploaded file was not found in the Azure Blob container for validation, or the schema against which the validation has to be performed is not present.

Steps to be followed:

1. Click on View Details under File Upload Details.


2. The status will be shown as mentioned in Types of Status Descriptions.
3. If the file has failed validation, the errors will be shown with respect to rows and columns. The user can view
these errors by clicking on the subsequent error pages.

UPLOAD STATUS- VIEW DETAILS DESCRIPTION


Upload Status Page - Highlights

By default, the user can view the status of the 10 most recently uploaded files. The user can change the number of records viewed at a time using a dropdown (25, 50 or 100).
The time of upload shows the server time at which the file was uploaded.
The user can view the status only of the files uploaded by him, not those of other users.
Only 30 days of history of the files uploaded by users is maintained.

Audit Log

The history of the uploaded files can be viewed in this section. The user can view all the operations that an uploaded file goes through before being ingested into UDL.

Operations performed on the File:

1. File Upload : The file is uploaded into the tool and waits for validation
2. File Validation: The file is validated against the File Schema selected during the time of upload and checked
for errors.
3. File Processed/ Failed: If errors are found in the file, the file is moved to the Failed Zone and, if the file has
been successfully validated, it is moved to the Processed Zone and thereafter, into UDL.

Steps to be followed:

1. By default, the user can view the logs of the latest uploaded files on this page (last uploaded shows first in
queue). However, the user can also sort and filter the records as per requirement.
2. To sort and filter, the user needs to select any of the 3 dropdowns based on which the logs of the uploaded
file will be shown
3. Click Apply Filter to see the logs against the status of the files uploaded based on the selections made in the
top section

AUDIT LOG- FILTERS AND LOG

Filter Headings:


Search Filter Fields | Description

Upload Status | Lists the statuses of the file upload process, from which the status has to be selected. For example: Waiting for Validation, Validating etc. The list of upload statuses is provided in Types of Status.
Start Date | The date from which the log has to be viewed
End Date | The date till which the log has to be viewed

The audit log details:

Log Headings | Description

Time | Shows the server date and time when the file was taken for an operation (see File Operations)
Uploaded By | The email ID of the user who uploaded the file
File Name | The name of the uploaded file
File Size | The size of the uploaded file
Status | The status of the file (see Types of Status)
Details | Explains the status details of the uploaded file

AUDIT LOG DESCRIPTION


Audit Log - Highlights:

The user can view 10 files at a time on a single page. The view can be changed using a dropdown to 25, 50 or 100 records.
The time of upload shows the server time at which the file was uploaded.
The user can view the status only of the files uploaded by him.
Data is shown for the past 30 days.
The user can, however, search by the required dates and retrieve the history of uploaded files.


Section 1.5 - HA (High Availability) & DR (Disaster Recovery)


Overview

High Availability: High availability strategies are intended for handling temporary failure conditions to allow the
system to continue functioning.

Disaster recovery is the process of restoring application functionality in the wake of a catastrophic loss.

An organization's tolerance for reduced functionality during a disaster is a business decision that varies from one application to the next. It might be acceptable for some applications to be unavailable or to be partially available with reduced functionality or delayed processing for a period of time. For other applications, any reduced functionality is unacceptable.

DR rating for each application is calculated based on the business criticality by business or technology owners.

SC/DR – Unilever General Guidelines

UPTIME & SERVICE CRITICALITY

% Uptime | Service Availability | Unplanned Downtime per Year | Best Possible Fix Time (Hours) | Service Class (SC) Equivalent

99.9 | Critical | 8 hrs 46 mins | 4 | SC1
99.4 | High | 52 hrs 43 mins | 4 | SC2
98.5 | Medium | 131 hrs 24 mins | 8 | SC3
97 | Low | 262 hrs 48 mins | 24 | SC4

DR & RTO

DR Class | RTO

1 | up to 12 hours
2 | > 12 hours & up to 24 hours
3 | > 24 hours & up to 72 hours
4 | > 72 hours & up to 14 days
5 | > 14 days & up to 2 months

UDL to be internally treated as SC1/DR1 service though service catalog calls this out as SC3/DR3 service

All other components to be tagged as SC3/DR3 unless there is a separate tagging undertaken by projects as
per business requirement.

Microsoft SLA for Azure Components

The table below summarizes the SLA that Microsoft already offers for each component for business continuity.

Component SLA


Azure Active Directory 99.9%

Azure Analysis Services 99.9%

API Management Service – Standard 99.9%

API Management Service – Premium 99.95%

Azure Devops 99.9%

Azure Databricks 99.95%

ADLS 99.9%

HDInsight 99.9%

Key Vault 99.9%

Azure Data Factory 99.9%

Azure SQLDW 99.9%

Log Analytics 99.9%

Load Balancer 99.99%

LogicApps 99.9%

ML Studio - Request Response Service 99.95%

PBI embedded 99.9%

Service Bus 99.9%

Azure SQL DB (ZRS) 99.995%

Azure SQLDB (GP, Stnd, Basic) 99.99%

Storage Accounts - RA-GRS - Read 99.99%

Write 99.9%

Azure Stream Analytics 99.9%

Azure Data Share 99.9%

Single Instance Virtual Machine 99.9%

Microsoft SLA and Downtime Details

The Microsoft SLA translates to downtime per week, month and year as depicted below. In most cases a 99% SLA suffices for Unilever's SC3/DR3 service requirement for PaaS components.

SLA | Downtime per week | Downtime per month | Downtime per year

99% | 1.68 hours | 7.2 hours | 3.65 days
99.90% | 10.1 minutes | 43.2 minutes | 8.76 hours
99.95% | 5 minutes | 21.6 minutes | 4.38 hours
99.99% | 1.01 minutes | 4.32 minutes | 52.56 minutes
99.999% | 6 seconds | 25.9 seconds | 5.26 minutes
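As a worked example, a 99.9% SLA allows (1 - 0.999) x 8,760 hours, i.e. roughly 8.76 hours of downtime per year, which is how the figures above are derived.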

Now let's look at the business continuity design of each of the I&A platforms in detail.

Business Continuity High Level Plan for Unilever Data Lake

UDL & BDL (HA/DR)

BUSINESS CONTINUITY DETAILED LEVEL PLAN FOR UDL & BDL (ADLS GEN1)

BUSINESS CONTINUITY DETAILED LEVEL PLAN FOR UDL & BDL (ADLS GEN2)


Product (HA/DR)

OPTION 1: MICROSOFT SLA

Go with Microsoft SLA with the assumption that the service will be up and running as per the SLA summarized
earlier (Not suggested for SC2 & SC1 applications)

Advantages

No additional environment is required and hence no additional operational cost.

Risk

Dependency on Microsoft to bring up the component as per the SLA. If Microsoft cannot bring up the component within the agreed SLA, business will be impacted.
No guarantee on service restoration.
Lost data may or may not be recovered. There is no guarantee from the MS team on data restoration if the service goes down.

OPTION 2: SETUP DR ENVIRONMENT – (ONLY DATA BACKUP)- (ACTIVE - PASSIVE)

Set up a DR environment in the paired region that acts as a passive environment, with only data backed up regularly; during an outage projects can connect to the DR environment to ensure business continuity.

Setting up the DR environment in the paired region avoids any region failure or outage.


Advantages

Parallel environment with data backup is kept as standby.

Minimum downtime in case of disaster, as RTO is less than a day.
RPO can be 100%, with delta data extracted from the source for the missing days.
Well defined steps for recovery.

Risk

Minimal additional cost to keep the backup of the data in the secondary region.

OPTION 3: SETUP DR ENVIRONMENT AS ACTIVE STAND BY – (ACTIVE - ACTIVE)

Advantages

Parallel environment with data backup is kept active.

Minimum downtime in case of disaster, as RTO is less than a day.
RPO can be 100%, with delta data extracted from the source for the missing days.
Well defined steps for recovery.

Risk

Additional cost to keep an active copy of the data in the secondary region.

DR ENVIRONMENT SETUP ACTIVITIES (COMPARISON)

Components | Option 1 (MS SLA) | Option 2 (Active-Passive) | Option 3 (Active-Active)

ADLS | None | Daily incremental backup | Hourly incremental backup
SQLDW | None | Geo-replication. Spin up a new instance and restore from backup during an outage | User defined backup every 8 hours. Restore from backup to DR instance and pause.
AAS | None | Daily backup to Azure Storage. Spin up a new instance and restore from backup during an outage. | Backup to Azure Storage every 8 hours. Restore from the backup to the DR AAS instance daily and pause.
Azure Storage | None | RA-GRS provides read access from the DR instance. | RA-GRS for read availability. Copy data from the storage account in the secondary region to another storage account. Point applications to that storage account for both read and write availability.
ADF, ADB | None | Every release into primary also has a release to the redundant instance | Every release into primary also has a release to the redundant instance
KeyVault | None | Every release, if changes are added to the key vault, it is backed up to Blob and restored from there during an outage | Every release, if changes are added to the key vault, it is backed up to Blob and restored to a new instance of Key Vault.
COST COMPARISON OF 3 OPTIONS

Component | Option 1 (MS SLA) | Option 2 (Active-Passive) | Option 3 (Active-Active)

ADLS Gen 1 | No cost | Storage - €33.68, Write Txn 300 - €12.65 | Storage - €33.68, Write Txn 900 - €37.95
SQLDW | Basic RA-GRS cost - €10 | Basic RA-GRS cost - €10 | Compute - €70.33, Storage - €113.99
AAS | No cost | Blob Storage - €10 | Blob Storage - €10, AAS cost - €205.43
Azure Storage | Basic RA-GRS cost - €10 | Basic RA-GRS cost ~ €10 | Basic RA-GRS cost ~ €10, new Blob ~ €20
KeyVault | No cost | Blob Storage - €10 | Blob Storage - €10

KEY ASSUMPTIONS

Region - North Europe,


ADLS - 1TB ; SQLDW -100DWU- 60 hours, 1TB storage ; AAS -S2 - 60 hours – 50GB;

Blob Storage- LRS- GPV2- Std- 50GB; Keyvault – Storage Size – 50GB


Section 1.6 - Data Management


1 Data Quality
2 Data Validation
3 Data Profiling Rules
4 Archival and Purging
4.1 Purging
5 MCS Framework Reference

Data Quality

Data quality refers to the state of qualitative or quantitative pieces of information. There are many definitions of data
quality, but data is generally considered high quality if it is "fit for [its] intended uses in operations, decision making
and planning"

Data Profiling: To examine the data health initially for modelling & integration process design purposes. (Not an
assessment task to improve data quality, more to understand data)

Data Validation

Data Validation: To validate incoming data against a set of pre-defined data quality validation rules as per the data requirement needs. In case the validation fails, reject and log the error record; this record will not be processed any further. (i.e. does not correct data)

Data Conversion/Transformation: To convert incoming data based on the rule/condition defined based on the
requirement. (i.e. Corrects data)

Data Quality – Validation Rules

The data validation rules are listed below.

DQ Rule Type | DQ Rule Name | Details

Validation | Null Check | Not null
Validation | Lookup Check (Reference or Master data) | Referential validation of input data against Master Data
Validation | Numeric Check |
Validation | Dupe Check (single or multiple columns) | Duplicate checks
Validation | Value (greater than, equal to, less than, not equal to, contains, begins with, ends with) |
Validation | Date/Time Check | Incorrect data formats, such as dates
Validation | Length (greater than, equal to, less than, not equal to) | Incorrect field lengths
Validation | Email Validation (contains @) |
Validation | Phone Number Validation (is numeric) |
Validation | Custom Formula |
Conversion | Change Case (Upper, Lower, Proper/Capitalize) |
Conversion | Change String to Date |
Conversion | Change String to Number |
Conversion | Change based on Lookup |
Conversion | Trimming data (LTRIM/RTRIM) | trim, ltrim, rtrim
Conversion | Remove leading zeroes | strip leading zeros
Conversion | Pad with zeroes | pad with zeros
Conversion | Custom Formula |
Additional Checks | RI checks / Regular expressions |
Additional Checks | Cross column checks | e.g. Max(columnA, columnB) < columnC
Additional Checks | Closed list checks, Computed fields |
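As an illustration only (not the standard framework), the following is a minimal Spark/Scala sketch of a null check and a lookup check; the DataFrames inputDF and materialMasterDF and the key column material_code are hypothetical names.

import org.apache.spark.sql.functions.{broadcast, col}

// Null check: route records with a missing key to an error DataFrame
val nullErrorDF = inputDF.filter(col("material_code").isNull)
val notNullDF   = inputDF.filter(col("material_code").isNotNull)

// Lookup check: keep only records whose key exists in the master data
val validDF       = notNullDF.join(broadcast(materialMasterDF), Seq("material_code"), "left_semi")
val lookupErrorDF = notNullDF.join(broadcast(materialMasterDF), Seq("material_code"), "left_anti")

// validDF is processed further; nullErrorDF and lookupErrorDF are logged and rejected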

High level Implementation Process:

Refer to the high level flow diagrams for the UDL data validation implementation strategy:


Refer to the high level flow diagrams for the BDL data validation implementation strategy:

These are the reference validation rules in ADLS.


Recommendations:

Data correction should always happen at the source side only and not in the Data Lake, apart from standardization of date/time, decimal values etc.

To maintain consistency, NULLs/blanks/spaces should be converted to the same value, for example "UNKNOWN", based on the requirement (see the sketch after this list).

The action and correction strategy is to be finalized by the Project/Function team.

In BDL only logical checks, such as RI checks and lookup checks, should happen.

A mechanism to correct the unknowns should be put in place.

Business specific rules should be agreed as part of the functional specification document.
Standard ways of failing the pipeline and communicating to the User/AM team should be defined.
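A minimal Spark/Scala sketch of the NULL/blank standardization recommendation above, assuming hypothetical column names:

import org.apache.spark.sql.functions.{col, trim, when}

val standardizedDF = df
  // Replace NULLs with the agreed placeholder value
  .na.fill("UNKNOWN", Seq("customer_segment", "channel"))
  // Replace blank/space-only values with the same placeholder
  .withColumn("channel",
    when(trim(col("channel")) === "", "UNKNOWN").otherwise(col("channel")))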

Data Profiling Rules

Data Profiling is to examine the data health initially for modeling & integration process design purposes. (Not an
assessment task to improve data quality, more to understand data)

For each data column/field, generate a profile of the dataset or data domain, primarily (a minimal sketch follows the list):

Data Type (text, date, numeric)


Total number of values
Total Unique values
Null values and %
Min/Max Length
Min/Max Value
Distinct Values etc
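A minimal Spark/Scala profiling sketch (not the standard profiling framework), assuming a hypothetical input DataFrame df:

import org.apache.spark.sql.functions._

val totalRows = df.count()

// One aggregate row per column: non-null count, null count, distinct count, min/max length
val profileDF = df.columns.map { c =>
  df.agg(
    count(col(c)).as("non_null_values"),
    countDistinct(col(c)).as("distinct_values"),
    min(length(col(c).cast("string"))).as("min_length"),
    max(length(col(c).cast("string"))).as("max_length")
  )
  .withColumn("column_name", lit(c))
  .withColumn("total_values", lit(totalRows))
  .withColumn("null_values", lit(totalRows) - col("non_null_values"))
}.reduce(_ union _)

profileDF.show(truncate = false)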

Archival and Purging


Archival is the process of maintaining historical data based on policies such as data retention, i.e. the time until which the data has to be kept in the archive folder. Once the data crosses the retention period, it can be purged. The data copied from the source systems to the Raw folder and then processed further is archived once it has been processed.

Purging

The files in the archival folder should adhere to the defined retention policy. Files crossing the defined retention period should be deleted permanently. The Azure Data Factory pipeline that purges the data from the different folders reads the retention period from a pipeline parameter that needs to be assigned when triggering it.

The pipeline can be scheduled at any frequency on an as-needed basis. Ideally the purge process pipeline should be executed daily. On each run, the pipeline identifies and permanently deletes the files that have crossed the retention period.
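The snippet below is a minimal illustrative sketch of such a purge step (not the official pipeline), using the Hadoop FileSystem API from a Databricks notebook; the archive path and the retention value are hypothetical and would normally come from the pipeline parameter.

import org.apache.hadoop.fs.Path

val archivePath   = "/mnt/udl/archive/sales_orders"   // hypothetical archive folder
val retentionDays = 90                                // would be supplied as a pipeline parameter
val cutoffMillis  = System.currentTimeMillis() - retentionDays.toLong * 24L * 60L * 60L * 1000L

val fs = new Path(archivePath).getFileSystem(spark.sparkContext.hadoopConfiguration)

fs.listStatus(new Path(archivePath))
  .filter(_.getModificationTime < cutoffMillis)         // older than the retention period
  .foreach(status => fs.delete(status.getPath, true))   // permanent, recursive delete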

MCS Framework Reference

Refer to the document "Unilever_DataIngestion-LLD_v1.0.docx" to understand the MCS framework. Note that this document is owned by the UDL team.


Deletion Logic

Ingestion Type

Full Load: Erase the complete contents of one or more tables and reload all data every time.
There is no need to implement deletion logic for a Full Load.
Incremental Load: Apply only ongoing changes to one or more tables based on predefined requirements.
If the source provides deletion flag(s), the same should be used by the downstream systems.
If the source provides only a primary key, UDL will implement deletion logic using UDL_Flag (I - Insert, U - Update, D - Delete only).
Primary keys used in UDL/BDL/PDS should be the same as in the source system.

Deletion logic implementation

Hard vs soft delete should be driven by business requirements.

For shareable data, it is recommended to use soft delete, as this gives downstream consumers the ability to identify deleted data.
Based on business requirements, PDS can decide whether to hard or soft delete records.
Soft delete will result in additional storage volume, as UDL/BDL will keep a copy of the deleted data.
It is recommended to have uniformity in the implementation of deletion logic. Using both hard and soft delete might lead to confusion.
It is recommended to use ADB delta tables across all layers.
BDL should read from the UDL delta table based on the UDL_Flag and the processed/updated date from UDL (a minimal sketch follows this list).
PDS should read from the BDL delta table based on the BDL_Flag and the processed/updated date from BDL.
Archival of data should be performed to manage data volume.
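A minimal Delta Lake merge sketch of this pattern; the paths, the business key column pk and the watermark lastProcessedDate are hypothetical, and the actual implementation is owned by the UDL/BDL teams.

import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.col

val lastProcessedDate = "2020-08-01"   // hypothetical watermark tracked by the BDL job

val udlIncrementDF = spark.read.format("delta")
  .load("/mnt/udl/processed/sales_orders")               // hypothetical UDL delta table
  .filter(col("processed_date") > lastProcessedDate)

DeltaTable.forPath(spark, "/mnt/bdl/sales_orders").as("bdl")   // hypothetical BDL delta table
  .merge(udlIncrementDF.as("udl"), "bdl.pk = udl.pk")
  .whenMatched("udl.UDL_Flag = 'D'").delete()        // hard delete; update a flag column instead for soft delete
  .whenMatched().updateAll()
  .whenNotMatched("udl.UDL_Flag != 'D'").insertAll()
  .execute()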

Scheduling timing

UDL, BDL, PDS data ingestion schedules should be aligned as per business requirements.
BDL and PDS MUST take into consideration dependency on UDL data ingestion jobs. Please refer Section
3.8 - Job Management

Notes

1. Micro batch deletion logic is not in scope of the batch deletion logic explained above.
   a. UDL WILL NOT provide deletion logic for the micro batch landing folder. To implement deletion for micro batch there are 2 options:
      i. UDL will provide deletion logic in the processed parquet in UDL. BDL can use the UDL_Flag from the processed parquet on a 24-hour frequency (rows highlighted in Table 1 below) to delete records from micro batch objects in BDL.
      ii. The deletion logic needs to be handled in the end applications, as it does not make sense to scan through the complete history every 15 minutes to mark a record as deleted.
2. UDL/BDL to publish the deletion flag and deletion logic for each source for consumption by downstream applications.

Table 1

Object 1 | MicroBatch 15 minute | No deletion logic
Object 1 | Landed per day | No deletion logic
Object 1 | Processed Parquet per day | Deletion logic
Object 2 | MicroBatch 15 minute | No deletion logic
Object 2 | Landed per day | No deletion logic
Object 2 | Processed Parquet per day | Deletion logic


Section 2 - Approved Components


Following is the list of Approved PAAS Components and their uses.

Note: Anything not explicitly approved below i.e. components mentioned as “Approved (case by case basis)”
requires I&A Architect review and approval

Process Type | Component | Approval Status | Approved for | Link

Storage | Data Lake Store (Gen1 & Gen2) | Approved | PAAS tool used for data storage | Section 2.4 - Azure Data Lake Storage (ADLS)
Storage | Blob Storage | Approved | Used for external data ingestion and for component logs | Section 2.3 - Azure BLOB Storage
Database | SQL Data Warehouse (Gen2) | Approved | MPP database for data > 50 GB | Section 2.6 - Azure Data Warehouse / Azure Synapse Analytics
Database | SQL Database | Approved | Approved only for metadata capture and for data of volume < 50 GB | TBA
In Memory | Azure Analysis Services | Approved | Used for faster report response (in-memory limit of 400 GB) | Section 2.5 - Azure Analysis Services
Compute | Databricks | Approved | Compute. Processing engine (aggregation, KPI, business logic) | Section 2.1 - Azure Databricks (ADB)
Visualization | Power BI | Approved | Dashboard reporting and self-service performance | Power BI
Visualization | Web App | Approved | Approved for user interaction reports with analytical models where customization is not supported in PBI. Only Windows based .Net or Node JS core approved. | -
Visualization | Power App | Approved (Case by Case Basis) | Customization for dashboards | -
Ingestion and Scheduling | Azure Data Factory (V2) | Approved | Job scheduling/orchestration in PDS; integration with different source systems for data ingestion into UDL; ADF Data Flow for ETL | Section 2.2 - Azure Data Factory V2
Security & Access Control | Azure Key Vault | Approved (Case by Case Basis) | Credential management. Provided to all projects by default | TBA
Security & Access Control | Azure Active Directory | Approved | Access control | TBA
Data Science | Databricks | Approved | Data science models using Spark ML, Spark R and PySpark | TBA
Data Science | Azure ML Studio (V1) | Approved (Case by Case Basis) | Older version of Studio allowed for limited use cases | TBA
Data Science | Azure ML Services | Approved | UI tool and compute for Python, R and data science models. Approved for Visual Interface (ML Studio V2) as well | Section 2.7 - Azure Machine Learning
Code Management and CI/CD | Azure DevOps | Approved | Approved for code repo and continuous deployment | TBA
Others | Logic Apps | Approved (Case by Case Basis) | Approved for alerting, monitoring and job triggering | Section 2.10 - Azure Logic App
Others | Automation Account | Approved (Case by Case Basis) | Automated pause and resume of components (used centrally through webhooks). Component not approved for individual projects | TBA
Others | Log Analytics | Approved (Case by Case Basis) | Approved for logging and monitoring | Section 2.8 - Azure Monitor & Log Analytics
Others | HDInsight | Approved (Case by Case Basis) | Limited to Kafka. Not suggested as Spark compute | TBA
Others | SSIS PAAS | Approved (Case by Case Basis) | Limited to on-premise SQL migration | TBA
Others | Batch Account, Functions, Service Bus, Azure Monitor | Approved (Case by Case Basis) | Use case by case basis, as agreed with the I&A Tech Architect | TBA
New Components | Azure Cache for Redis | Approved (Case by Case Basis) | Provided only as exceptional approval, considering these components are used for specific cases | Section 2.12 - Azure Cache for Redis
New Components | Azure Search | Approved (Case by Case Basis) | Provided only as exceptional approval, considering these components are used for specific cases | Section 2.14 - Azure Search
Refer to the Microsoft link below to understand component limitations:

https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/azure-subscription-service-limits

Components being reviewed (Pilot and Security approval pending)

1. Azure Cognitive services


2. Azure API management gateway.


Section 2.1 - Azure Databricks (ADB)


Overview
Databricks Premium vs Standard Workspace
When to use Databricks Premium
When to use Databricks Standard
Access management on Databricks Workspaces

Overview

Azure Databricks (ADB) will be the processing service used to transform and process source data and get it into an enriched, business-useful form. Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. ADB has auto-scaling and auto-termination (like a pause/resume), has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark.

ADB will be used as the primary processing engine for all forms of data (structured, semi-structured). It will be used to perform delta processing (with the use of Databricks Delta), data quality checks and enrichment of this data with business KPIs and other business logic. Databricks is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming and the Machine Learning library (MLlib).

Architecture Standard: Scala should be used as scripting language for Spark for DQ and delta processing etc.
PySpark or SparkR should be used for Analytics

Databricks Premium vs Standard Workspace

Databricks Premium workspace allows you to:

Restrict the use of secret scope to the creator. Without this, any user of the workspace will be able to use the
secret scope and read the secrets from KeyVault provided:
The user knows the name of the secret scope
The user knows the secret names in the KeyVault
Collect Databricks diagnostic logs. This option is not available in standard workspaces. If I’m not wrong, UDL
and SC BDL teams have started redirecting these logs to log analytics
User access management. Premium workspaces allow you to decide who has admin permissions on the
workspace. Without this, everyone who has access to the workspace is admin
Cluster management. With premium, you can create a cluster and control who has access to it. You can
restrict users who are not admins from creating or editing clusters.
AD credential passthrough. With premium workspaces you can create clusters with passthrough access. You
can also create mountpoints that work with passthrough credentials.
And a few more features that are not used in Unilever.

When to use Databricks Premium

As a general guideline, the premium workspace should be used when:

The workspace is in a production environment


AD Credential Passthrough is required. Some experimentation environments will need this
A TDA architect has specified it in the architecture design

When to use Databricks Standard

For all Databricks Standard Workspaces, Landscape will make sure that the secret scope is deleted after creation
of mountpoints (if any). This will prevent the workspace users from accessing the secret scope and hence the
secrets in the KeyVault.


All non-prod workspaces should use Standard from now on.

To migrate from Premium to Standard, work with DevOps:
    the new workspace will have a new name
    the parameter file will need regeneration

Any secret that is required at run time should be stored in a Databricks-backed secret scope (see the sketch below).
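A minimal notebook sketch, with illustrative scope and key names:

// Read a credential from a Databricks-backed secret scope at run time
// (scope and key names below are illustrative only)
val storageKey = dbutils.secrets.get(scope = "project-secret-scope", key = "adls-account-key")

// The returned value is redacted in notebook output and should only be passed to
// configuration calls, never printed or written to storage.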

Access management on Databricks Workspaces

Until 2019 the users were granted contributor permission on the Databricks instance on Azure portal. Starting April
2020, Landscape runs a script every day that adds/removes users from the workspace based on the users in the
dev/test/support AD user groups maintained by the project teams.


Design Standards
General Guidelines
Workspace Standards
Spark Style Guide
Automated Code Formatting Tools
Variables
Chained Method Calls
Spark SQL
Columns
Immutable Columns
Open Source
User Defined Functions
Custom transformations
Naming conventions
Schema Dependent DataFrame Transformations
Schema Independent DataFrame Transformations
What type of DataFrame transformation should be used
Null
JAR Files
Documentation
Column Functions
DataFrame Transformations
Testing
Cluster Configuration Standards
Cluster Sizing Starting Points
Different Azure Instance Types
Recommended VM Family
Recommended VM Family Series
Choose cluster VMs to match workload class
Arrive at correct cluster size by iterative performance testing
Cluster Tags
To configure cluster tags:

GENERAL GUIDELINES

While reading the input file, use repartition(sc.defaultParallelism * 2) to increase read performance, as option("multiline", "true") will otherwise use only one executor (see the sketch after this list).
Follow coding standards (either consistently use uppercase letters or only lowercase letters).
Calibrate the execution times of each command to analyse notebook performance.
Provide a comment line for commands/cells to highlight their function.
Utilize the Spark optimization techniques available.
Use the available options to invoke Spark parallelism and try to avoid commands that use extensive memory, CPU time and shuffling.
Avoid dbutils file operations such as dbutils.fs.cp (copy command) where possible.
Use Databricks Delta wherever possible.
Consider using val for variable assignment wherever possible instead of var.
Consider chunking the data and providing the chunks as input to Spark.
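A minimal sketch of the multiline-read guideline above (the path is illustrative):

// A multiline JSON read is executed by a single executor, so repartition after reading
val ordersDF = spark.read
  .option("multiline", "true")
  .json("/mnt/udl/landed/orders/")                         // hypothetical path
  .repartition(spark.sparkContext.defaultParallelism * 2)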


WORKSPACE STANDARDS

UDL should have a workspace per functional area to ensure metastore limits are not reached. In the new foundation design, UDL should have workspaces specifically for CD, SC, Finance, HR etc. instead of one single workspace for all functional areas. Similarly, if projects have too many jobs running, the suggestion is to use multiple workspaces, as the Hive metastore has a limit of 250 connections.

SPARK STYLE GUIDE

Automated Code Formatting Tools

Scalafmt and scalariform are automated code formatting tools. scalariform's default settings format code similarly to the Databricks scala-style-guide and are a good place to start. The sbt-scalariform plugin automatically reformats code upon compile and is the best way to keep code formatted consistently without thinking about it. Here are some scalariform settings that work well with Spark code.

SbtScalariform.scalariformSettings

ScalariformKeys.preferences := ScalariformKeys.preferences.value

.setPreference(DoubleIndentConstructorArguments, true)

.setPreference(SpacesAroundMultiImports, false)

.setPreference(DanglingCloseParenthesis, Force)

Variables

Variables should use camelCase. Variables that point to DataFrames, Datasets, and RDDs should be suffixed
accordingly to make your code readable:

Variables pointing to DataFrames should be suffixed with DF (following conventions in the Spark
Programming Guide)

peopleDF.createOrReplaceTempView("people")

Variables pointing to Datasets should be suffixed with DS

val stringsDS = sqlDF.map {
  case Row(key: Int, value: String) => s"Key: $key, Value: $value"
}

Variables pointing to RDDs should be suffixed with RDD

val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

Use the variable col for Column arguments.

def min(col: Column)

Use col1 and col2 for methods that take two Column arguments.

def corr(col1: Column, col2: Column)

Use cols for methods that take an arbitrary number of Column arguments.

def array(cols: Column*)

For methods that take column name String arguments, follow the same pattern and use colName, colName1,
colName2, and colNames as variables.


Collections of things should use plural variable names.

var animals = List("dog", "cat", "goose")

// DONT DO THIS

var animalList = List("dog", "cat", "goose")

Singular nouns should be used for single objects.

val myCarColor = "red"

Chained Method Calls

Spark methods are often deeply chained and should be broken up on multiple lines.

jdbcDF.write

.format("jdbc")

.option("url", "jdbc:postgresql:dbserver")

.option("dbtable", "schema.tablename")

.option("user", "username")

.option("password", "password")

.save()

Here's an example of a well formatted extract:

val extractDF = spark.read.parquet("someS3Path")
  .select(
    "name",
    "Date of Birth"
  )
  .transform(someCustomTransformation())
  .withColumnRenamed("Date of Birth", "date_of_birth")
  .filter(
    col("date_of_birth") > "1999-01-02"
  )

Spark SQL

Use multiline strings to write properly indented SQL code:

val coolDF = spark.sql("""

select

`first_name`,

`last_name`,

`hair_color`


from people

""")

Columns

Columns have name, type, nullable, and metadata properties.

Columns that contain boolean values should use predicate names like is_nice_person or has_red_hair. Use
snake_case for column names, so it's easier to write SQL code.

You can write (col("is_summer") && col("is_europe")) instead of (col("is_summer") === true && col("is_europe")
=== true). The predicate column names make the concise syntax nice and readable.

Columns should be typed properly. Don't overuse StringType in your schema.

Columns should only be nullable if null values are allowed. Code written for nullable columns should always
address null values gracefully.

Use acronyms when needed to keep column names short. Define any acronyms used at the top of the data file, so
other programmers can follow along.

Use the following shorthand notation for columns that perform comparisons.

gt: greater than


lt: less than
leq: less than or equal to
geq: greater than or equal to
eq: equal to
between

Here are some example column names:

player_age_gt_20
player_age_gt_15_leq_30
player_age_between_13_19
player_age_eq_45

Immutable Columns

Custom transformations shouldn't overwrite an existing field in a schema during a transformation. Add a new
column to a DataFrame instead of mutating the data in an existing column.

Suppose you have a DataFrame with name and nickname columns and would like a column that coalesces the
name and nickname columns.

+-----+--------+

| name|nickname|

+-----+--------+

| joe| null|

| null| crazy|

|frank| bull|

+-----+--------+


Don't overwrite the name field and create a DataFrame like this:

+-----+--------+

| name|nickname|

+-----+--------+

| joe| null|

|crazy| crazy|

|frank| bull|

+-----+--------+

Create a new column, so existing columns aren't changed and column immutability is preserved.

+-----+--------+---------+

| name|nickname|name_meow|

+-----+--------+---------+

| joe| null| joe|

| null| crazy| crazy|

|frank| bull| frank|

+-----+--------+---------+
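For example, the name_meow column shown above can be added (rather than overwriting name) with a sketch like this:

import org.apache.spark.sql.functions.{coalesce, col}

// Add a new column; the original name and nickname columns are left untouched
val withNameMeowDF = df.withColumn("name_meow", coalesce(col("name"), col("nickname")))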

Open Source

You should write generic open source code whenever possible. Open source code is easily reusable (especially
when it's uploaded to Spark Packages / Maven Repository) and forces you to design code without business logic.

The org.apache.spark.sql.functions class provides some great examples of open source functions.

The Dataset and Column classes provide great examples of code that facilitates DataFrame transformations.

User Defined Functions

Spark lets you define custom SQL functions called user defined functions (UDFs). UDFs are great when built-in SQL functions aren't sufficient, but should be used sparingly because they're not performant. If you need to write a UDF, make sure to handle the null case, as this is a common cause of errors.
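A minimal sketch of a null-safe UDF, wrapping the input in Option so that null rows return null instead of throwing (the column and function names are illustrative):

import org.apache.spark.sql.functions.{col, udf}

// Returning Option[String] lets Spark map None back to null for null inputs
val upperUdf = udf((s: String) => Option(s).map(_.toUpperCase))

// Usage: df.withColumn("name_upper", upperUdf(col("name")))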

Custom transformations

Use multiple parameter lists when defining custom transformations, so you can chain your custom transformations
with the Dataset#transform method. You should disregard this advice from the Databricks Scala style guide: "Avoid
using multiple parameter lists. They complicate operator overloading, and can confuse programmers less familiar
with Scala."

You need to use multiple parameter lists to write awesome code like this:

def withCat(name: String)(df: DataFrame): DataFrame = {
  df.withColumn("cats", lit(s"$name meow"))
}

The withCat() custom transformation can be used as follows:


val niceDF = df.transform(withCat("puffy"))

Naming conventions

with precedes transformations that add columns:


withCoolCat() adds the column cool_cat to a DataFrame
withIsNicePerson adds the column is_nice_person to a DataFrame.
filter precedes transformations that remove rows:
filterNegativeGrowthPath() removes the data rows where the growth_path column is negative
filterBadData() removes the bad data
enrich precedes transformations that clobber columns. DataFrame transformations should not be clobbered
and enrich transformations should ideally never be used.
explode precedes transformations that add rows to a DataFrame by "exploding" a row into multiple rows.

Schema Dependent DataFrame Transformations

Schema dependent DataFrame transformations make assumptions about the underlying DataFrame schema.
Schema dependent DataFrame transformations should explicitly validate DataFrame dependencies to make the
code and error messages more readable.

The following withFullName() DataFrame transformation assumes that the underlying DataFrame has first_name
and last_name columns.

def withFullName()(df: DataFrame): DataFrame = {
  df.withColumn(
    "full_name",
    concat_ws(" ", col("first_name"), col("last_name"))
  )
}

You should use spark-daria to validate the schema requirements of a DataFrame transformation.

def withFullName()(df: DataFrame): DataFrame = {
  validatePresenceOfColumns(df, Seq("first_name", "last_name"))
  df.withColumn(
    "full_name",
    concat_ws(" ", col("first_name"), col("last_name"))
  )
}

See this blog post for a detailed description on validating DataFrame dependencies.

Schema Independent DataFrame Transformations

Schema independent DataFrame transformations do not depend on the underlying DataFrame schema, as
discussed in this blog post.

def withAgePlusOne(
  ageColName: String,
  resultColName: String
)(df: DataFrame): DataFrame = {
  df.withColumn(resultColName, col(ageColName) + 1)
}

What type of DataFrame transformation should be used

Schema dependent transformations should be used for functions that rely on a large number of columns or functions that are only expected to be run on a certain schema (e.g. a data lake with a schema that doesn't change).

Schema independent transformations should be run for functions that will be run on a variety of DataFrame
schemas.

Null

null should be used in DataFrames for values that are unknown, missing, or irrelevant.

Spark core functions frequently return null and your code can also add null to DataFrames (by returning None or
explicitly returning null).

In general, it's better to keep all null references out of code and use Option[T] instead. Option is a bit slower, and explicit null references may be required for performance sensitive code. Start with Option and only use explicit null references if Option becomes a performance bottleneck.

The schema for a column should set nullable to false if the column should not take null values.

JAR Files

JAR files should be named like this:

spark-testing-base_2.11-2.1.0_0.6.0.jar

Generically:

spark-testing-base_scalaVersion-sparkVersion_projectVersion.jar

If you're using sbt assembly, you can use the following line of code to build a JAR file using the correct naming conventions.

assemblyJarName in assembly := s"${name.value}_${scalaBinaryVersion.value}-${sparkVersion.value}_${version.value}.jar"

If you're using sbt package, you can add this code to your build.sbt file to generate a JAR file that follows the naming conventions.

artifactName := { (sv: ScalaVersion, module: ModuleID, artifact: Artifact) =>
  artifact.name + "_" + sv.binary + "-" + sparkVersion.value + "_" + module.revision + "." + artifact.extension
}

Documentation

The following documentation guidelines generally copy the documentation in the Spark source code. For example,
here's how the rpad method is defined in the Spark source code.

/**

*Right-pad the string column with pad to a length of len. If the string column is longer


*than len, the return value is shortened to len characters.

*@group string_funcs

*@since 1.5.0

*/

def rpad(str: Column, len: Int, pad: String): Column = withExpr {
  StringRPad(str.expr, lit(len).expr, lit(pad).expr)
}

Here's an example of the Column#equalTo() method that contains an example code snippet.

/**

*Equality test.

*{{{

*// Scala:

*df.filter( df("colA") === df("colB") )

*// Java

*import static org.apache.spark.sql.functions.*;

*df.filter( col("colA").equalTo(col("colB")) );

*}}}

*@group expr_ops

*@since 1.3.0

*/

def equalTo(other: Any): Column = this === other

The @since annotation should be used to document when features are added to the API.

The @note annotation should be used to detail important information about a function, like the following example.

/**

*Aggregate function: returns the level of grouping, equals to

*{{{

* (grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + ... + grouping(cn)

*}}}


*@note The list of columns should match with grouping columns exactly, or empty (means all the

*grouping columns).

*@group agg_funcs

*@since 2.0.0

*/

def grouping_id(cols: Column*): Column = Column(GroupingID(cols.map(_.expr)))

Column Functions

Column functions should be annotated with the following groups, consistent with the Spark functions that return
Column objects.

@groupname udf_funcs UDF functions

@groupname agg_funcs Aggregate functions

@groupname datetime_funcs Date time functions

@groupname sort_funcs Sorting functions

@groupname normal_funcs Non-aggregate functions

@groupname math_funcs Math functions

@groupname misc_funcs Misc functions

@groupname window_funcs Window functions

@groupname string_funcs String functions

@groupname collection_funcs Collection functions

@groupname Ungrouped Support functions for DataFrames

Here's an example of a well-documented Column function in the spark-daria project.

/**

*Removes all whitespace in a string

*@group string_funcs

*@since 2.0.0

*/

def removeAllWhitespace(col: Column): Column = {
  regexp_replace(col, "\\s+", "")
}

DataFrame Transformations


Custom transformations can add/remove rows and columns from a DataFrame. DataFrame transformation
documentation should specify how the custom transformation is modifying the DataFrame and list the name of
columns added to the DataFrame as appropriate.

Testing

Use the spark-fast-tests library for writing DataFrame / Dataset / RDD tests with Spark. spark-testing-base should
be used for streaming tests.

Read this blog post for a gentle introduction to testing Spark code, this blog post on how to design easily testable
Spark code, and this blog post on how to cut the run time of a Spark test suite.

Instance methods should be preceded with a pound sign (e.g. #and) and static methods should be preceded with a
period (e.g. .standardizeName) in the describe block. This follows Ruby testing conventions.

Here is an example of a test for the #and instance method defined in the functions class as follows:

class FunctionsSpec extends FunSpec with DataFrameComparer {

  import spark.implicits._

  describe("#and") {

    it("returns true if both columns are true") {
      // some code
    }

  }

}

Here is an example of a test for the .standardizeName static method:

describe(".standardizeName") {

it("consistenly formats the name") {

// some code

CLUSTER CONFIGURATION STANDARDS

Cluster Sizing Starting Points

Few General Rules

Fewer big instances rather than more small instances
    Reduces network shuffle; Databricks has 1 executor per machine
    Applies mainly to batch ETL (for streaming, one could start with smaller instances depending on the complexity of the transformation)
    Not set in stone, and the reverse would make sense in many cases, so the sizing exercise matters
Size based on the number of tasks initially, tweak later
    Run the job on a small cluster to get an idea of the number of tasks (use 2-3x tasks per core for base sizing); a worked example follows below
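As an illustrative example only: if a trial run shows stages of roughly 400 tasks, sizing at 2-3 tasks per core suggests about 130-200 cores, i.e. roughly 8-12 DS5v2 workers (16 cores each), which is then refined by iterative performance testing.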

Different Azure Instance Types


Compute Optimized
    Fs: Haswell processor (Skylake not supported yet); 1 core ~ 2GB RAM; SSD storage: 1 core ~ 16GB
    H: High-performance; 1 core ~ 7GB RAM; SSD storage: 1 core ~ 125GB
Memory Optimized
    DSv2: Haswell processor; 1 core ~ 7GB RAM; SSD storage: 1 core ~ 14GB
    ESv3: High-performance (Broadwell processor); 1 core ~ 8GB RAM; SSD storage: 1 core ~ 16GB; price: .156
Storage Optimized
    L: 1 core ~ 8GB RAM; SSD storage: 1 core ~ 170GB
General Purpose
    DSv2 and DSv3: DSv2 - 1 core ~ 3.5GB RAM, DSv3 - 1 core ~ 4GB RAM; SSD storage: DSv2 - 1 core ~ 7GB, DSv3 - 1 core ~ 8GB

Projects should use only the recommended VM families and VM types below for running all their project workloads. Choose based on the workload; the sizing exercise is key to identifying which cluster works best.
Recommended VM Family

UDL Env Type VM Family Series

Dev, QA General Purpose Dsv2, Dsv3

UAT, PPD, PROD General Purpose Dsv2, Dsv3

UAT, PPD, PROD Memory Optimized Dsv2 Memory Optimized

UAT, PPD, PROD Compute Optimized Fsv2

Recommended VM Family Series

GP – General Purpose, CO – Compute Optimized, MO – Memory Optimized

Env | Type | VM Family Series | VM Type

Dev, QA | GP – Small | Dsv2 | Ds3v2

Dev, QA | GP – Medium | Dsv2, Dsv3 | Ds4v2, D8sv3

Dev, QA | GP – Large | Dsv2, Dsv3 | Ds4v2, D8sv3

Dev, QA | GP – XL | Dsv2, Dsv3 | Ds5v2, D16sv3

PPD, PROD | GP – Small | Dsv2 | Ds3v2

PPD, PROD | GP – Medium | Dsv2, Dsv3 | Ds4v2, D8sv3

PPD, PROD | GP – Large | Dsv2, Dsv3 | Ds4v2, D8sv3

PPD, PROD | GP – XL | Dsv2, Dsv3 | Ds5v2, D16sv3

PPD, PROD | CO – Small | Fsv2 | Fs4v2

PPD, PROD | CO – Medium | Fsv2 | F8sv2

PPD, PROD | CO – Large | Fsv2 | F8sv2

PPD, PROD | CO – XL | Fsv2 | F16sv2

PPD, PROD | MO – Small | Dsv2 memory optimized | Ds12v2

PPD, PROD | MO – Medium | Dsv2 memory optimized | Ds13v2


PPD, PROD | MO – Large | Dsv2 memory optimized | Ds13v2

PPD, PROD | MO – XL | Dsv2 memory optimized | Ds14v2

CHOOSE CLUSTER VMS TO MATCH WORKLOAD CLASS

Impact: High

To allocate the right amount and type of cluster resource for a job, we need to understand how different types of
jobs demand different types of cluster resources.

Machine Learning - Training machine learning models usually requires caching all of the data in memory.
Consider using memory optimized VMs so that the cluster can take advantage of the RAM cache. To size
the cluster, cache a percentage of the data set, see how much memory it uses, and extrapolate that to the rest of
the data. The Tungsten serializer optimizes data in memory, which means you will need to test the data to see
the relative magnitude of compression.
Streaming - You need to make sure that the processing rate stays just above the input rate at peak times of the
day. Depending on peak input rate times, consider compute optimized VMs for the cluster to make sure the
processing rate is higher than your input rate.
ETL - In this case, data size and how fast the job needs to finish are the leading indicators. Spark doesn't
always require data to be loaded into memory in order to execute transformations, but you will at the very
least need to see how large the task sizes are on shuffles and compare that to the task throughput you would like.
To analyze the performance of these jobs, start with the basics and check whether the job is bound by CPU, network,
or local I/O, and go from there. Consider using a general purpose VM for these jobs.
Interactive / Development Workloads - The ability for a cluster to auto scale is most important for these types
of jobs. Azure Databricks has a cluster manager and Serverless clusters to optimize the size of the cluster during
peak and low times. In this case, taking advantage of Serverless clusters and Autoscaling will be your best
friend in managing the cost of the infrastructure.

ARRIVE AT CORRECT CLUSTER SIZE BY ITERATIVE PERFORMANCE TESTING

Impact: High

It is impossible to predict the correct cluster size without developing the application because Spark and Azure
Databricks use numerous techniques to improve cluster utilization. The broad approach you should follow for sizing
is:

1. Develop on a medium sized cluster of 2-8 nodes, with VMs matched to workload class as explained earlier.
2. After meeting functional requirements, run end to end test on larger representative data while measuring
CPU, memory and I/O used by the cluster at an aggregate level.
3. Optimize cluster to remove bottlenecks found in step 2:
a. CPU bound: add more cores by adding more nodes
b. Network bound: use fewer, bigger SSD backed machines to reduce network size and improve remote
read performance
c. Disk I/O bound: if jobs are spilling to disk, use VMs with more memory.
4. Repeat steps 2 and 3 by adding nodes and/or evaluating different VMs until all obvious bottlenecks have
been addressed.

Performing these steps will help you to arrive at a baseline cluster size which can meet SLA on a subset of data.
Because Spark workloads exhibit linear scaling, you can arrive at the production cluster size easily from here. For
example, if it takes 5 nodes to meet SLA on a 100TB dataset, and the production data is around 1PB, then prod
cluster is likely going to be around 50 nodes in size.

CLUSTER TAGS


Cluster tags allow you to easily monitor the cost of cloud resources used by various groups in the organization. You
can specify tags as key-value pairs when you create a cluster, and Azure Databricks applies these tags to cloud
resources like VMs and disk volumes.

Cluster tags propagate to these cloud resources along with pool tags and workspace (resource group) tags. Azure
Databricks applies four default tags to each cluster: Vendor, Creator, ClusterName, and ClusterId. TDA
recommends adding the below Cluster Tags as a mandatory process for all Databricks clusters in use for a project.

Tag Name | Purpose

BusinessPlatform | Name of the Platform, e.g. Supply Chain, Marketing, Finance.

ServiceName | Name of your Service as in CMDB, e.g. Supply Chain BDL, Livewire Europe.

CostCenter | Cost center for the Service.

ICC | International Cost Code for the Service.

ProjectSupportTeam | If there is a group email for your team, please put that here. Otherwise, put the name of the person who administers the cluster. For cases where the cluster is managed by DevOps pipelines, put the group email for the DevOps team. If a group email doesn't exist, put the name of the Unilever colleague who is responsible for the cluster.

Environment | Choose from the following: Dev, QA, UAT, Pre-prod, Prod, Experiment.

Purpose | Describe the use of the cluster. You can pick some of the following or add new descriptions: Development, Data Analysis, Logging, Historical Data Processing Workloads, Data Processing Workloads, Machine Learning, Testing, etc.

To configure cluster tags:

1. On the cluster configuration page, click the Advanced Options toggle.


2. At the bottom of the page, click the Tags tab.


3. Add a key-value pair for each custom tag as per the recommended Tags above from TDA team.
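
The same tags can also be applied programmatically when a cluster is created through the Clusters REST API. The snippet below is a minimal sketch only: the workspace URL, secret scope, runtime version and tag values are hypothetical placeholders, and the recommended tag names from the table above are passed via custom_tags.

import requests

# Minimal sketch (hypothetical values throughout). The PAT is read from a
# secret scope rather than being hard-coded in the notebook.
domain = "https://westeurope.azuredatabricks.net/api/2.0"
headers = {"Authorization": f"Bearer {dbutils.secrets.get('tags-demo-scope', 'pat')}"}

cluster_spec = {
    "cluster_name": "sc-bdl-etl-cluster",
    "spark_version": "6.4.x-scala2.11",
    "node_type_id": "Standard_DS4_v2",
    "num_workers": 4,
    "custom_tags": {                                   # TDA-recommended tags
        "BusinessPlatform": "Supply Chain",
        "ServiceName": "Supply Chain BDL",
        "CostCenter": "1234567",                       # hypothetical value
        "ICC": "ICC-0001",                             # hypothetical value
        "ProjectSupportTeam": "sc-bdl-devops@unilever.com",
        "Environment": "Dev",
        "Purpose": "Data Processing Workloads",
    },
}

resp = requests.post(f"{domain}/clusters/create", json=cluster_spec, headers=headers)
resp.raise_for_status()
print(resp.json())  # returns the cluster_id of the newly created, tagged cluster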


Databricks Best Practices


Introduction
Provisioning ADB: Guidelines for Networking and Security
Azure Databricks 101
Map Workspaces to Business Units
Deploy Workspaces in Multiple Subscriptions
ADB Workspace Limits
Azure Subscription Limits
Consider Isolating Each Workspace in its own VNet
Select the largest CIDR possible for a VNet
Do not store any production data in default DBFS folders
Always hide secrets in Key Vault and do not expose them openly in Notebooks
Developing applications on ADB: Guidelines for selecting clusters
Support Interactive analytics using shared High Concurrency clusters
Support Batch ETL workloads with single user ephemeral Standard clusters
Favor Cluster Scoped Init scripts over Global and Named scripts
Send logs to blob store instead of default DBFS using Cluster Log delivery
Choose cluster VMs to match workload class
Arrive at correct cluster size by iterative performance testing
Tune shuffle for optimal performance
Store Data In Parquet Partitions
Monitoring
Collect resource utilization metrics across Azure Databricks cluster in a Log Analytics workspace
Querying VM metrics in Log Analytics once you have started the collection using the above
document

Introduction

Planning, deploying, and running Azure Databricks (ADB) at scale requires one to make many architectural
decisions.

While each ADB deployment is unique to an organization's needs we have found that some patterns are common
across most successful ADB projects. Unsurprisingly, these patterns are also in-line with modern Cloud-centric
development best practices.

This short guide summarizes these patterns into prescriptive and actionable best practices for Azure Databricks.
We follow a logical path of planning the infrastructure, provisioning the workspaces, developing Azure Databricks
applications, and finally, running Azure Databricks in production.

The audience of this guide are system architects, field engineers, and development teams of customers, Microsoft,
and Databricks. Since the Azure Databricks product goes through fast iteration cycles, we have avoided
recommendations based on roadmap or Private Preview features.

Our recommendations should apply to a typical Fortune 500 enterprise with at least intermediate level of Azure and
Databricks knowledge. We've also classified each recommendation according to its likely impact on solution's
quality attributes. Using the Impact factor, you can weigh the recommendation against other competing choices.
Example: if the impact is classified as “Very High”, the implications of not adopting the best practice can have a
significant impact on your deployment.

As ardent cloud proponents, we value agility and bringing value quickly to our customers. Hence, we’re releasing
the first version somewhat quickly, omitting some important but advanced topics in the interest of time. We will
cover the missing topics and add more details in the next round, while sincerely hoping that this version is still
useful to you.


Provisioning ADB: Guidelines for Networking and Security

Azure Databricks (ADB) deployments for very small organizations, PoC applications, or for personal education
hardly require any planning. You can spin up a Workspace using Azure Portal in a matter of minutes, create a
Notebook, and start writing code. Enterprise-grade large scale deployments are a different story altogether. Some
upfront planning is necessary to avoid cost overruns, throttling issues, etc. In particular, you need to understand:

Networking requirements of Databricks


The number and the type of Azure networking resources required to launch clusters
Relationship between Azure and Databricks jargon: Subscription, VNet, Workspaces, Clusters, Subnets, etc.
Overall Capacity Planning process: where to begin, what to consider?

Let’s start with a short Azure Databricks 101 and then discuss some best practices for scalable and secure
deployments.

AZURE DATABRICKS 101

ADB is a Big Data analytics service. Being a Cloud Optimized managed PaaS offering, it is designed to hide the
underlying distributed systems and networking complexity as much as possible from the end user. It is backed by a
team of support staff who monitor its health, debug tickets filed via Azure, etc. This allows ADB users to focus on
developing value generating apps rather than stressing over infrastructure management.

You can deploy ADB using Azure Portal or using ARM templates. One successful ADB deployment produces
exactly one Workspace, a space where users can log in and author analytics apps. It comprises the file browser,
notebooks, tables, clusters, DBFS storage, etc. More importantly, Workspace is a fundamental isolation unit in
Databricks. All workspaces are expected to be completely isolated from each other -- i.e., we intend that no action
in one workspace should noticeably impact another workspace.

Each workspace is identified by a globally unique 53-bit number, called Workspace ID or Organization ID. The URL
that a customer sees after logging in always uniquely identifies the workspace they are using:

https://regionName.azuredatabricks.net/?o=workspaceId

Azure Databricks uses Azure Active Directory (AAD) as the exclusive Identity Provider and there’s a seamless out
of the box integration between them. Any AAD member belonging to the Owner or Contributor role can deploy
Databricks and is automatically added to the ADB members list upon first login. If a user is not a member of the
Active Directory tenant, they can’t login to the workspace.

Azure Databricks comes with its own user management interface. You can create users and groups in a
workspace, assign them certain privileges, etc. While users in AAD are equivalent to Databricks users, by default
AAD roles have no relationship with groups created inside ADB. ADB also has a special group called Admin, not to
be confused with AAD’s admin.

The first user to login and initialize the workspace is the workspace owner. This person can invite other users to the
workspace, create groups, etc. The ADB logged-in user's identity is provided by AAD and shows up under the user
menu in the Workspace.

With this basic understanding, let's discuss how to plan a typical ADB deployment. We first grapple with the issue of
how to divide workspaces and assign them to users and teams.

MAP WORKSPACES TO BUSINESS UNITS

Impact: Very High

Though partitioning of workspaces depends on the organization structure and scenarios, it is generally
recommended to partition workspaces based on a related group of people working together collaboratively. This
also helps in streamlining your access control matrix within your workspace (folders, notebooks etc.) and also


across all your resources that the workspace interacts with (storage, related data stores like Azure SQL DB, Azure
SQL DW etc.). This type of division scheme is also known as the Business Unit Subscription design pattern and
aligns well with Databricks chargeback model.

DEPLOY WORKSPACES IN MULTIPLE SUBSCRIPTIONS

Impact: Very High

Customers commonly partition workspaces based on teams or departments and arrive at that division naturally. But
it is also important to partition keeping Azure Subscription and ADB Workspace level limits in mind.

ADB Workspace Limits

Azure Databricks is a multitenant service and to provide fair resource sharing to all regional customers, it imposes
limits on API calls. These limits are expressed at the Workspace level and are due to internal ADB components. For
instance, you can only run up to 150 concurrent jobs in a workspace. Beyond that, ADB will deny your job
submissions. There are also other limits such as max hourly job submissions, etc.

Key workspace limits are:

There is a limit of 1000 scheduled jobs that can be seen in the UI.
The maximum number of jobs that a workspace can create in an hour is 1000.
At any time, you cannot have more than 150 jobs simultaneously running in a workspace.
There can be a maximum of 150 notebooks or execution contexts attached to a cluster.

Azure Subscription Limits

Next, there are Azure limits to consider since ADB deployments are built on top of the Azure infrastructure.

Key Azure limits are:

Storage accounts per region per subscription: 250


Maximum egress for general-purpose v2 and Blob storage accounts (all regions): 50 Gbps
VMs per subscription per region: 25,000.
Resource groups per subscription: 980

Due to security reasons, we also highly recommend separating the production and dev/stage environments into
separate subscriptions.

It is important to divide your workspaces appropriately across different subscriptions based on your business,
keeping the Azure limits in mind.


CONSIDER ISOLATING EACH WORKSPACE IN ITS OWN VNET

Impact: Low

While you can deploy more than one Workspace in a VNet by keeping the subnets separate, we recommend that
you follow the hub and spoke model and separate each workspace in its own VNet. Recall that a Databricks
Workspace is designed to be a logical isolation unit, and that Azure’s VNets are designed for unconstrained
connectivity among the resources placed inside it. Unfortunately, these two design goals are at odds with each
other, since VMs belonging to two different workspaces in the same VNet can therefore communicate. While this is
normally innocuous in our experience, it should be avoided as much as possible.

Select the largest CIDR possible for a VNet

Impact: High

Recall that each Workspace can have multiple clusters:

Each cluster node requires 1 Public IP and 2 Private IPs


These IPs are logically grouped into 2 subnets named "public" and "private"
For a desired cluster size of X, number of Public IPs = X, number of Private IPs = 4X
The 4X requirement for Private IPs is due to the fact that for each deployment:
Half of address space is reserved for future use
The other half is equally divided into the two subnets: private and public
The size of private and public subnets thus determines total number of VMs available for clusters.
/22 mask is larger than /23, so setting private and public to /22 will have more VMs available for
creating clusters, than say /23 or below


But, because of the address space allocation scheme, the size of private and public subnets is constrained
by the VNet’s CIDR
The allowed values for the enclosing VNet CIDR are from /16 through /24
The private and public subnet masks must be:
Equal
At least two steps down from enclosing VNet CIDR mask
Must be greater than /26

With this info, we can quickly arrive at the table below, showing how many nodes one can use across all clusters for
a given VNet CIDR. It is clear that selection of VNet CIDR has far reaching implications in terms of maximum
cluster size.

Enclosing VNet CIDR's mask where the ADB Workspace is deployed | Allowed masks on the private and public subnets (should be equal) | Max number of nodes across all clusters in the Workspace, assuming the higher subnet mask is chosen

/16 | /18 through /26 | 16000

/17 | /19 through /26 | 8000

/18 | /20 through /26 | 4000

/19 | /21 through /26 | 2000

/20 | /22 through /26 | 1024

/21 | /23 through /26 | 512

/22 | /24 through /26 | 256

/23 | /25 through /26 | 128

/24 | /26 only | 64

DO NOT STORE ANY PRODUCTION DATA IN DEFAULT DBFS FOLDERS

Impact: High

This recommendation is driven by security and data availability concerns. Every Workspace comes with a default
DBFS, primarily designed to store libraries and other system-level configuration artifacts such as Init scripts. You
should not store any production data in it, because:

1. The lifecycle of default DBFS is tied to the Workspace. Deleting the workspace will also delete default DBFS
and permanently remove its contents.
2. One can’t restrict access to this default folder and its contents.

Note that this recommendation doesn’t apply to Blob or ADLS folders explicitly mounted as DBFS by the
end user.
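
As a companion to this recommendation, the sketch below shows one common way to keep production data out of default DBFS: mounting an ADLS Gen2 container under /mnt using a service principal, so the data stays in a storage account governed by the project. The storage account, container, secret scope and tenant placeholder are all hypothetical.

# Minimal sketch, assuming an ADLS Gen2 account and an AKV-backed secret scope;
# account, container, scope and key names are hypothetical placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("adls-scope", "spn-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("adls-scope", "spn-client-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Production data read or written under /mnt/enriched lives in the project's own
# storage account, not in the workspace-managed default DBFS.
dbutils.fs.mount(
    source="abfss://enriched@mystorageaccount.dfs.core.windows.net/",
    mount_point="/mnt/enriched",
    extra_configs=configs,
)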

ALWAYS HIDE SECRETS IN KEY VAULT AND DO NOT EXPOSE THEM OPENLY IN NOTEBOOKS

Impact: High

It is a significant security risk to expose sensitive data such as access credentials openly in Notebooks or other
places such as job configs, etc. You should instead use a vault to securely store and access them. You can either
use ADB’s internal Key Vault for this purpose or use Azure’s Key Vault (AKV) service.


If using Azure Key Vault, create separate AKV-backed secret scopes and corresponding AKVs to store credentials
pertaining to different data stores. This will help prevent users from accessing credentials that they might not have
access to. Since access controls are applicable to the entire secret scope, users with access to the scope will see
all secrets for the AKV associated with that scope.
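
A minimal sketch of reading a credential from such a scope inside a notebook follows; the scope, key and storage account names are hypothetical placeholders.

# Read the secret from an AKV-backed scope instead of hard-coding it in the notebook.
storage_key = dbutils.secrets.get(scope="akv-datalake-scope", key="storage-account-key")

# The value is redacted if echoed in notebook output, and can be handed to Spark
# configuration without ever appearing in the notebook source.
spark.conf.set(
    "fs.azure.account.key.mystorageaccount.blob.core.windows.net",  # hypothetical account
    storage_key,
)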

DEVELOPING APPLICATIONS ON ADB: GUIDELINES FOR SELECTING CLUSTERS

After understanding how to provision the workspaces, best practices in networking, etc., let’s put on the developer’s
hat and see the design choices typically faced by them:

What type of clusters should I use?


How many drivers and how many workers?

In this chapter we will address such concerns and provide our recommendations, while also explaining the internals
of Databricks clusters and associated topics. Some of these ideas seem counterintuitive but they will all make
sense if you keep these important design attributes of the ADB service in mind:

1. Cloud Optimized: Azure Databricks is a product built exclusively for cloud environments like Azure. No on-
prem deployments currently exist. It assumes certain features are provided by the Cloud, is designed
keeping Cloud best practices in mind, and, conversely, provides Cloud-friendly features.
2. Platform/Software as a Service Abstraction: ADB sits somewhere between the PaaS and SaaS ends of the
spectrum, depending on how you use it. In either case ADB is designed to hide infrastructure details as
much as possible so the user can focus on application development. It is
not, for example, an IaaS offering exposing the guts of the OS Kernel to you.
3. Managed Service: ADB guarantees a 99.95% uptime SLA. There’s a large team of dedicated staff members
who monitor various aspects of its health and get alerted when something goes wrong. It is run like an
always-on website and the staff strives to minimize any downtime.

These three attributes make ADB very different than other Spark platforms such as HDP, CDH, Mesos, etc. which
are designed for on-prem datacenters and allow the user complete control over the hardware. The concept of a
cluster is pretty unique in Azure Databricks. Unlike YARN or Mesos clusters which are just a collection of worker
machines waiting for an application to be scheduled on them, clusters in ADB come with a pre-configured Spark
application. ADB submits all subsequent user requests like notebook commands, SQL queries, Java jar jobs, etc. to
this primordial app for execution. This app is called the “Databricks Shell.”

Under the covers Databricks clusters use the lightweight Spark Standalone resource allocator.

When it comes to taxonomy, ADB clusters are divided along notions of “type”, and “mode.” There are two types of
ADB clusters, according to how they are created. Clusters created using UI are called Interactive Clusters,
whereas those created using Databricks API are called Jobs Clusters. Further, each cluster can be of two modes:
Standard and High Concurrency. All clusters in Azure Databricks can automatically scale to match the workload,
called Autoscaling.

Attribute | Standard Mode | High Concurrency Mode

Targeted User | Data Engineers | Data Scientists, Business Analysts

Languages | Scala, Java, SQL, Python, R | SQL, Python, R

Best Use | Batch Jobs for ETL | Data Exploration

Security Model | Single User/Job Cluster | Shared Cluster

Isolation | Medium | High

Table-level security | No | Yes

Query Preemption | No | Yes


AAD Passthrough | No | Yes

Autoscaling | Yes | Yes

Recommended Concurrency | 1 | 10

CM Resource Allocator | Spark Standalone | Spark Standalone

SUPPORT INTERACTIVE ANALYTICS USING SHARED HIGH CONCURRENCY CLUSTERS

Impact: Medium

There are three steps for supporting Interactive workloads on ADB:

1. Deploy a shared cluster instead of letting each user create their own cluster.
2. Create the shared cluster in High Concurrency mode instead of Standard mode.
3. Configure security on the shared High Concurrency cluster, using one of the following options:
a. Turn on AAD Credential Passthrough if you’re using ADLS
b. Turn on Table Access Control for all other stores

If you’re using ADLS, we currently recommend that you select either Table Access Control or AAD
Credential Passthrough. Do not combine them together.

To understand why, let’s quickly see how interactive workloads are different from batch workloads:

Workload Attribute | Interactive | Batch

Optimization Metric: What matters to end users? | Low execution time: low individual query latency. | Maximizing jobs executed over some time period: high throughput.

Submission Pattern: How is the work submitted to ADB? | By users manually, either executing Notebook queries or exploring data in a connected BI tool. | Automatically submitted by a scheduler or external workflow tool without user input.

Cost: Are the workload's demands predictable? | No. Understanding data via interactive exploration requires a multitude of queries impossible to predict ahead of time. | Yes, because a Job's logic is fixed and doesn't change with each run.

Because of these differences, supporting Interactive workloads entails minimizing cost variability and optimizing for
latency over throughput, while providing a secure environment. These goals are satisfied by shared High
Concurrency clusters with Table access controls or AAD Passthrough turned on (in case of ADLS):

1. Minimizing Cost: By forcing users to share an autoscaling cluster you have configured with maximum node
count, rather than say, asking them to create a new one for their use each time they log in, you can control
the total cost easily. The max cost of shared cluster can be calculated by assuming it is running 24X7 at
maximum size with the particular VMs. You can’t achieve this if each user is given free reign over creating
clusters of arbitrary size and VMs.
2. Optimizing for Latency: Only High Concurrency clusters have features which allow queries from different
users share cluster resources in a fair, secure manner. HC clusters come with Query Watchdog, a process
which keeps disruptive queries in check by automatically pre-empting rogue queries, limiting the maximum
size of output rows returned, etc.
3. Security: The Table Access Control feature is only available in High Concurrency mode and needs to be turned
on so that users can limit access to their database objects (tables, views, functions, ...) created on the shared
cluster. In case of ADLS, we recommend restricting access using the AAD Credential Passthrough feature
instead of Table Access Controls.

That said, irrespective of the mode (Standard or High Concurrency), all Azure Databricks clusters use Spark
Standalone cluster resource allocator and hence execute all Java and Scala user code in the same JVM. A shared
cluster model is secure only for SQL or Python programs because:

1. It is possible to isolate each user’s Spark SQL configuration storing sensitive credentials, temporary tables,
etc. in a Spark Session. ADB creates a new Spark Session for each Notebook attached to a High
Concurrency cluster. If you’re running SQL queries, then this isolation model works because there’s no way
to examine JVM’s contents using SQL.
2. Similarly, PySpark runs user queries in a separate process, so ADB can isolate DataFrames and DataSet
operations belonging to different PySpark users.

In contrast a Scala or Java program from one user could easily steal secrets belonging to another user sharing the
same cluster by doing a thread dump. Hence the isolation model of HC clusters, and this recommendation, only
applies to interactive queries expressed in SQL or Python. In practice this is rarely a limitation because Scala and
Java languages are seldom used for interactive exploration. They are mostly used by Data Engineers to build data
pipelines consisting of batch jobs. Those type of scenarios involve batch ETL jobs and are covered by the next
recommendation.

SUPPORT BATCH ETL WORKLOADS WITH SINGLE USER EPHEMERAL STANDARD CLUSTERS

Impact: Medium

Unlike Interactive workloads, logic in batch Jobs is well defined and their cluster resource requirements are known a
priori. Hence to minimize cost, there’s no reason to follow the shared cluster model and we recommend letting each
job create a separate cluster for its execution. Thus, instead of submitting batch ETL jobs to a cluster already
created from ADB’s UI, submit them using the Jobs APIs. These APIs automatically create new clusters to run Jobs
and also terminate them after running it. We call this the Ephemeral Job Cluster pattern for running jobs because
the cluster's short life is tied to the job lifecycle.

Azure Data Factory uses this pattern as well - each job ends up creating a separate cluster since the underlying call
is made using the Runs-Submit Jobs API.
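
A minimal sketch of the Ephemeral Job Cluster pattern using the Runs-Submit API is shown below; the workspace URL, secret scope, notebook path and cluster sizes are hypothetical placeholders.

import requests

domain = "https://westeurope.azuredatabricks.net/api/2.0"      # hypothetical workspace
headers = {"Authorization": f"Bearer {dbutils.secrets.get('jobs-scope', 'pat')}"}

run_spec = {
    "run_name": "nightly-etl",
    "new_cluster": {                          # created for this run, terminated afterwards
        "spark_version": "6.4.x-scala2.11",
        "node_type_id": "Standard_DS4_v2",
        "num_workers": 8,
    },
    "notebook_task": {"notebook_path": "/Shared/etl/load_sales"},
}

resp = requests.post(f"{domain}/jobs/runs/submit", json=run_spec, headers=headers)
resp.raise_for_status()
print(resp.json())  # returns a run_id; the cluster terminates when the run finishes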


Just like the previous recommendation, this pattern will achieve general goals of minimizing cost, improving the
target metric (throughput), and enhancing security by:

1. Enhanced Security: ephemeral clusters run only one job at a time, so each executor’s JVM runs code from
only one user. This makes ephemeral clusters more secure than shared clusters for Java and Scala code.
2. Lower Cost: if you run jobs on a cluster created from ADB’s UI, you will be charged at the higher Interactive
DBU rate. The lower Data Engineering DBUs are only available when the lifecycle of job and cluster are
same. This is only achievable using the Jobs APIs to launch jobs on ephemeral
clusters.
3. Better Throughput: cluster’s resources are dedicated to one job only, making the job finish faster than while
running in a shared environment.

For very short duration jobs (< 10 min) the cluster launch time (~ 7 min) adds a significant overhead to total
execution time. Historically this forced users to run short jobs on existing clusters created by UI -- a costlier and less
secure alternative. To address this, ADB announced a feature called Warm Pools, now available as Pools (see the
Databricks Cluster Pools section later in this document), bringing cluster launch time down to 30 seconds or less.

FAVOR CLUSTER SCOPED INIT SCRIPTS OVER GLOBAL AND NAMED SCRIPTS

Impact: High

Init scripts provide a way to configure a cluster's nodes and can be used in the following modes:

1. Global: by placing the init script in the /databricks/init folder, you force the script's execution every time any
cluster is created or restarted by users of the workspace.
2. Cluster Named: you can limit the init script to run only for a specific cluster's creation and restarts by
placing it in the /databricks/init/<cluster_name> folder.
3. Cluster Scoped: in this mode the init script is not tied to any cluster by its name, and its automatic execution
does not depend on its DBFS location. Rather, you specify the script in the cluster's configuration, either by writing it
directly or by providing its location on DBFS. Any location under the DBFS /databricks folder except
/databricks/init can be used for this purpose, e.g.
/databricks/<my-directory>/set-env-var.sh

You should treat init scripts with extreme caution because they can easily lead to intractable cluster launch failures.
If you really need them, (a) try to use the Cluster Scoped execution mode as much as possible, and (b) write them
directly in the cluster's configuration rather than placing them on default DBFS and specifying the path (a minimal
sketch follows the list below). We say this because:


1. ADB executes the script’s body in each cluster node’s LxC container before starting Spark’s executor or
driver JVM in it -- the processes which ultimately run user code. Thus, a successful cluster launch and
subsequent operation is predicated on all nodal init scripts executing in a timely manner without any errors
and reporting a zero exit code. This process is highly error prone, especially for scripts downloading artifacts
from an external service over unreliable and/or misconfigured networks.
2. Because Global and Cluster Named init scripts execute automatically due to their placement in a special
DBFS location, it is easy to overlook that they could be causing a cluster to not launch. By specifying the Init
script in the Configuration, there’s a higher chance that you’ll consider them while debugging launch failures.
3. As we explained earlier, all folders inside default DBFS are accessible to workspace users. Your init scripts
containing sensitive data can be viewed by everyone if you place them there.
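
The sketch below (hypothetical names and paths) shows how a cluster-scoped init script can be referenced directly in the cluster specification, which can then be posted to the Clusters API in the same way as the earlier tagging sketch.

# Minimal sketch of a cluster specification with a cluster-scoped init script.
# The script lives under /databricks/<my-directory>/ (not /databricks/init) and is
# referenced explicitly, so it is visible when debugging launch failures.
cluster_spec = {
    "cluster_name": "etl-cluster-with-init",          # hypothetical
    "spark_version": "6.4.x-scala2.11",
    "node_type_id": "Standard_DS4_v2",
    "num_workers": 4,
    "init_scripts": [
        {"dbfs": {"destination": "dbfs:/databricks/project-scripts/set-env-var.sh"}}
    ],
}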

SEND LOGS TO BLOB STORE INSTEAD OF DEFAULT DBFS USING CLUSTER LOG DELIVERY

Impact: Medium

By default, cluster logs are sent to default DBFS, but you should consider sending the logs to a blob store location
using the Cluster Log Delivery feature (a minimal configuration sketch follows the list below). The Cluster Logs
contain logs emitted by user code, as well as the Spark framework's Driver and Executor logs. Sending them to blob
store is recommended over DBFS because:

1. ADB’s automatic 30-day DBFS log purging policy might be too short for certain compliance scenarios. Blob
store is the solution for long term log archival.
2. You can ship logs to other tools only if they are present in your storage account and a resource group
governed by you. The root DBFS, although present in your subscription, is launched inside a Microsoft-Azure
Databricks managed resource group and is protected by a read lock. Because of this lock the logs are only
accessible by privileged Azure Databricks framework code which shows them on UI. Constructing a pipeline
to ship the logs to downstream log analytics tools requires logs to be in a lock-free location first.
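
A minimal configuration sketch follows, assuming a Blob/ADLS container has already been mounted under /mnt/cluster-logs (the mount point and other names are hypothetical); on Azure the log destination is expressed as a DBFS path, so pointing it at a mount of your own storage account achieves the blob-store delivery described above.

# Minimal sketch: cluster specification with Cluster Log Delivery pointed at a
# mounted blob-store location instead of the workspace-managed root DBFS.
cluster_spec = {
    "cluster_name": "etl-cluster-with-log-delivery",   # hypothetical
    "spark_version": "6.4.x-scala2.11",
    "node_type_id": "Standard_DS4_v2",
    "num_workers": 4,
    "cluster_log_conf": {
        "dbfs": {"destination": "dbfs:/mnt/cluster-logs"}   # mount of project-owned storage
    },
}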

CHOOSE CLUSTER VMS TO MATCH WORKLOAD CLASS

Impact: High

To allocate the right amount and type of cluster resource for a job, we need to understand how different types of
jobs demand different types of cluster resources.

Machine Learning - Training machine learning models usually requires caching all of the data in memory.
Consider using memory optimized VMs so that the cluster can take advantage of the RAM cache. To size
the cluster, cache a percentage of the data set, see how much memory it uses, and extrapolate that to the rest of
the data. The Tungsten serializer optimizes data in memory, which means you will need to test the data to see
the relative magnitude of compression (a minimal sizing sketch follows this list).
Streaming - You need to make sure that the processing rate stays just above the input rate at peak times of the
day. Depending on peak input rate times, consider compute optimized VMs for the cluster to make sure the
processing rate is higher than your input rate.
ETL - In this case, data size and how fast the job needs to finish are the leading indicators. Spark doesn't
always require data to be loaded into memory in order to execute transformations, but you will at the very
least need to see how large the task sizes are on shuffles and compare that to the task throughput you would like.
To analyze the performance of these jobs, start with the basics and check whether the job is bound by CPU, network,
or local I/O, and go from there. Consider using a general purpose VM for these jobs.
Interactive / Development Workloads - The ability for a cluster to auto scale is most important for these types
of jobs. Azure Databricks has a cluster manager and Serverless clusters to optimize the size of the cluster during
peak and low times. In this case, taking advantage of Serverless clusters and Autoscaling will be your best
friend in managing the cost of the infrastructure.
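
The following is a minimal sketch of the sampling approach described for Machine Learning workloads above; the path and sample fraction are hypothetical.

# Cache a 5% sample, materialise it, then check its footprint on the Spark UI
# "Storage" tab and extrapolate to the full data set.
sample_df = (spark.read.parquet("/mnt/enriched/training_data")    # hypothetical path
             .sample(withReplacement=False, fraction=0.05, seed=42))

sample_df.cache()
sample_rows = sample_df.count()   # action that materialises the cache

# (size in memory of the cached sample) / 0.05 gives a rough estimate of the RAM
# needed to cache everything, which guides the memory-optimized VM size and node count.
print(f"Cached {sample_rows} rows from a 5% sample; check the Storage tab for its size.")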

ARRIVE AT CORRECT CLUSTER SIZE BY ITERATIVE PERFORMANCE TESTING

Impact: High


It is impossible to predict the correct cluster size without developing the application because Spark and Azure
Databricks use numerous techniques to improve cluster utilization. The broad approach you should follow for sizing
is:

1. Develop on a medium sized cluster of 2-8 nodes, with VMs matched to workload class as explained earlier.
2. After meeting functional requirements, run end to end test on larger representative data while measuring
CPU, memory and I/O used by the cluster at an aggregate level.
3. Optimize cluster to remove bottlenecks found in step 2:
a. CPU bound: add more cores by adding more nodes
b. Network bound: use fewer, bigger SSD backed machines to reduce network size and improve remote
read performance
c. Disk I/O bound: if jobs are spilling to disk, use VMs with more memory.
4. Repeat steps 2 and 3 by adding nodes and/or evaluating different VMs until all obvious bottlenecks have
been addressed.

Performing these steps will help you to arrive at a baseline cluster size which can meet SLA on a subset of data.
Because Spark workloads exhibit linear scaling, you can arrive at the production cluster size easily from here. For
example, if it takes 5 nodes to meet SLA on a 100TB dataset, and the production data is around 1PB, then prod
cluster is likely going to be around 50 nodes in size.

TUNE SHUFFLE FOR OPTIMAL PERFORMANCE

Impact: High

A shuffle occurs when data needs to move from one node to another in order to complete a stage. Depending on
the type of transformation you are doing, you may cause a shuffle to occur; this happens when the executors need
to see all of the data in order to accurately perform the action. If the job requires a wide transformation (e.g. groupBy,
distinct), you can expect it to execute slower because all of the partitions need to be shuffled around in order to
complete the job.


You’ve got two control knobs of a shuffle you can use to optimize

The number of partitions being shuffled:

spark.conf.set("spark.sql.shuffle.partitions", 10)

The amount of partitions that you can compute in parallel.


This is equal to the number of cores in a cluster.

These two determine the partition size, which we recommend should be in the Megabytes to 1 Gigabyte range. If
your shuffle partitions are too small, you may be unnecessarily adding more tasks to the stage. But if they are too
big, you may get bottlenecked by the network.
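
A minimal sketch of using these two knobs together is shown below; the data volumes, paths and column names are hypothetical.

# Suppose roughly 100 GB is shuffled by a wide transformation and we target
# ~256 MB per shuffle partition: 100 GB / 256 MB is about 400 partitions.
spark.conf.set("spark.sql.shuffle.partitions", 400)

# The second knob is parallelism: the cluster's total core count bounds how many
# of those partitions are processed at once (e.g. 10 workers x 8 cores = 80 tasks).
agg_df = (spark.read.parquet("/mnt/enriched/sales")     # hypothetical path
          .groupBy("Region")                            # wide transformation -> shuffle
          .sum("NetInvoiceValue"))

agg_df.write.mode("overwrite").parquet("/mnt/enriched/sales_by_region")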

STORE DATA IN PARQUET PARTITIONS

Impact: High

Azure Databricks has an optimized Parquet reader, enhanced over the open-source Spark implementation, and
Parquet is the recommended data storage format. In addition, storing data in partitions allows you to take advantage
of partition pruning and data skipping. Most of the time partitions will be on a date field, but choose your partitioning
field based on its relevance to the queries the data is supporting. For example, if you're always going to be filtering
based on "Region," then consider partitioning your data by region. A minimal sketch follows the list below.

Evenly distributed data across all partitions (date is the most common)
10s of GB per partition (~10 to ~50GB)
Small data sets should not be partitioned
Beware of over partitioning
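
A minimal sketch of writing date-partitioned Parquet, with hypothetical paths and columns, follows.

# Write Parquet partitioned by a date column so queries filtering on the date
# benefit from partition pruning.
events = spark.read.parquet("/mnt/raw/events")          # hypothetical path

(events
 .write
 .mode("overwrite")
 .partitionBy("Date")            # evenly distributed partition column, commonly a date
 .parquet("/mnt/enriched/events"))

# Readers that filter on the partition column only scan the matching folders.
recent = spark.read.parquet("/mnt/enriched/events").filter("Date >= '2020-07-01'")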

Monitoring

Once you have your clusters setup and your Spark applications running, there is a need to monitor your Azure
Databricks pipelines. These pipelines are rarely executed in isolation and need to be monitored along with a set of
other services. Monitoring falls into four broad areas:

1. Resource utilization (CPU/Memory/Network) across an Azure Databricks cluster. This is referred to as VM


metrics
2. Spark metrics which enables monitoring of Spark applications to help uncover bottlenecks
3. Spark application logs which enables administrators/developers to query the logs, debug issues and
investigate job run failures. This is specifically helpful to also understand exceptions across your workloads
4. Application instrumentation which is native instrumentation that you add to your application for custom
troubleshooting

For the purposes of this version of this document we will focus on (1). This is the most common ask from customers.

COLLECT RESOURCE UTILIZATION METRICS ACROSS AZURE DATABRICKS CLUSTER IN A LOG ANALYTICS WORKSPACE

Impact: Medium

An important facet of monitoring is understanding the resource utilization across an Azure Databricks cluster. You
can also extend this to understand utilization across all your Azure Databricks clusters in a workspace. This could
be useful in arriving at a cluster size and VM sizes given each VM size does have a set of limits (cores/disk
throughput/network throughput) and could play a role in the performance profile of an Azure Databricks job.

In order to get utilization metrics of the Azure Databricks cluster, we use the Azure Linux diagnostic extension as an
init script into the clusters we want to monitor. Note: This could increase your cluster startup time by a minute.


You can reach out to your respective TDA architect to install the Log Analytics agent on Azure Databricks clusters to
collect VM metrics in your Log Analytics workspace.

Querying VM metrics in Log Analytics once you have started the collection using the above document

You can use Log Analytics directly to query the Perf data. Here is an example of a query which charts out CPU for
the VMs in question for a specific cluster ID. See the Log Analytics overview for further documentation on Log Analytics
and query syntax.

Perf
| where TimeGenerated > now() - 7d and TimeGenerated < now() - 6d
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where InstanceName == "_Total"
| where _ResourceId contains "databricks-rg-"
| where Computer has "0408-235319-boss755" //clusterID
| project ObjectName, CounterName, InstanceName, TimeGenerated, CounterValue, Computer
| summarize avg(CounterValue) by bin(TimeGenerated, 1min), Computer
| render timechart


Delta handling

Databricks Delta is a single data management tool that combines the scale of a data lake, the reliability and
performance of a data warehouse, and the low latency of streaming in a single system.

Delta lets organizations remove complexity by getting the benefits of multiple storage systems in one. By combining
the best attributes of existing systems over scalable, low-cost cloud storage, Delta will enable dramatically simpler
data architectures that let organizations focus on extracting value from their data.

The core abstraction of Databricks Delta is an optimized Spark table that

Stores data as Parquet files in DBFS.


Maintains a transaction log that efficiently tracks changes to the table.

I&A Tech recommends use of Databricks Delta for complex delta operations. Seek advice from your Solution
Architect representative.
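
As an illustration only, the sketch below shows the kind of delta operation Databricks Delta is recommended for: an upsert (MERGE) into a Delta table stored on the lake. The paths and join key are hypothetical placeholders.

# Incremental records to merge into the Delta table (hypothetical path).
updates = spark.read.parquet("/mnt/raw/amounts_increment")
updates.createOrReplaceTempView("updates")

# Upsert into the Delta table: update matching EventIds, insert new ones.
spark.sql("""
  MERGE INTO delta.`/mnt/enriched/Amounts/Delta` AS target
  USING updates AS source
  ON target.EventId = source.EventId
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")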


Databricks Delta to DW

Azure SQL Data Warehouse can read the Parquet files produced by Databricks Delta via PolyBase external tables, as shown below.

-- Create a db master key if one does not already exist, using your own password.
CREATE MASTER KEY;

-- Create a database scoped credential.


CREATE DATABASE SCOPED CREDENTIAL [ADLSCredential]
WITH IDENTITY = N'<SPNID>@<OAUTH URL>',
SECRET = '<SECRET>';

CREATE EXTERNAL DATA SOURCE [DataLakeStore]


WITH (
TYPE = HADOOP,
LOCATION = N'adl://neudlstlivewiredev01.azuredatalakestore.net',
CREDENTIAL = [ADLSCredential]
);

CREATE EXTERNAL FILE FORMAT [ParquetFileFormat]


WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = N'org.apache.hadoop.io.compress.SnappyCodec'
);

CREATE SCHEMA [Ext];

CREATE EXTERNAL TABLE [Ext].[Events] (


[action] VARCHAR(100) NULL,
[date] VARCHAR (200) NULL
)
WITH (
DATA_SOURCE = [DataLakeStore],
LOCATION = N'/root/delta/events/',
FILE_FORMAT = [ParquetFileFormat],
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);

SELECT * FROM Ext.Events -- This will throw an error complaining about the number of columns

DROP EXTERNAL TABLE [Ext].[Events] ;

CREATE EXTERNAL TABLE [Ext].[Events] (


[action] VARCHAR(100) NULL


)
WITH (
DATA_SOURCE = [DataLakeStore],
LOCATION = N'/root/delta/events/',
FILE_FORMAT = [ParquetFileFormat],
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);

SELECT * FROM [Ext].[Events]


SELECT count(1) FROM [Ext].[Events]

CREATE EXTERNAL TABLE [Ext].[EventsNoPartition] (


[action] VARCHAR(100) NULL,
[date] VARCHAR (200) NULL
)
WITH (
DATA_SOURCE = [DataLakeStore],
LOCATION = N'/root/delta/events-no-partition/',
FILE_FORMAT = [ParquetFileFormat],
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);

SELECT top 100 * FROM [Ext].[EventsNoPartition]


SELECT count(1) FROM [Ext].[EventsNoPartition]

CREATE EXTERNAL TABLE [Ext].[Amounts] (


[CustomerId] VARCHAR(100) null
, [ProductId] VARCHAR(100) null
, [CurrencyId] VARCHAR(100) null
, [GrossStandardVolume] VARCHAR(100) null
, [NetInvoiceValue] VARCHAR(100) null
, [Date] date null
, [EventId] VARCHAR(100) null
, [BucketKey] int null
)
WITH (
DATA_SOURCE = [DataLakeStore],
LOCATION = N'/root/enriched/Amounts/Delta',
FILE_FORMAT = [ParquetFileFormat],
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);

SELECT top 100 * FROM [Ext].[Amounts]


SELECT COUNT(1) FROM [Ext].[Amounts]


DROP EXTERNAL TABLE [Ext].[Amounts]

CREATE EXTERNAL TABLE [Ext].[Amounts] (


[CustomerId] VARCHAR(100) null
, [ProductId] VARCHAR(100) null
, [CurrencyId] VARCHAR(100) null
, [GrossStandardVolume] VARCHAR(100) null
, [NetInvoiceValue] VARCHAR(100) null
, [Date] date null
, [EventId] VARCHAR(100) null
)
WITH (
DATA_SOURCE = [DataLakeStore],
LOCATION = N'/root/enriched/Amounts/Delta',
FILE_FORMAT = [ParquetFileFormat],
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);

SELECT top 100 * FROM [Ext].[Amounts]


Databricks Cluster Pools


What are Cluster Pools?
How does a Pool work?
What are the advantages of Pools?
What’s the value of this feature? (e.g. without pools vs. with pools)
Key Cluster Pools Configurations
Orchestrating Pools with Instance Pools REST API from a Notebook
Documentation
Notebook Setup
List all Pools in the workspace
Create a new Pool
Get Pool info using Pool ID
Edit an existing Pool
Delete an existing Pool

What are Cluster Pools?

Pools hold warm instances so Clusters (Automated or Interactive) can start blazingly fast.

How does a Pool work?

To reduce cluster start time, you can attach a cluster to a predefined Pool of idle instances.
When attached to a Pool, a cluster allocates its driver and worker nodes from the pool.
If the Pool does not have sufficient idle resources to accommodate the cluster’s request, the Pool expands
by allocating new instances from Azure.
When an attached cluster is terminated, the instances it used are returned to the Pool and can be reused by
a different cluster.

What are the advantages of Pools?

Without Pools, acquiring instances usually takes 5-10 minutes. This could mean a 1-minute job takes
11 minutes overall.
With Pools, the same job will typically start in around 10 seconds.

What’s the value of this feature? (e.g. without pools vs. with pools)

Pools make clusters start and scale much faster across all workloads.

Refer to the Databricks documentation on how to create and manage pools.


https://docs.microsoft.com/en-us/azure/databricks/clusters/instance-pools/create

Key Cluster Pools Configurations

Minimum Idle Instances

The minimum number of instances the pool keeps idle. These instances do not terminate, regardless of the
setting specified in Idle Instance Auto Termination. If a cluster consumes idle instances from the pool, Azure
Databricks provisions additional instances to maintain the minimum. Projects could use scripts to reduce this
to a lower number, and practically to zero when not in use, based on their job frequency.

Maximum Capacity

The maximum number of instances that the pool will provision. If set, this value constrains all instances (idle
+ used). If a cluster using the pool requests more instances than this number during autoscaling, the request
will fail with an INSTANCE_POOL_MAX_CAPACITY_FAILURE error. You should restrict the capacity to
ensure the project stays within a defined cost and capacity threshold on the subscription.

Idle Instance Auto Termination

The time in minutes that instances above the value set in Minimum Idle Instances can be idle before being
terminated by the pool.

Ideally set this time to 5 or 10 minutes. This ensures we don't keep many idle VMs in the pool and incur cost
for them.

Instance types

A pool consists of both idle instances kept ready for new clusters and instances in use by running clusters.
All of these instances are of the same instance provider type, selected when creating a pool.

A pool's instance type cannot be edited. Clusters attached to a pool use the same instance type for the driver
and worker nodes.
Based on the workloads you run, look at the recommended cluster configurations and choose the right VM types.

Pool Tags

Ensure you add appropriate tags while creating pools for easier tracking and analysis. Follow the Tagging
guidelines in the design standards page.

Orchestrating Pools with Instance Pools REST API from a Notebook

Please note that the process described in this section can be used by projects that want to manage the creation,
editing, and deletion of pools using Databricks notebooks. There are various code samples below, and a combination
of them can be used to serve your purpose.

This process makes use of a Databricks Token belonging to a user who has admin permissions on
Databricks workspace. This token can be used to do almost anything on the workspace, hence be careful
while using it.

Azure Databricks pools reduce cluster start and auto-scaling times by maintaining a set of idle, ready-to-use
instances.
When a cluster attached to a pool needs an instance, it first attempts to allocate one of the pool’s idle
instances.
If the pool has no idle instances, the pool expands by allocating a new instance from the instance provider in
order to accommodate the cluster’s request.
When a cluster releases an instance, it returns to the pool and is free for another cluster to use. Only clusters
attached to a pool can use that pool’s idle instances.


DOCUMENTATION

Instance Pools - Intro -- https://docs.microsoft.com/en-us/azure/databricks/clusters/instance-pools/

Instance Pools - REST API -- https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/instance-pools

Please note we are using Databricks Secrets for the Bearer Token in this notebook. The key is bearer_token and it
is created in a Databricks-backed scope: pools-secrets-scope.

NOTEBOOK SETUP

import requests
import json
from string import Template

# Read bearer_token from the secrets scope. Please note you'll have to
# create this scope and secret in the vault.
bearer_token = dbutils.secrets.get("pools-secrets-scope", "bearer_token")

# Change the domain based on where your Databricks instance is hosted.
domain = "https://westus2.azuredatabricks.net/api/2.0"
headers = {"Authorization": f"Bearer {bearer_token}"}


def log_response(resp):
    """
    Logs the JSON response to the console
    :param resp: JSON response object.
    :return: None.
    """
    print(json.dumps(resp.json(), indent=2))
    return None


class ApiError(Exception):
    """An API Error Exception"""

    def __init__(self, status):
        self.status = status

    def __str__(self):
        return "ApiError: {}".format(self.status)

LIST ALL POOLS IN THE WORKSPACE


def list_pools():
    endpoint = "/instance-pools/list"

    resp = requests.get(domain + endpoint, headers=headers)

    if resp.status_code != 200:
        raise ApiError(f"Request to: '{endpoint}' -- status code: {resp.status_code}")
    else:
        log_response(resp)

CREATE A NEW POOL

endpoint = "/instance-pools/create"
pool_name = "restapi-pool-prashanth"

template = Template("""{"instance_pool_name": "${pool_name}",


"node_type_id": "Standard_D3_v2",
"min_idle_instances": "2",
"max_capacity": "5",
"idle_instance_autotermination_minutes": "10"
}
"""
)
data = template.substitute(pool_name=pool_name)

pool_id = ""
resp = requests.post(domain + endpoint, data=data, headers=headers)
if resp.status_code != 200:
raise ApiError(f"Request to: '{endpoint}' -- status code: {resp.
status_code}")
else:
log_response(resp)
pool_id = resp.json()['instance_pool_id']

GET POOL INFO USING POOL ID

endpoint = "/instance-pools/get"

template = Template('{"instance_pool_id": "${pool_id}"}')


data = template.substitute(pool_id=pool_id)

resp = requests.get(domain + endpoint, data=data, headers=headers)


if resp.status_code != 200:
raise ApiError(f"Request to: '{endpoint}' -- status code: {resp.
status_code}")

Version 9.1 Published on 5th August 2020 Page 95 of 321


I&A Azure Solution Architecture Guidelines

else:
log_response(resp)

EDIT AN EXISTING POOL

endpoint = "/instance-pools/edit"

template = Template("""{"instance_pool_id": "${pool_id}",


"instance_pool_name": "test-pool-prashanth",
"node_type_id": "Standard_D3_v2",
"min_idle_instances": "0",
"max_capacity": "3",
"idle_instance_autotermination_minutes": "30"
}
"""
)
data = template.substitute(pool_id=pool_id)

resp = requests.post(domain + endpoint, data=data, headers=headers)


if resp.status_code != 200:
raise ApiError(f"Request to: '{endpoint}' -- status code: {resp.
status_code}")
else:
log_response(resp)

DELETE AN EXISTING POOL

endpoint = "/instance-pools/delete"

template = Template('{"instance_pool_id": "${pool_id}"}')


data = template.substitute(pool_id=pool_id)

resp = requests.post(domain + endpoint, data=data, headers=headers)


if resp.status_code != 200:
raise ApiError(f"Request to: '{endpoint}' -- status code: {resp.
status_code}")
else:
log_response(resp)


Section 2.2 - Azure Data Factory V2


Overview

Azure Data Factory v2 (ADF) will be the primary tool used for the orchestration of data into ADLS. Azure Data
Factory is a hybrid data integration service that allows you to create, schedule and orchestrate your ETL/ELT
workflows at scale wherever your data lives, in the cloud or a self-hosted network. Meet your security and
compliance needs while taking advantage of ADF’s extensive capabilities. Azure Data Factory is used to create and
schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores. It can process
and transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake
Analytics, and Azure Machine Learning

Azure Data Factory is the platform that solves such data scenarios. It is the cloud-based ETL and data integration
service that allows you to create data-driven workflows for orchestrating data movement and transforming data at
scale. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can
ingest data from disparate data stores. You can build complex ETL processes that transform data visually with data
flows or by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database.

Additionally, you can publish your transformed data to data stores such as Azure SQL Data Warehouse for
business intelligence (BI) applications to consume. Ultimately, through Azure Data Factory, raw data can be
organized into meaningful data stores and data lakes for better business decisions.

Connect and Collect

Enterprises have data of various types that are located in disparate sources on-premises, in the cloud, structured,
unstructured, and semi-structured, all arriving at different intervals and speeds.

The first step in building an information production system is to connect to all the required sources of data and
processing, such as software-as-a-service (SaaS) services, databases, file shares, and FTP web services. The next
step is to move the data as needed to a centralized location for subsequent processing.

With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and cloud
source data stores to a centralized data store in the cloud for further analysis. For example, you can collect data
in Azure Data Lake Storage and transform the data later by using an Azure Data Lake Analytics compute service.
You can also collect data in Azure Blob storage and transform it later by using an Azure HDInsight Hadoop cluster.

Transform and enrich

After data is present in a centralized data store in the cloud, process or transform the collected data by using ADF
mapping data flows. Data flows enable data engineers to build and maintain data transformation graphs that
execute on Spark without needing to understand Spark clusters or Spark programming.

If you prefer to code transformations by hand, ADF supports external activities for executing your transformations
on compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, Machine Learning, etc. Azure
Databricks and ADF data flows are the preferred options for transforming or enriching the data.

CI/CD and publish

Data Factory offers full support for CI/CD of your data pipelines using Azure DevOps and GitHub. This allows you to
incrementally develop and deliver your ETL processes before publishing the finished product. After the raw data
has been refined into a business-ready consumable form, load the data into Azure Data Warehouse, Azure SQL
Database, Azure CosmosDB, or whichever analytics engine your business users can point to from their business
intelligence tools.

Use configuration files to assist in deploying to multiple environments (dev/test/prod). Configuration files are used in
Data Factory projects in Visual Studio. When Data Factory assets are published, Visual Studio uses the content in
the configuration file to replace the specified JSON attribute values before deploying to Azure. A Data Factory
configuration file is a JSON file that provides a name-value pair for each attribute that changes based upon the environment to which you are deploying. This could include connection strings, usernames, passwords, pipeline
start and end dates, and more. When you publish from Visual Studio, you can choose the appropriate deployment
configuration through the deployment wizard.

Monitor

After you have successfully built and deployed your data integration pipeline, providing business value from refined
data, monitor the scheduled activities and pipelines for success and failure rates. Azure Data Factory has built-in
support for pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and health panels on the
Azure portal.
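
Pipeline run history can also be pulled programmatically. The snippet below is an illustrative sketch only, assuming the azure-mgmt-datafactory and azure-identity Python SDKs; the subscription, resource group and factory names are placeholders.

# Sketch: list pipeline runs from the last 24 hours for one data factory.
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Filter on the last-updated window (both bounds are required by the API).
filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow(),
)

runs = client.pipeline_runs.query_by_factory("<resource-group>", "<data-factory-name>", filters)
for run in runs.value:
    print(run.pipeline_name, run.status, run.run_start, run.message)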

An Azure subscription can have multiple resource groups based on environment.

Resource groups might have one or more Azure Data Factory instances (or data factories) based on project
requirements. Azure Data Factory is composed of the following key components.

Pipeline

A data factory might have one or more pipelines. A pipeline is a logical grouping of activities that performs a unit of
work. Together, the activities in a pipeline perform a task. For example, a pipeline can contain a group of activities
that ingests data from an Azure blob, and then runs a Hive query on an HDInsight cluster to partition the data.


A pipeline allows you to manage the activities as a set instead of managing each one individually. The activities in a
pipeline can be chained together to operate sequentially, or they can operate independently in parallel.

Mapping data flows

Create and manage graphs of data transformation logic that can be used to transform any-sized data. Build up a reusable library of data transformation routines and execute those processes in a scaled-out manner from your ADF pipelines. Data Factory will execute the logic on a Spark cluster that spins up and spins down based on the requirement.

Activity

Activities represent a processing step in a pipeline. For example, use a copy activity to copy data from one data
store to another data store. Similarly, you might use a Hive activity, which runs a Hive query on an Azure HDInsight
cluster, to transform or analyze your data. Data Factory supports three types of activities: data movement activities,
data transformation activities, and control activities.

Datasets

Datasets represent data structures within the data stores, which simply point to or reference the data you want to
use in your activities as inputs or outputs.

Linked services

Linked services are much like connection strings, which define the connection information that's needed for Data
Factory to connect to external resources. A linked service defines the connection to the data source, and a dataset
represents the structure of the data. For example, an Azure Storage-linked service specifies a connection string to
connect to the Azure Storage account. Additionally, an Azure blob dataset specifies the blob container and the
folder that contains the data.

The credentials of linked services should be stored in Azure Key Vault (AKV).

Triggers

Triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off. There
are different types of triggers for different types of events.

Pipeline runs

A pipeline run is an instance of the pipeline execution. Pipeline runs are typically instantiated by passing the
arguments to the parameters that are defined in pipelines. The arguments can be passed manually or within the
trigger definition.

Parameters

Parameters are key-value pairs of read-only configuration. Parameters are defined in the pipeline. The arguments
for the defined parameters are passed during execution from the run context that was created by a trigger or a
pipeline that was executed manually. Activities within the pipeline consume the parameter values.

A dataset is a strongly typed parameter and a reusable/referenceable entity. An activity can reference datasets and can consume the properties that are defined in the dataset definition.

A linked service is also a strongly typed parameter that contains the connection information to either a data store or a compute environment. It is also a reusable/referenceable entity.

Control flow


Control flow is an orchestration of pipeline activities that includes chaining activities in a sequence, branching,
defining parameters at the pipeline level, and passing arguments while invoking the pipeline on-demand or from a
trigger. It also includes custom-state passing and looping containers, that is, For-each iterators.

Variables

Variables can be used inside of pipelines to store temporary values and can also be used in conjunction with
parameters to enable passing values between pipelines, data flows, and other activities.

Naming Standards

LINKED SERVICES AND DATASETS

Linked Service | Linked Service Prefix | Dataset Example
Azure Blob storage | LS_ABLB_ | DS_ABLB_{Purpose}
Azure Data Lake Store | LS_ADLS_ | DS_ADLS_{Purpose}
Azure SQL Database | LS_ASQL_ | DS_ASQL_{Purpose}
Azure SQL Data Warehouse | LS_ASDW_ | DS_ASDW_{Purpose}

PIPELINES

Type | Name | Example
Transformation | PL_TRAN_{Purpose} | PL_TRAN_PROCESS_DIMENSIONS
Transformation | PL_TRAN_{Purpose} | PL_TRAN_HELPER_EXECUSQL

ACTIVITY

{ActivityShortName}_{Purpose}
Ex: LK_LogStart, where LK = Lookup activity and LogStart is the purpose

Activity | Name | Example
Lookup | LK_{Purpose} | LK_LogStart
Filter | FL_{Purpose} | FL_FilterFiles
Copy Data | CD_{Purpose} | CD_CopyFiles
Stored Procedure | SP_{Purpose} | SP_LogEnd
ForEach | FE_{Purpose} | FE_ForEachFileMapping
If Condition | IF_{Purpose} | IF_CheckFilePattern
Get Metadata | GM_{Purpose} | GM_GetAllFilesName_Regex
Set Variable | SV_{Purpose} | SV_GetTheActualFileName

VARIABLES

For variable naming, follow lowerCamelCase.


Ex: sourceSystemCode = "SAP"


Best Practices

Should have a function-specific Azure Data Factory based on data sensitivity
Should not exceed 5,000 objects in an ADF instance
Should follow the standard naming conventions for all objects
Use service principal authentication in the linked service to connect to Azure Data Lake Store (see the sketch after this list). Azure Data Lake Store uses Azure Active Directory for authentication, and service principal authentication in an Azure Data Factory linked service alleviates some of the issues where tokens expire at inopportune times and removes the need to manage another unattended service account in Azure Active Directory. Creating the service principal can be automated using PowerShell.
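
The sketch below illustrates the service principal recommendation above, with the key kept in Azure Key Vault. It is an assumption-based example using the azure-mgmt-datafactory SDK; all names and IDs are placeholders and model signatures may vary slightly between SDK versions.

# Sketch: register an ADLS linked service that authenticates with a service principal.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureDataLakeStoreLinkedService,
    AzureKeyVaultSecretReference,
    LinkedServiceReference,
    LinkedServiceResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

adls_ls = AzureDataLakeStoreLinkedService(
    data_lake_store_uri="adl://<datalake-account>.azuredatalakestore.net",
    service_principal_id="<spn-application-id>",
    # The key is resolved from Key Vault at runtime and never stored in ADF.
    service_principal_key=AzureKeyVaultSecretReference(
        store=LinkedServiceReference(type="LinkedServiceReference", reference_name="LS_KYVA_KeyVault"),
        secret_name="<spn-secret-name>",
    ),
    tenant="<tenant-id>",
)

client.linked_services.create_or_update(
    "<resource-group>", "<data-factory-name>", "LS_ADLS_DataLake",
    LinkedServiceResource(properties=adls_ls),
)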

Integration Runtime

Azure Integration Runtime (IR) (formerly DMG) is the compute infrastructure used by Azure Data Factory to provide
the following data integration capabilities across different network environments:

Data movement: Move data between data stores in a public network and data stores in a private network (on-premises or virtual private network). It provides support for built-in connectors, format conversion, column mapping, and performant and scalable data transfer.

Activity dispatch: Dispatch and monitor transformation activities running on a variety of compute services such as
Azure HDInsight, Azure Machine Learning, Azure SQL Database, SQL Server, and more.

SSIS package execution: Natively execute SQL Server Integration Services (SSIS) packages in a managed Azure
compute environment.

In Data Factory, an activity defines the action to be performed. A linked service defines a target data store or a
compute service. An integration runtime provides the bridge between the activity and linked Services. It is
referenced by the linked service, and provides the compute environment where the activity either runs on or gets
dispatched from. This way, the activity can be performed in the region closest possible to the target data store or
compute service in the most performant way while meeting security and compliance needs.

The following IR types are approved on a case-by-case basis:

IR type | Public network | Private network
Azure | Data Flow, Data movement, Activity dispatch |
Self-hosted | Data movement, Activity dispatch | Data movement (from private network), Activity dispatch
Azure-SSIS | SSIS package execution | SSIS package execution

IR NAMING STANDARD

For IR naming, use lowercase, following the resource group name with a unique identifier as shown below; the standard naming convention for each environment should be followed.
Ex: bieno-da-d-80011-dfgw-05

Azure IR

By default, each data factory has an Azure IR in the backend that supports operations on cloud data stores and compute services in a public network. The location of that Azure IR is auto-resolve. If the connectVia property is not specified in the linked service definition, the default Azure IR is used. You only need to create an Azure IR explicitly when you would like to define the location of the IR, or if you would like to virtually group the activity executions on different IRs for management purposes.


Self Hosted IR

CONSIDERATIONS FOR USING A SELF-HOSTED IR

You can use a single self-hosted integration runtime for multiple on-premises data sources. You can also
share it with another data factory within the same Azure Active Directory (Azure AD) tenant.
You can install only one instance of a self-hosted integration runtime on any single machine. If you have two
data factories that need to access on-premises data sources, either use the self-hosted IR sharing feature to
share the self-hosted IR, or install the self-hosted IR on two on-premises computers, one for each data
factory.
The self-hosted integration runtime doesn't need to be on the same machine as the data source. However,
having the self-hosted integration runtime close to the data source reduces the time for the self-hosted
integration runtime to connect to the data source. We recommend that you install the self-hosted integration
runtime on a machine that differs from the one that hosts the on-premises data source. When the self-hosted
integration runtime and data source are on different machines, the self-hosted integration runtime doesn't
compete with the data source for resources.
You can have multiple self-hosted integration runtimes on different machines that connect to the same on-premises data source. For example, if you have two self-hosted integration runtimes that serve two data factories, the same on-premises data source can be registered with both data factories.
If you already have a gateway installed on your computer to serve a Power BI scenario, install a separate
self-hosted integration runtime for Data Factory on another machine.
Use a self-hosted integration runtime to support data integration within an Azure virtual network.
Treat your data source as an on-premises data source that is behind a firewall, even when you use Azure
Express Route. Use the self-hosted integration runtime to connect the service to the data source.
Use the self-hosted integration runtime even if the data store is in the cloud on an Azure Infrastructure as a
Service (IaaS) virtual machine.

SCALE CONSIDERATIONS

Scale out: When processor usage is high and available memory is low on the self-hosted IR, add a new node
to help scale out the load across machines. If activities fail because they time out or the self-hosted IR node
is offline, it helps if you add a node to the gateway.
Scale up: When the processor and available RAM aren't well utilized, but the execution of concurrent jobs
reaches a node's limits, scale up by increasing the number of concurrent jobs that a node can run. You might
also want to scale up when activities time out because the self-hosted IR is overloaded.

Azure-SSIS Integration Runtime

To lift and shift an existing SSIS workload, you can create an Azure-SSIS IR to natively execute SSIS packages. An Azure-SSIS IR can be provisioned in either a public network or a private network. On-premises data access is supported by joining the Azure-SSIS IR to a virtual network that is connected to your on-premises network.

AZURE-SSIS IR COMPUTE RESOURCE AND SCALING

Azure-SSIS IR is a fully managed cluster of Azure VMs dedicated to run SSIS packages. You can bring your own
Azure SQL Database or Managed Instance server to host the catalog of SSIS projects/packages (SSISDB) that is
going to be attached to it. You can scale up the power of the compute by specifying node size and scale it out by
specifying the number of nodes in the cluster. You can manage the cost of running your Azure-SSIS Integration
Runtime by stopping and starting it as you see fit.

Performance Tuning

Refer to the following link on Microsoft Docs to understand more about ADF copy activity performance tuning:

https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance


Azure Data Factory Best Practices


Implement Pipeline Hierarchies
Grandparent Pipelines
Parent Pipelines
Child Pipelines
Infant Pipelines (Optional)
Organise Data Factory
Linked Service Security via Azure Key Vault
Use Generic Datasets
Use Factory Component Folders
Pipeline & Activity Descriptions
CI/CD lifecycle

IMPLEMENT PIPELINE HIERARCHIES

Grandparent Pipelines

Firstly, the grandparent pipeline, the most senior level of our ADF pipelines. Our approach here would be to build
and consider two main operations:

1. Attaching Data Factory Triggers to start our solution execution. Either scheduled or event based. From Logic
Apps or called by PowerShell etc. The grandparent starts the processing.
2. Grouping the execution of our processes, either vertically through the layers of our logical data warehouse or maybe horizontally from ingestion to output. In each case we need to handle the high-level dependencies within our wider platform.

We might have something like the below.

In the above mock-up pipeline I’ve used extract, transform and load (ETL) as a common example for where we
would want all our data ingestion processes to complete before starting any downstream pipelines.

You might also decide to control the scale and state of the services we are about to invoke. For example, when working with:

Azure SQL Database (SQLDB), scale it up ready for processing (DTU’s).


Azure SQL Data Warehouse (SQLDW), start the cluster and set the scale (DWU’s).
Azure Analysis Service, resume the compute, maybe also sync our read only replica databases and pause
the resource if finished processing.
Azure Databricks, start up the cluster if interactive.

Parent Pipelines

Next, at the parent level we read metadata about the processes that need to run, and the different configurations for each of those executions. A metadata-driven approach is vital for the scale-out processing needed for parallel execution. To support and manage the parallel execution of our child transformations/activities, the Data Factory ForEach activity helps. Let's think about these examples, when working with:


Azure SQLDB or Azure SQLDW, how many stored procedures do we want to execute at once.
Azure Databricks, how many notebooks do we want to execute.
Azure Analysis Service, how many models do we want to process at once.

Using this hierarchical structure, we aim to call the first stage ForEach activity which will contain calls to child
pipeline(s).

Child Pipelines

Next, at our child level we handle the actual execution of our data transformations. Plus, the nesting of the ForEach
activities in each parent and child level then gives us the additional scale out processing needed for some services.

At this level we are getting the configurations for each child run passed from the parent level. This is where we will be running the lowest level transformation operations against the given compute. Logging the outcome at each stage should also happen at this level.


Infant Pipelines (Optional)

Our infants contain reusable handlers and calls that could potentially be used at any level in our solution. The best example of an infant is an ‘Error Handler’ that does a bit more than just call a stored procedure.

If created in Data Factory, we might have an ‘Error Handler’ infant/utility containing something like the below.

ORGANISE DATA FACTORY

Linked Service Security via Azure Key Vault

Azure Key Vault is now a core component of any solution; it should be in place holding the credentials for all our service interactions. In the case of Data Factory, most linked service connections support obtaining values from Key Vault. Wherever possible we should include this extra layer of security and allow only Data Factory to retrieve secrets from Key Vault using its own managed identity.

Use Generic Datasets

Where design allows it, always try to simplify the number of datasets listed in a Data Factory. In version 1 of the
product a hard coded dataset was required as the input and output for every stage in our processing. Thankfully
those days are in the past. Now we can use a completely metadata driven dataset for dealing with a particular type
of object against a linked service. For example, one dataset of all CSV files from Blob Storage and one dataset for
all SQLDB tables.

At runtime the dynamic content underneath the datasets is created in full, so monitoring is not impacted by making datasets generic. If anything, debugging becomes easier because of the common/reusable code.

Where generic datasets are used, pass the following values as parameters, typically from the pipeline or resolved at runtime within the pipeline (see the sketch after this list):

Location – the file path, table location or storage container.


Name – the file or table name.


Structure – the attributes available, provided as an array at runtime.
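
As a hedged illustration of such a generic dataset, the Python dict below mirrors the JSON an ADF delimited-text dataset would contain, with Location and Name exposed as dataset parameters; the linked service and dataset names are only examples following the naming standard earlier in this section.

# Sketch: definition of a generic delimited-text dataset (DS_ABLB_GenericCsv).
import json

generic_csv_dataset = {
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "LS_ABLB_DataLake",
            "type": "LinkedServiceReference",
        },
        # Supplied by the calling pipeline at runtime.
        "parameters": {
            "Location": {"type": "string"},
            "Name": {"type": "string"},
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": {"value": "@dataset().Location", "type": "Expression"},
                "fileName": {"value": "@dataset().Name", "type": "Expression"},
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": True,
        },
    }
}

print(json.dumps(generic_csv_dataset, indent=2))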

Use Factory Component Folders

Folders and sub-folders are such a great way to organise our Data Factory components, we should all be using
them to help ease of navigation. Be warned though, these folders are only used when working with the Data
Factory portal UI. They are not reflected in the structure of our source code repo.

Adding components to folders is a very simple drag and drop exercise or can be done in bulk if you want to attack
the underlying JSON directly. Subfolders get applied using a forward slash, just like other file paths.

Pipeline & Activity Descriptions

Every pipeline and activity within Data Factory has a non-mandatory description field. I want to encourage all of us to start making better use of it. When writing any other code we typically add comments to offer others our understanding. I want to see these description fields used in ADF in the same way.

CI/CD LIFECYCLE

Below is a sample overview of the CI/CD lifecycle in an Azure data factory that's configured with Azure Repos Git.
For more information on how to configure a Git repository, see Source control in Azure Data Factory.

1. A development data factory is created and configured with Azure Repos Git. All developers should have
permission to author Data Factory resources like pipelines and datasets.
2. As the developers make changes in their feature branches, they debug their pipeline runs with their most
recent changes.
3. After the developers are satisfied with their changes, they create a pull request from their feature branch to
the master or collaboration branch to get their changes reviewed by peers.
4. After a pull request is approved and changes are merged in the master branch, the changes can be
published to the development factory.
5. When the team is ready to deploy the changes to the test factory and then to the production factory, the team
exports the Resource Manager template from the master branch.
6. The exported Resource Manager template is deployed with different parameter files to the test factory and
the production factory.

Here is the link to a video that takes us through these steps
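
As a complement to steps 5 and 6 above, the sketch below shows one way the exported Resource Manager template could be deployed with an environment-specific parameter file. It is an assumption-based example using the azure-mgmt-resource SDK; the parameter file name, resource group and subscription are placeholders.

# Sketch: deploy the exported ADF ARM template to a target environment.
import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.resource.resources.models import Deployment, DeploymentProperties

client = ResourceManagementClient(DefaultAzureCredential(), "<target-subscription-id>")

with open("ARMTemplateForFactory.json") as f:
    template = json.load(f)
with open("ARMTemplateParametersForFactory.test.json") as f:
    # The API expects only the inner "parameters" object of a parameter file.
    parameters = json.load(f)["parameters"]

deployment = Deployment(
    properties=DeploymentProperties(mode="Incremental", template=template, parameters=parameters)
)

poller = client.deployments.begin_create_or_update("<target-resource-group>", "adf-release", deployment)
poller.wait()
print(poller.result().properties.provisioning_state)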


Ingestion Patterns

This page delivers a graphical representation of the Ingestion Design Patterns agreed for each different source
system.

APPROVED SOURCE INGESTION PATTERNS

DATA SOURCE | INTERNAL/EXTERNAL | SOURCE TYPE/TECHNOLOGY | STAGING LAYER | EXTRACTION METHOD | CONNECTION DRIVER

Cordillera, U2K2, Fusion-BW | Internal | SAP BW | Open Hub File Destination | ADF + IR | File Connector
SAP ECC (Cordillera, Sirius, U2K2, Fusion) | Internal | SAP ECC (DB2/Oracle) | SAP HANA | ADF + IR |
APO (S1) | Internal | SAP ABAP | BLOB | ADF Direct | BODS
APO (S2) | Internal | SAP ABAP | SAP HANA | ADF + IR | File Connector
GMRDR | Internal | Oracle | ADLS | ADF Direct + IR | ODBC Driver
CRM (S1) | Internal | SAP ABAP | BLOB | ADF Direct | BODS
CRM (S2) | Internal | SAP ABAP | SAP HANA | ADF + IR | File Connector
PLM (S2) | Internal | SAP ABAP | SAP HANA | ADF + IR | File Connector
Manual Mapping Files (S0) | Internal/External | CSV, flat files (excl. Excel) | FMT | ADF Direct | Blob
Manual Mapping Files (S1) | Internal/External | CSV, flat files (excl. Excel) | Blob / ADLS Generation 2 | ADF Direct | ADF Connector
Manual Mapping Files (S2) | External | CSV, flat files (excl. Excel) | External SFTP | ADF | FTP – Source
Manual Mapping Files (S4) | External | Excel | Blob / ADLS Generation 2 / External SFTP / Landing zone | ADF SSIS IR |
Victory | Internal | SAP BW | Open Hub File Destination | ADF + IR | File Connector
Leveredge BW (Online Countries) | External | Files | OnPrem File Share | ADF + IR | File Connector
Leveredge ISR (Offline Countries) | External | Files | OnPrem File Share | ADF + IR | File Connector
Newspage | External | SFTP | NA | ADF + IR (NAT Gateway) | SFTP connector
Darwin | External | Fractal Cloud | Blob | ADF Direct | ADF Connector
DMS (Distributor management system) | Internal | SQL Server | NA | ADF Direct |
Merlin | | Oracle | | ADF | FTP – Source
Manual Mapping Files (S5) | Internal | CSV - Fileshare | ADLS | ADF + API call (Custom activity) |
Neogrid | | | | ADF | FTP – Source
WalMart | External | Files on Blob storage | Blob | ADF | FTP – Source
Infoscout | External | Files on Blob storage | Blob | ADF | FTP – Source
Target | External | Files on Blob storage | Blob | ADF | FTP – Source
SA Business | Internal | File | | Manual upload |
IMRB (Kantar) | External | | | Manual upload |
Customer Service | | | | ADF | FTP – Source
CODA | Internal | Oracle | NA | ADF Direct |
Global Cash Up | External | AWS | | ADF | FTP – Source
Enterprise Forecast | | | | ADF | FTP – Source
Work Day | External | | | Manual upload |
SAP Fieldglass | SaaS | NA | ADLS | ADF Direct | Web API
File | External/Internal | SFTP, Central File Share | NA | ADF, ADF Direct | FTP
Teradata | Internal | Teradata | NA | ADF Direct |
Nielsen-Cubes | External | Files | Nielson SFTP server | ADF Direct |
Nielsen DAR | External | Files | Blob | ADF Direct |
Victory File | | | | ADF | FTP – Source
Connect MIP File | | | | ADF | FTP – Source
Nielsen AOD | External | Files | OnPrem File Share | ADF + IR |
Kantar | | | | ADF | FTP – Source
Lowe | | | | ADF | FTP – Source
Teradata [Mindshare] | | | | ADF | FTP – Source
Pine | | | | ADF | FTP – Source
Mindshare | | | | Manual upload |
Milward Brown | | | | Manual upload |
IMRB | | | | Manual upload |
IPM - ASQL | Internal | Azure SQL | NA | ADF Direct |
SAP APO | Internal | SAP APO | SAP HANA | ADF Direct |
WeatherTrends | External | CSV - SFTP | NA | ADF Direct |
PLM | Internal | SAP | SAP HANA | ADF Direct |
Controller Site | | Excel | ADF SSIS IR | ADF Direct |
LTP SQL | Internal | SQL | NA | ADF Direct |
SAP CRM | | SAP CRM | SAP HANA | ADF Direct |
SAP AR Collect | | SAP AR Collect | SAP HANA | ADF Direct |
SAP EM | | SAP EM | SAP HANA | ADF Direct |
BPC-BW | Internal | SAP BW | Open Hub File Destination | ADF + IR | File Connector
UDW - Kalido | Internal | Oracle | NA | ADF + IR | ODBC Driver
SAP - CLM | Internal | SAP - CLM | SAP HANA | ADF Direct |
Cordillera APO | Internal | SAP APO | SAP HANA | ADF Direct |
Ariba - COUPA | Internal | Ariba | ADLS | SFTP | ADF Connector
Ariba - SRS | Internal | Ariba | API Integration | ADF | FTP
Oracle Transport Management (OTM) | Internal | Oracle | ADLS Staging | Golden Gate connector + ADF |


Anaplan Integration with Azure

INTRODUCTION

Anaplan is a planning platform that enables organizations to accelerate decision making by connecting data,
people, and plans across the business. Unilever has identified a number of use cases for Anaplan and it is being
adopted in various parts of the business. While Anaplan is good for generating insights and has some nice
graphical capabilities, Power BI is the tool of choice for most of the reporting and visualization requirements across
the business. Interaction with Power BI reports is simpler and business users are comfortable using the tool. This page provides an example of how to extract data from Anaplan and copy it into the Azure environment to make it available for reporting.

Data integration scenarios

Anaplan supports a limited set of file types, mainly Excel and CSV. One of the reasons is that these file types can easily be broken into 'chunks' so that each chunk is less than 10 MB.

Copy batch data from Anaplan into Azure Data Lake Store
Copy data from Azure Data Lake Store to Anaplan

DATA INTEGRATION TOOLS

The requirement is for a tool that would allow data integration between Anaplan and Azure. This would include both
download and upload of data from Anaplan.

API Support

Anaplan supports APIs to export and import data. The latest API version needs an authentication token which can
be obtained by following the Authentication Service API documentation.
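
For reference, the sketch below shows how the authentication call looks in Python; it mirrors the 'Get Auth Token' web activity used in the pipelines later in this section. The credentials are placeholders and should come from Azure Key Vault rather than source code.

# Sketch: obtain an Anaplan authentication token for subsequent API calls.
import requests

AUTH_URL = "https://auth.anaplan.com/token/authenticate"

resp = requests.post(AUTH_URL, auth=("<anaplan-service-account>", "<password>"))
resp.raise_for_status()

token = resp.json()["tokenInfo"]["tokenValue"]

# The integration API expects the token in this header format.
headers = {"Authorization": f"AnaplanAuthToken {token}"}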

Mulesoft and Dell Boomi

Both tools are approved by EA for data movement operations. However, ADF is more scalable and allows code to be promoted easily across environments, so the proposal is to use ADF.

HyperConnect

HyperConnect is another out-of-the-box tool provided by Anaplan for data movement operations, but it requires an IaaS VM to operate. Again, it isn't easy to maintain software versions or to promote code, and IaaS comes with a number of other maintenance overheads, hence it isn't an optimal solution. HyperConnect doesn't have approval from Unilever EA.

Anaplan Connect

Anaplan Connect is yet another tool offered by Anaplan; it also requires a VM and hence is not being considered at this stage.

SCENARIO 1 - ANAPLAN TO DATA LAKE USING AZURE DATA FACTORY

The image below shows an architecture where data from Anaplan is stored in ADLS, processed by Azure Databricks, modelled using Azure SQL Data Warehouse and Analysis Services, and finally visualized with Power BI. Azure Data Factory can be used to extract data using the APIs provided by Anaplan. This article only focuses on ADF integration with the Anaplan APIs.


Azure Data Factory Pipelines

Create exports - Pipeline

Before any file can be downloaded, the exports need to run. We can use ADF pipelines to call API endpoints and
run the exports.

Here is the JSON code for creating this pipeline


"name": "Anaplan Exports",


"properties": {
"activities": [
{
"name": "Get Auth Token",
"type": "WebActivity",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"url": "https://auth.anaplan.com/token
/authenticate",
"method": "POST",
"headers": {
"Authorization": "Basic
dmlzaGFsLmd1cHRhQHVuaWxldmVyLmNvbTpIb2xpZGF5MTIz"
},
"body": "{name:test}"
}
},
{
"name": "List Exports",
"type": "WebActivity",
"dependsOn": [
{
"activity": "Get Auth Token",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"url": {
"value": "@concat(pipeline().parameters.
AnaplanBaseURL, '/exports')",
"type": "Expression"


},
"method": "GET",
"headers": {
"Authorization": {
"value": "@concat('AnaplanAuthToken ',
string(activity('Get Auth Token').output.tokenInfo.tokenValue))",
"type": "Expression"
},
"Content-Type": "application/json"
}
}
},
{
"name": "Iterate Exports",
"type": "ForEach",
"dependsOn": [
{
"activity": "List Exports",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "@activity('List Exports').output.
exports",
"type": "Expression"
},
"isSequential": false,
"activities": [
{
"name": "Create Export Task",
"type": "WebActivity",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"url": {
"value": "@concat(pipeline().
parameters.AnaplanBaseURL, '/exports/', item().id, '/tasks')",
"type": "Expression"
},


"method": "POST",
"headers": {
"Authorization": {
"value": "@concat
('AnaplanAuthToken ', string(activity('Get Auth Token').output.
tokenInfo.tokenValue))",
"type": "Expression"
},
"Content-Type": "application/json"
},
"body": {
"localeName": "en_GB"
}
}
}
]
}
}
],
"parameters": {
"AnaplanBaseURL": {
"type": "string",
"defaultValue": "https://api.anaplan.com/2/0/workspaces
/8a81b08e654f3cef0165a5bcd2935f29/models
/2AC252F199AB4C71B69AB49A807BAA15"
}
},
"annotations": []
},
"type": "Microsoft.DataFactory/factories/pipelines"
}

Download Files - Master pipeline


Here is the JSON code to generate this pipeline

{
"name": "Anaplan Download files",
"properties": {
"activities": [
{
"name": "Get Auth Token",
"type": "WebActivity",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"url": "https://auth.anaplan.com/token
/authenticate",
"method": "POST",
"headers": {
"Authorization": "Basic
dmlzaGFsLmd1cHRhQHVuaWxldmVyLmNvbTpIb2xpZGF5MTIz"
},
"body": "{name:test}"
}
},
{
"name": "List Files",
"type": "WebActivity",
"dependsOn": [
{
"activity": "Get Auth Token",
"dependencyConditions": [


"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"url": "https://api.anaplan.com/2/0/workspaces
/8a81b08e654f3cef0165a5bcd2935f29/models
/2AC252F199AB4C71B69AB49A807BAA15/files",
"method": "GET",
"headers": {
"Authorization": {
"value": "@concat('AnaplanAuthToken ',
string(activity('Get Auth Token').output.tokenInfo.tokenValue))",
"type": "Expression"
},
"Content-Type": "application/json"
}
}
},
{
"name": "Iterate File List",
"type": "ForEach",
"dependsOn": [
{
"activity": "List Files",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "@activity('List Files').output.files",
"type": "Expression"
},
"isSequential": false,
"activities": [
{
"name": "Check Chunk Counts",
"type": "IfCondition",
"dependsOn": [],


"userProperties": [],
"typeProperties": {
"expression": {
"value": "@greater(item().
chunkCount, 0)",
"type": "Expression"
},
"ifTrueActivities": [
{
"name": "Execute Copy Pipeline",
"type": "ExecutePipeline",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"pipeline": {
"referenceName":
"Anaplan Copy",
"type":
"PipelineReference"
},
"waitOnCompletion": true,
"parameters": {
"auth": {
"value": "@{concat
('Authorization : AnaplanAuthToken ', string(activity('Get Auth Token').
output.tokenInfo.tokenValue))}",
"type": "Expression"
},
"filename": {
"value": "@item().
name",
"type": "Expression"
},
"id": {
"value": "@item().
id",
"type": "Expression"
},
"chunkCount": {
"value": "@item().
chunkCount",
"type": "Expression"
}
}
}
}
]
}
}
]


}
}
],
"annotations": []
},
"type": "Microsoft.DataFactory/factories/pipelines"
}

Download Files - Child pipeline

Here is the JSON code for this pipeline

{
"name": "Anaplan Copy",
"properties": {
"activities": [
{
"name": "Download Each Chunk",
"type": "ForEach",


"dependsOn": [],
"userProperties": [],
"typeProperties": {
"items": {
"value": "@range(0,pipeline().parameters.
chunkCount)",
"type": "Expression"
},
"activities": [
{
"name": "Download chunk",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [
{
"name": "Source",
"value": "files/116000000023"
},
{
"name": "Destination",
"value": "root/tmp/AnaplanPoC
/somefile1.xls"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "HttpReadSettings",
"requestMethod": "GET",
"additionalHeaders": {
"value": "@{pipeline().
parameters.auth}",
"type": "Expression"
},
"requestTimeout": ""
},
"formatSettings": {
"type":
"DelimitedTextReadSettings"
}
},
"sink": {


"type": "DelimitedTextSink",
"storeSettings": {
"type":
"AzureDataLakeStoreWriteSettings"
},
"formatSettings": {
"type":
"DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ""
}
},
"enableStaging": false
},
"inputs": [
{
"referenceName": "DS_Anaplan_HTTP",
"type": "DatasetReference",
"parameters": {
"id": {
"value": "@pipeline().
parameters.id",
"type": "Expression"
},
"chunk": "0"
}
}
],
"outputs": [
{
"referenceName": "DS_AnaplanSink",
"type": "DatasetReference",
"parameters": {
"filename": {
"value": "@pipeline().
parameters.filename",
"type": "Expression"
},
"filepath": {
"value": "@concat('root/tmp
/AnaplanPoC/', formatDateTime(utcnow(), 'yyyyMMddhhmm'), '/')",
"type": "Expression"
}
}
}
]
}
]
}
}


],
"parameters": {
"auth": {
"type": "string",
"defaultValue": "Authorization : AnaplanAuthToken
zJIO0MiQw4KiiNMYK5In3A==.bcughHx6b7s5TpXl/dMzWfO+dLO1rPP
/T2oRI38UpZxBbCKbZcTvroQuKB5mQETGe93ZZyX6P
/7bAfAFxPN+Re9Q97+DWlNuqRvKvHlsnP0tA3fc446ZlRT+j86+E2ypBO2nkGdidk3LL
/reqzmBVRKHoko3mss+z3Ou5Z5IB3ZC+I/SWOcRJwBtp4rS3GAZ6NYe/T05qdIXxKM7Vzjx
/6DjSfbHFkAMzf7UGxElZkN9G6i7LyuoFUxe9nYxCRxrZcaC3s4puPPTA0/S0jgZeRTdsKjQ
/ewzs8hQC/mVe7QinjrElPjx3zNJf3Atb5Ntab3TTC0hQnTom
/s5spZaRbaHPPOA02mecFXbBMmOtWOxN9RzXB1HWcislEfcvcZmFT3yzSFDIwmcvsmSYnbI
/FRO8jzjGRkr9lWd0our+05ABLyRcvv2z60Y3JKZASpsLQixwR6/J0kLLC/kgZD1tQ==.
8wlQMs1+WJTrvzuIv0oAx+ZiU/JJ5vRVu52wHJyULhA="
},
"filename": {
"type": "string",
"defaultValue": "ICAT Export.xlsx"
},
"id": {
"type": "string",
"defaultValue": "116000000055"
},
"chunkCount": {
"type": "int",
"defaultValue": 1
}
},
"annotations": []
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
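
For testing outside ADF, a rough Python equivalent of the master and child download pipelines is sketched below. The workspace/model IDs, token and output path are placeholders, and the chunk endpoint path is an assumption based on the Anaplan API references listed in the appendix.

# Sketch: list the model's files and download each file chunk by chunk.
import requests

BASE = "https://api.anaplan.com/2/0/workspaces/<workspace-id>/models/<model-id>"
headers = {"Authorization": "AnaplanAuthToken <token>"}

# Equivalent of the "List Files" web activity.
files = requests.get(f"{BASE}/files", headers=headers).json()["files"]

for f in files:
    if f.get("chunkCount", 0) <= 0:
        continue  # same guard as the "Check Chunk Counts" If Condition
    with open(f"/tmp/{f['name']}", "wb") as out:
        # Equivalent of the "Download chunk" copy activity, one chunk at a time.
        for chunk_id in range(f["chunkCount"]):
            chunk = requests.get(f"{BASE}/files/{f['id']}/chunks/{chunk_id}", headers=headers)
            chunk.raise_for_status()
            out.write(chunk.content)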

Linked Services

The following linked services are required to connect to the source and destination.

Linked service to connect to Anaplan workspace

A service account can be used for this connection. For this example, I have used my account on Anaplan. Create a
HTTP Linked Service as below with basic authentication (password for basic authentication can be stored within
Azure Key Vault). See the documentation to use a certificate for authentication.


Here is the JSON code for the linked service:

{
"name": "LS_Anaplan",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"annotations": [],
"type": "HttpServer",
"typeProperties": {
"url": "https://api.anaplan.com/2/0/workspaces
/8a81b08e654f3cef0165a5bcd2935f29/models
/2AC252F199AB4C71B69AB49A807BAA15/",
"enableServerCertificateValidation": false,
"authenticationType": "Basic",
"userName": "[email protected]",
"encryptedCredential": "********"
}
}
}

Linked service to connect to Azure Data Lake Store


Here is the JSON code for the linked service:

{
"name": "LS_ADLS_DataLake",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {


"annotations": [],
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeStoreUri": "adl://*********.azuredatalakestore.
net",
"servicePrincipalId": "*******-****-****-****-************",
"servicePrincipalKey": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "LS_KYVA_KeyVault",
"type": "LinkedServiceReference"
},
"secretName": "SPN-PocDevApp-Cloud-dev"
},
"tenant": "*******-****-****-****-************",
"subscriptionId": "*******-****-****-****-************",
"resourceGroupName": "*****************"
}
}
}

Datasets

Source Dataset

Create an HTTP-based dataset for your source connection. Since all the files are going to be delimited text files, the dataset would look something like the following image.

It accepts 2 parameters as follows


Sink Dataset

You would also require a dataset to save the files to the data lake store. It should be an ADLS-based dataset (I've used ADLS Gen1 in the example) and the file type should be Delimited Text. It also accepts two parameters.


It accepts 2 parameters as follows

SCENARIO 2 - DATA LAKE TO ANAPLAN USING AZURE DATA FACTORY & ANAPLAN API

This requirement hasn't been prioritised and other options are suggested by I&A Tech as a workaround. Still, if this becomes a real requirement, this section will be updated.

APPENDIX

References

Anaplan API Guide and Reference


Anaplan API v2.0
Anaplan Authentication Token
Anaplan API reference for Export
Anaplan RESTful API Best Practices


ADF Job Triggers

ADF pipelines are executed based on the occurrence of a trigger. There are three types of triggers, as listed below, and all are approved for usage.

Schedule trigger runs pipelines on a wall-clock schedule. This trigger supports periodic and advanced
calendar options.
Tumbling window triggers are a type of trigger that fires at a periodic time interval from a specified start time,
while retaining state. Tumbling windows are a series of fixed-sized, non-overlapping, and contiguous time
intervals.
Event-based trigger runs pipelines in response to an event, such as the arrival of a file, or the deletion of a
file, in Azure Blob Storage.

Some applications use a Logic App to sense a blob event and then trigger an ADF pipeline from the Logic App action. It is recommended that such pipelines are moved to event-based triggers.

If you have ADF pipelines that require event-based triggering, the architectures below can be used.

EVENT BASED JOB TRIGGER PATTERNS

Pattern 1: Blob Event Based Trigger

This scenario is applicable in three cases (a sketch of the trigger definition follows this list):

When the data from an external system is landing in Blob storage.

When the data needs to be pulled from the source system; the source system will place a blank file in Blob storage.

When multiple files need to be processed from the same container; the source system will place a blank file after placing all required files in Blob storage.
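
The sketch below shows how such a blob event trigger could be defined programmatically; it assumes the azure-mgmt-datafactory SDK, and the storage scope, paths and pipeline name are placeholders (model names may vary between SDK versions). The same trigger is normally authored directly in the ADF portal.

# Sketch: fire a pipeline when the blank "all files placed" flag file arrives in Blob storage.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobEventsTrigger,
    PipelineReference,
    TriggerPipelineReference,
    TriggerResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

trigger = BlobEventsTrigger(
    events=["Microsoft.Storage.BlobCreated"],
    blob_path_begins_with="/landing/blobs/source-system/",
    blob_path_ends_with="_complete.flag",   # the blank marker file from the source system
    ignore_empty_blobs=False,               # the flag file is expected to be empty
    scope=("/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
           "/providers/Microsoft.Storage/storageAccounts/<storage-account>"),
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(type="PipelineReference",
                                             reference_name="PL_TRAN_PROCESS_SOURCE_FILES"))],
)

client.triggers.create_or_update(
    "<resource-group>", "<data-factory-name>", "TRG_BlobCreated_SourceSystem",
    TriggerResource(properties=trigger),
)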


Pattern 2: Get Metadata Job Triggers for Source systems which are file based

Schedule a Get Metadata pipeline as the first pipeline to verify the existence of the data in the source system. When the source file is available, the Get Metadata activity will trigger the child pipeline. Run the Get Metadata job frequently to avoid any delay in triggering the actual job.

Get Metadata can be used for on-premises file systems, Amazon S3, Google Cloud Storage, Azure Blob storage, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure Files, file systems, SFTP and FTP.

When using the Get Metadata activity against a folder, make sure you have LIST/EXECUTE permission on the given folder.

Pattern 3: Hybrid architecture where on prem and cloud systems are interacting

Event-based triggers work as follows for an AAS refresh, where on-premises SSIS writes a dummy file to Blob storage to mark completion of the data load.


JOB DEPENDENCY

Job dependency can be handled using the approaches below; the first approach is being followed today for the UDL.

Poll the metadata table's entries, which record job completion, and trigger downstream pipelines accordingly.

Use a Service Bus approach to notify the completion of a job; consumers have to subscribe to the Service Bus.

ADLS GEN 2 TRIGGERS

Azure Data Lake Storage (ADLS) Gen2 can publish events to Azure Event Grid to be processed by subscribers
such as WebHooks, Azure Event Hubs, Azure Functions and Logic Apps. With this capability, individual changes to
files and directories in ADLS Gen2 can automatically be captured and made available to data engineers for creating
rich big data analytics platforms that use event-driven architectures.


ADF Data Flow


Overview
Supported source connectors in mapping data flow
Transformations available in Data Flow

OVERVIEW

Mapping data flows are visually designed data transformations in Azure Data Factory. Data flows allow data
engineers to develop graphical data transformation logic without writing code. The resulting data flows are executed
as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters. Data flow activities can
be engaged via existing Data Factory scheduling, control, flow, and monitoring capabilities.

Mapping data flows provide an entirely visual experience with no coding required. Your data flows run on your
execution cluster for scaled-out data processing. Azure Data Factory handles all the code translation, path
optimization, and execution of your data flow jobs.

SUPPORTED SOURCE CONNECTORS IN MAPPING DATA FLOW

Mapping Data Flow follows an extract, load, transform (ELT) approach and works with staging datasets that are all
in Azure. Currently the following datasets can be used in a source transformation:

Azure Blob Storage (JSON, Avro, Text, Parquet)


Azure Data Lake Storage Gen1 (JSON, Avro, Text, Parquet)
Azure Data Lake Storage Gen2 (JSON, Avro, Text, Parquet)
Azure Synapse Analytics
Azure SQL Database
Azure CosmosDB

Settings specific to these connectors are located in the Source options tab.

Azure Data Factory has access to over 90 native connectors. To include data from those other sources in your data
flow, use the Copy Activity to load that data into one of the supported staging areas.

TRANSFORMATIONS AVAILABLE IN DATA FLOW

Below is a list of the transformations currently supported in mapping data flow. See the Microsoft documentation linked at the end of this section for each transformation's configuration details.

Name | Category | Description

Aggregate | Schema modifier | Define different types of aggregations such as SUM, MIN, MAX, and COUNT grouped by existing or computed columns.
Alter row | Row modifier | Set insert, delete, update, and upsert policies on rows.
Conditional split | Multiple inputs/outputs | Route rows of data to different streams based on matching conditions.
Derived column | Schema modifier | Generate new columns or modify existing fields using the data flow expression language.
Exists | Multiple inputs/outputs | Check whether your data exists in another source or stream.
Filter | Row modifier | Filter a row based upon a condition.
Flatten | Schema modifier | Take array values inside hierarchical structures such as JSON and unroll them into individual rows.
Join | Multiple inputs/outputs | Combine data from two sources or streams.
Lookup | Multiple inputs/outputs | Reference data from another source.
New branch | Multiple inputs/outputs | Apply multiple sets of operations and transformations against the same data stream.
Pivot | Schema modifier | An aggregation where one or more grouping columns has its distinct row values transformed into individual columns.
Select | Schema modifier | Alias columns and stream names, and drop or reorder columns.
Sink | Sink | A final destination for your data.
Sort | Row modifier | Sort incoming rows on the current data stream.
Source | Source | A data source for the data flow.
Surrogate key | Schema modifier | Add an incrementing non-business arbitrary key value.
Union | Multiple inputs/outputs | Combine multiple data streams vertically.
Unpivot | Schema modifier | Pivot columns into row values.
Window | Schema modifier | Define window-based aggregations of columns in your data streams.

For more details read Microsoft documentation

https://docs.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview


Section 2.3 - Azure BLOB Storage


Approved use of BLOB storage
Types of Blobs
BLOB Storage - Key Capabilities
Redundancy and Location
Storage Scenarios
Best Practices for Blob Storage
Grant limited access to Azure Blob using shared access signatures (SAS)
Types of shared access signatures
Best practices when using SAS

Azure BLOB Storage is a service that stores unstructured data in the cloud as objects/blobs. Blob storage can store
any type of text or binary data, such as a document, media file, or application installer. Blob storage is also referred
to as object storage.

As per I&A Tech guidelines, BLOB storage should only be used as a transitory storage method, in the scenario where externally hosted source systems are unable to support the 'pull' of data. Data will be "pushed" from the source into a blob store. This ensures access to ADLS is not shared with these 3rd-party data providers.

Containers should be created based on project requirements and data sensitivity.

Approved use of BLOB storage

I&A Tech Architecture approves BLOB for the following:

Store and serve unstructured data


App and Web scale data
Backups and Archive
Big Data from Internet of Things (IoT), etc.
Store log information
ADF monitoring logs are usually stored on BLOB Storage
FMT
File Management Tool (FMT) makes use of BLOB store to upload data

Types of Blobs

Block Blobs: Block blobs can store binary media files, documents, or text. A single block blob can comprise up to 50,000 blocks of up to 100 MB each, so the total blob size can reach approximately 4.75 TB. Block blobs suit most object storage scenarios.

Append Blobs: Append blobs are optimized for append operations, such as logging scenarios. The difference between append blobs and block blobs is the storage capacity: each block in an append blob can store only up to 4 MB of data, and as a result append blobs are limited to a total size of about 195 GB.

Page Blobs: Page blobs have a storage capacity of about 8 TB, which makes them useful for heavy read and write scenarios. There are two different page blob categories, Premium and Standard. Standard blobs are used for average Virtual Machine (VM) read/write operations; Premium is used for intensive VM operations. Page blobs are useful for all Azure VM storage disks, including the operating system disk, and are used for random reads and writes.
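
As a short illustration of the append blob pattern described above (assuming the azure-storage-blob v12 Python SDK; the connection string and names are placeholders):

# Sketch: use an append blob for a simple logging scenario.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
log_blob = service.get_blob_client(container="logs", blob="adf/pipeline-runs.log")

# Create the append blob once, then append log lines as they arrive. Each appended
# block can be up to 4 MB, and the blob as a whole tops out at roughly 195 GB.
if not log_blob.exists():
    log_blob.create_append_blob()

log_blob.append_block(b"2020-08-05T10:15:00Z PL_TRAN_PROCESS_DIMENSIONS Succeeded\n")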

BLOB Storage - Key Capabilities

Strong consistency
Multiple Redundancy types
Tiered storage – Hot & Cool


Broad platform and language support

Redundancy and Location

LRS: 3 Copies, 1 Datacenter
ZRS: 6 Copies, Same or 2 Separate Regions
GRS: 6 Copies, 2 Separate Regions

GRS – Regions could be Hundreds/Thousands of miles/kilometers away from each other

Storage Scenarios

Blob Storage is used as the default storage solution for a wide range of Azure services

Best Practices for Blob Storage

The list below reviews the essential best practices for controlling and maintaining Blob storage costs and availability.

1. Define the Type of Content - When you upload files to blob storage, all files are usually stored as application/octet-stream by default. The problem is that most browsers start to download this type of file instead of showing it. This is why you have to change the content type when uploading videos or images (see the sketch after this list). To change the content type, you have to parse each file and update the properties of that file.
2. Define the Cache-Control Header - The HTTP cache-control header allows you to improve availability. In
addition, the header decreases the number of transactions made in each storage control. For example, a
cache-control header in a static website hosted on Azure blob storage can decrease the server traffic loads
by placing the cache on the client-side
3. Parallel Uploads and Downloads - Uploading large volumes of data to blob storage is time-consuming and
affects the performance of an application. Parallel uploads can improve the upload speed in both Block blobs and Page blobs. For example, an upload of 70GB can take approximately 1,700 hours. However, a parallel
upload can reduce the time to just 8 hours.
4. Choose the Right Blob Type - Each blob type has its own characteristics. You have to choose the most
suitable type for your needs. Block blobs are suitable for streaming content. You can easily render the blocks
for streaming solutions. Make sure to use parallel uploads for large blocks. Page blobs enable you to read
and write to a particular blob part. As a result, all other files are not affected.
5. Improve Availability and Caching with Snapshots - Blob snapshots increase the availability of Azure
storage by caching the data. Snapshots allow you to have a backup copy of the blob without paying extra.
You can increase the availability of the entire system by creating several snapshots of the same blob and
serving them to customers. Assign snapshots as the default blob for reading operations and leave the
original blob for writing.
6. Enable a Content Delivery Network (CDN) - A content delivery network is a network of servers that can
improve availability and reduce latency by caching content on servers that are close to end-users. When
using CDNs for Blob storage, the network places a blob duplicate closer to the client. Accordingly, each
client is redirected to the closest CDN node of blobs.
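
The sketch below pulls together practices 1 to 3 above, assuming the azure-storage-blob v12 Python SDK; the account, container and file names are placeholders.

# Sketch: upload a media file with an explicit content type, a cache-control header
# and parallel block transfers.
from azure.storage.blob import BlobServiceClient, ContentSettings

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="media", blob="videos/product-demo.mp4")

with open("product-demo.mp4", "rb") as data:
    blob.upload_blob(
        data,
        overwrite=True,
        max_concurrency=8,  # parallel upload of blocks (practice 3)
        content_settings=ContentSettings(
            content_type="video/mp4",               # practice 1: correct content type
            cache_control="public, max-age=86400",  # practice 2: cache-control header
        ),
    )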

Grant limited access to Azure Blob using shared access signatures (SAS)

A shared access signature (SAS) provides secure delegated access to resources in your storage account without
compromising the security of your data. With a SAS, you have granular control over how a client can access your
data. You can control what resources the client may access, what permissions they have on those resources, and
how long the SAS is valid, among other parameters. Unilever internal and external applications can connect to Blob
storage using a SAS token.

TYPES OF SHARED ACCESS SIGNATURES

Azure Storage supports three types of shared access signatures:

User delegation SAS: A user delegation SAS is secured with Azure Active Directory (Azure AD) credentials
and also by the permissions specified for the SAS. A user delegation SAS applies to Blob storage only.
Service SAS: A service SAS is secured with the storage account key. A service SAS delegates access to a
resource in only one of the Azure Storage services: Blob storage, Queue storage, Table storage, or Azure
Files.
Account SAS: An account SAS is secured with the storage account key. An account SAS delegates access
to resources in one or more of the storage services. All of the operations available via a service or user
delegation SAS are also available via an account SAS. Additionally, with the account SAS, you can delegate
access to operations that apply at the level of the service, such as Get/Set Service Properties and Get
Service Stats operations. You can also delegate access to read, write, and delete operations on blob
containers, tables, queues, and file shares that are not permitted with a service SAS.

BEST PRACTICES WHEN USING SAS

The following are recommendations for using shared access signatures (a sketch follows at the end of this list):

Always use HTTPS to create or distribute a SAS. If a SAS is passed over HTTP and intercepted, an attacker
performing a man-in-the-middle attack is able to read the SAS and then use it just as the intended user could
have, potentially compromising sensitive data or allowing for data corruption by the malicious user.
Use a user delegation SAS when possible. A user delegation SAS provides superior security to a service
SAS or an account SAS. A user delegation SAS is secured with Azure AD credentials, so that you do not
need to store your account key with your code.
Use near-term expiration times on an ad hoc SAS (a service SAS or account SAS). In this way, even if a SAS is
compromised, it's valid only for a short time. This practice is especially important if you cannot reference a
stored access policy. Near-term expiration times also limit the amount of data that can be written to a blob by
limiting the time available to upload to it.
Be careful with SAS start time. If you set the start time for a SAS to now, then due to clock skew (differences
in current time according to different machines), failures may be observed intermittently for the first few minutes. In general, set the start time to be at least 15 minutes in the past. Or, don't set it at all, which will
make it valid immediately in all cases. The same generally applies to expiry time as well--remember that you
may observe up to 15 minutes of clock skew in either direction on any request. For clients using a REST
version prior to 2012-02-12, the maximum duration for a SAS that does not reference a stored access policy
is 1 hour, and any policies specifying longer term than that will fail.
Be specific with the resource to be accessed. A security best practice is to provide a user with the minimum
required privileges. If a user only needs read access to a single entity, then grant them read access to that
single entity, and not read/write/delete access to all entities. This also helps lessen the damage if a SAS is
compromised because the SAS has less power in the hands of an attacker.
Understand that your account will be billed for any usage, including via a SAS. If you provide write access to
a blob, a user may choose to upload a 200 GB blob. If you've given them read access as well, they may
choose to download it 10 times, incurring 2 TB in egress costs for you. Again, provide limited permissions to
help mitigate the potential actions of malicious users. Use short-lived SAS to reduce this threat.
Validate data written using a SAS. When a client application writes data to your storage account, keep in
mind that there can be problems with that data. If your application requires that data be validated or
authorized before it is ready to use, you should perform this validation after the data is written and before it is
used by your application. This practice also protects against corrupt or malicious data being written to your
account, either by a user who properly acquired the SAS, or by a user exploiting a leaked SAS.

If a SAS is leaked, it can be used by anyone who obtains it, which can potentially compromise your
storage account.
If a SAS provided to a client application expires and the application is unable to retrieve a new SAS
from your service, then the application's functionality may be hindered.


Section 2.4 - Azure Data Lake Storage (ADLS)


Azure Data Lake Storage (ADLS) is where data is physically stored (persisted) within the UDL.

Azure Data Lake Store is a cloud implementation of the Hadoop Distributed File System (HDFS), meaning that
files are split up and distributed across an array of cheap storage. Each file you place into the store is split into
250MB chunks called extents. This enables parallel read and write. For availability and reliability, extents are
replicated into three copies. As files are split into extents, bigger files have more opportunities for parallelism than
smaller files. If you have a file smaller than 250MB it is going to be allocated to one extent and one vertex (which is
the work load presented to the Azure Data Lake Analytics), whereas a larger file will be split up across many
extents and can be accessed by many vertexes.

The format of the file has a huge implication for the storage and parallelisation. Splittable formats – files which are
row oriented, such as CSV – are parallelizable as data does not span extents. Non-splittable formats – files that
are not row oriented, where data is often delivered in blocks, such as XML or JSON – cannot be parallelized,
as data spans extents and can only be processed by a single vertex.

In addition to the storage of unstructured data, Azure Data Lake Store also stores structured data in the form of row-
oriented, distributed clustered index storage, which can also be partitioned. The data itself is held within the
“Catalog” folder of the data lake store, but the metadata is contained in the data lake analytics. For many, working
with the structured data in the data lake is very similar to working with SQL databases.

ADLS is the primary storage component for both the UDL and the BDL’s.

Recommended practices:

1. Data in ADLS MUST be in .csv or parquet format.


2. For .csv, use "|" as the delimiter and double quotes (") as the text qualifier (see the sketch after this list).
3. It is recommended to use .parquet wherever possible.
4. For data science workloads, do not use new tools without approval from EA and/or TDA.
5. For ADLS access related best practices refer to Section 4.1 - Environment and data access management.
6. Access to ADLS MUST be provided through user AD groups and SPNs added into base data access AD
groups. Direct access on folders is discouraged.
7. Product MUST use SPN to access data from ADLS.
8. Access to data scientists WILL be provided using user’s Unilever credentials.
9. MFA is must for all access to ADLS.
10. If ADLS is the primary storage of the data, make sure to enable RA-GRS or a backup on the data.
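
To make the file-format guidance above concrete, here is a hedged sketch using pandas and pyarrow (the local file names stand in for ADLS locations, which would normally be written via Databricks or ADF): parquet is written where possible, and where CSV is unavoidable it uses "|" as the delimiter with double-quote text qualifiers.

import csv
import pandas as pd

df = pd.DataFrame(
    {"MaterialCode": ["M001", "M002"], "Description": ["Shampoo 200ml", "Soap | 100g"]}
)

# Preferred: parquet (columnar, compressed, parallel-friendly)
df.to_parquet("sales_extract.parquet", engine="pyarrow", index=False)

# Where CSV is required: pipe delimiter with double-quote text qualifier
df.to_csv(
    "sales_extract.csv",
    sep="|",
    quotechar='"',
    quoting=csv.QUOTE_ALL,   # qualify values so embedded pipes stay safe
    index=False,
)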

Security Considerations :

Azure Data Lake Storage Gen2 offers POSIX access controls for Azure Active Directory (Azure AD) users, groups,
and service principals. These access controls can be set to existing files and directories. The access controls can
also be used to create default permissions that can be automatically applied to new files or directories.

When working with big data in Data Lake Storage Gen2, it is likely that a service principal is used to allow services
such as Azure Databricks or ADF to work with the data. However, there are cases where individual users need
access to the data as well. In all cases, consider using Azure Active Directory security groups instead of assigning
individual users to directories and files.

Once a security group is assigned permissions, adding or removing users from the group doesn’t require any
updates to Data Lake Storage Gen2. This also helps ensure that you do not exceed the maximum number of access
control entries per access control list (ACL). Currently, that number is 32. Each directory can have two types of ACL,
the access ACL and the default ACL, for a total of 64 access control entries.

For more information about these ACLs, see Access control in Azure Data Lake Storage Gen2.
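
As a hedged sketch of the group-based approach above, the following grants an Azure AD security group (by object ID) read/execute on a Gen2 directory and sets a matching default ACL so that new children inherit it. It assumes the azure-storage-file-datalake SDK and an SPN; the account, file system, directory and IDs are placeholders.

from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

credential = ClientSecretCredential(
    tenant_id="<tenant-id>", client_id="<spn-app-id>", client_secret="<spn-secret>"
)
service = DataLakeServiceClient(
    account_url="https://myadlsgen2.dfs.core.windows.net", credential=credential
)

directory = service.get_file_system_client("bdl").get_directory_client("curated/sales")

group_object_id = "<aad-group-object-id>"   # base data access AD group

# Access ACL for this directory plus a default ACL inherited by new files/directories
acl = (
    "user::rwx,group::r-x,other::---,"
    f"group:{group_object_id}:r-x,"
    f"default:group:{group_object_id}:r-x"
)
directory.set_access_control(acl=acl)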


Data Lake Storage Gen2 supports the option of turning on a firewall and limiting access only to Azure services,
which is recommended to limit the vector of external attacks. Firewall can be enabled on a storage account in the
Azure portal via the Firewall > Enable Firewall (ON) > Allow access to Azure services options. This is
suggested mostly when the consumers of the applications are Azure services.

High Availability and Disaster Recovery:

ADLS keeps three copies of the data in the same region in order to protect against hardware failures. These copies
are managed internally by Microsoft. In case of a hardware failure within a Microsoft data center, Microsoft points to
one of the working copies of the data; the customer has no way to identify or access the different copies of the data.

Apart from the above, Gen2 also comes with other replication options: ZRS and GZRS (preview) improve HA, while
GRS and RA-GRS improve DR. For data resiliency with Data Lake Storage Gen2, it is recommended to geo-replicate
the data using GRS or RA-GRS; RA-GRS additionally makes the secondary copy readable. With geo-replication,
Microsoft manages the block-level replication of data internally, without the customer having to manage it. There can
be a delay before data becomes available in the secondary/paired region; as claimed by Microsoft, this delay is not
more than 15 minutes.

In order to avoid data corruption or accidental deletes, it is suggested to take snapshots of the data periodically.
Currently Gen2 doesn't provide a snapshot feature, but it is on the roadmap. If projects want to keep snapshots, it
is suggested to take manual backups of the data periodically into a different ADLS location in the same region.

Some of the best practices to avoid accidental deletion are:

Remove the delete access to Users on the Production System.


Keep notifications for all deletes. Deletes not performed through the SPN should be flagged immediately to the
platform owner/responsible team. Microsoft maintains the deleted data in trash until the garbage collector
removes it. If the platform team learns about the delete quickly, there is a high chance of getting the data back
by raising a critical Microsoft ticket immediately.
Provide only read access to consumer application SPN’s.

Some of the features which are on the roadmap, and which Unilever is actively working with Microsoft to prioritize, are:

File level Snapshots


Versioning of data and restoring a version
Soft Deletes


Section 2.5 - Azure Analysis Services


What is Azure Analysis Service
Analysis Services in Tabular Mode
Design Principles
VertiPaq Architecture
Administrative Security And Data Security
On-Premises Gateways

What is Azure Analysis Service

Azure Analysis Services is a fully managed platform as a service (PaaS) that provides enterprise-grade data
models in the cloud. Use advanced mashup and modeling features to combine data from multiple data sources,
define metrics, and secure your data in a single, trusted tabular semantic data model. The data model provides an
easier and faster way for users to browse massive amounts of data for ad hoc data analysis.

Azure Analysis Services is compatible with many great features already in SQL Server Analysis Services Enterprise
Edition. Azure Analysis Services supports tabular models at the 1200 and higher compatibility levels. Tabular
models are relational modeling constructs (model, tables, columns), articulated in tabular metadata object
definitions in Tabular Model Scripting Language (TMSL) and Tabular Object Model (TOM) code. Partitions,
perspectives, row-level security, bi-directional relationships, and translations are all supported*. Multidimensional
models and PowerPivot for SharePoint are not supported in Azure Analysis Services.

Tabular models in both in-memory and DirectQuery modes are supported. In-memory mode (default) tabular
models support multiple data sources. Because model data is highly compressed and cached in-memory, this mode
provides the fastest query response over large amounts of data. It also provides the greatest flexibility for complex
datasets and queries. Partitioning enables incremental loads, increases parallelization, and reduces memory
consumption. Other advanced data modeling features like calculated tables, and all DAX functions are supported.
In-memory models must be refreshed (processed) to update cached data from data sources. With Azure service
principal support, unattended refresh operations using PowerShell, TOM, TMSL and REST offer flexibility in making
sure your model data is always up to date.

DirectQuery mode* leverages the backend relational database for storage and query execution. Extremely large
data sets in single SQL Server, SQL Server Data Warehouse, Azure SQL Database, Azure Synapse Analytics
(SQL Data Warehouse), Oracle, and Teradata data sources are supported. Backend data sets can exceed
available server resource memory. Complex data model refresh scenarios aren't needed. There are also some
restrictions, such as limited data source types, DAX formula limitations, and some advanced data modeling features
aren't supported. Before determining the best mode for you, see Direct Query mode.

Analysis Services in Tabular Mode

Tabular models consist of Tables linked together by Relationships. It works best when your data is modelled
according to the concepts of dimensional modelling

It provides a semantic layer that sits between your data and your users and gives them:

The ability to query the data without knowing a query language like SQL
Fast query performance
Shared business logic – how the data is joined and aggregated, calculations, security – as well as just
shared data

Every table in AAS can be split up into multiple partitions. Usually this is applied for large tables with millions of
rows.

DAX is the query and calculation language of Tabular models. There are six places that DAX can be used when
designing a model:


1. Calculated columns
2. Calculated tables
3. Calculation groups
4. Measures
5. Detail Rows expressions
6. Security

Design Principles

Performance, performance, performance


Query performance balanced with processing performance
Accommodation of changes without forcing a reload of data
Minimal configuration settings

VertiPaq Architecture

Tabular models in Analysis Services are databases that run in-memory or in DirectQuery mode, connecting to data
directly from back-end relational data sources. By using state-of-the-art compression algorithms and multi-threaded
query processor, the Analysis Services Vertipaq analytics engine delivers fast access to tabular model objects and
data by reporting client applications like Power BI and Excel.

Column-based data store


Separate data structures per column
Great for common analytical queries
Optimized for retrieving data from a subset of columns
Large number of similar items in a column results in better compression
Better compression results in faster query performance

1. Table data stored in segments and dictionaries per column


2. Calculated columns are stored like regular columns
3. Hierarchies can provide quicker access for querying
4. Relationship structures are created to accelerate lookups across tables, remove hard coupling between
tables
5. Partitions are a group of segments and defined for data management
6. Any table can have partitions, defined independently of other tables. Partitions are only for segment data.

Administrative Security And Data Security


There are two types of permission in AAS:


Administrative permissions, which control whether a user can create, process or delete objects
Data permissions, which control which data a user can see in a model
Server administrators can be created
When creating your AAS instance in the Azure Portal
In the Analysis Services Admins pane in the Azure Portal
In SQL Server Management Studio by right-clicking on the instance name and clicking Properties
Database-level administrators are created in roles in the database itself

On-Premises Gateways

For Azure AS to connect to on-premises data sources you need to install an on-premises gateway. This is the same
gateway used by Power BI, Flow, LogicApps and PowerApps. Azure AS can only use gateways configured for the
same Azure region for performance reasons

Install the gateway as close as possible to the data source for the best performance
Gateways can be clustered for high availability
It provides extensive logging options for troubleshooting


Azure Analysis Services - Best Practices


Best Practices - General
1. Pause the service when not in use
2. Right-size Azure Analysis Services instances
3. Choose right Tier/Performance Level for AAS
How Much Memory Do You Need?
How Many QPUs Do You Need?
4. How to get Usage stats
5. Sizing For DirectQuery
6. One Big Model Or Many Small Models?
Development tools
1. Visual Studio for AAS
2. Developing in Tabular Editor
3. Other Third Party Tools
Best Practices - Security
1. Configure the Firewall
2. Azure Analysis Services Roles
3. Row Level Security: High Level
4. Object-Level Security
Best Practices - Data Model Optimization
1. Aim for Star Schemas
2. Optimize Dimensions
3. Optimizing Facts
4. Be careful with bi-directional relationships

Best Practices - General

These guidelines apply to all Azure Analysis Services instances in all environments across Unilever’s Azure
subscriptions. These are in place to ensure that business value is delivered and we use the service optimally at the
same time. These measures allow us to save costs.

1. PAUSE THE SERVICE WHEN NOT IN USE

Azure Analysis Services is one of the most expensive components on our Azure stack, and pausing it when not in
use can save a massive amount of cost.
The ADF instance in PDS environments comes with ready-made pipelines that can be used for suspending the
service. These pipelines can be scheduled or triggered manually.

Please note that regular audits are in place to ascertain that the service is paused by projects.
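
If a project ever needs to pause the server outside the provided ADF pipelines (for example from an Automation runbook or a scheduled job), a minimal sketch against the ARM REST API could look like the following. The subscription, resource group and server names are placeholders, and the api-version shown is an assumption; azure-identity and requests are used for authentication and the HTTP call.

import requests
from azure.identity import DefaultAzureCredential

subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
server_name = "<aas-server-name>"

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}/providers/Microsoft.AnalysisServices"
    f"/servers/{server_name}/suspend?api-version=2017-08-01"
)

# ARM accepts the suspend request asynchronously (202 Accepted)
response = requests.post(url, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()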

2. RIGHT-SIZE AZURE ANALYSIS SERVICES INSTANCES

Make sure that the Analysis Services instance is appropriately sized. If the service tier is too low for your project
requirements the refresh might fail, especially if many users are accessing the cube at the time of refresh. The ADF
for every PDS project will have ready-made pipelines for scaling the service up and down. If necessary, use these
pipelines to scale up the service before a refresh and scale it back down afterwards to save costs.

If the service tier is too high then a lot of capacity will be wasted and the project will incur huge costs.
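
Along the same lines, a hedged sketch of scaling the server tier before a heavy refresh (and back down afterwards) by PATCHing the SKU through ARM; the SKU name "S1" is illustrative only, and the same placeholders and api-version assumption as the suspend sketch apply.

import requests
from azure.identity import DefaultAzureCredential

subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
server_name = "<aas-server-name>"

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}/providers/Microsoft.AnalysisServices"
    f"/servers/{server_name}?api-version=2017-08-01"
)

response = requests.patch(
    url,
    headers={"Authorization": f"Bearer {token}"},
    json={"sku": {"name": "S1", "tier": "Standard"}},   # target SKU is an assumption
)
response.raise_for_status()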

3. CHOOSE RIGHT TIER/PERFORMANCE LEVEL FOR AAS

Azure Analysis Services servers come in three tiers. There are multiple performance levels in each tier

Developer tier is intended for dev use but licence allows for use in production


Runs on shared hardware so no performance guarantees


Has all features and functionality
Basic tier lacks some features: perspectives, partitioning, DirectQuery
The equivalent of Standard Edition on-prem
Standard tier has all features
The equivalent of Enterprise Edition on-prem

How Much Memory Do You Need?

It depends on the following factors:

How big is your model? Not easy to determine until you deploy
How much will your model grow over time?
There are lots of well-known tricks for reducing AS Tabular memory usage
A full process may mean memory usage almost doubles – but do you need to do a full process?
Processing individual partitions/tables will use less memory
Some unoptimized queries/calculations may result in large memory spikes
Caching takes place for some types of data returned by a query

How Many QPUs Do You Need?

QPU = Query Processing Unit. It is a relative unit of computing power for querying and processing. As an
illustration, 20 QPUs is roughly equal to 1 pretty fast core. Also note that a server with 200 QPUs will be 2x faster
than one with 100 QPUs.

Deciding on how many QPUs you will need for processing depends on the following:

How often do you need to process? What type of processing will you be doing? What will you be processing?
Not easy to know until later in the development cycle
The more QPUs, the more you can process in parallel
Tables in a model can be processed in parallel
Partitions in a table can be processed in parallel
Many processing operations such as compression are parallelised
Always better to plan so you do not process partitions containing data that has not changed!

Deciding on how many QPUs you will need for querying depends on the following:

How many concurrent users will you have? What types of query will they run? Not easy to determine until
you go into Production
There are two parts of SSAS that deal with queries
All queries start off in the Formula Engine
This is single threaded
More QPUs = more concurrent users running queries
But even then data access might become a bottleneck
The Storage Engine reads and aggregates data
Parallelism is only possible on large tables of > 16 million rows

4. HOW TO GET USAGE STATS

Log into Azure portal


Look for the resource group
Click on AAS component
On the left pane click on metrics
Add the metrics Memory Limit High and Memory Usage, choose the aggregation type; dates can be customized


The resulting chart gives the usage details; the dotted (blue) lines indicate the paused state

5. SIZING FOR DIRECTQUERY

In DirectQuery mode there is no processing needed – the data stays in the source database
DirectQuery needs Standard tier
At query time the Formula Engine is used but there is no Storage Engine activity – queries are sent back to
the data source
Therefore:
You need the bare minimum of memory
QPUs are still important because a lot of activity still takes place in the Formula Engine

6. ONE BIG MODEL OR MANY SMALL MODELS?

An AAS database or Power BI dataset should contain all the data you need for your report
If you need to get data from multiple databases/datasets, you’re in trouble
Until composite models for Live connections arrive, but even then…
Advice: put all your data into one database/dataset until you have a good reason not to
Reasons to create multiple small databases/datasets include easier scale up/out, easier security, easier
development

Development tools

There are quite a few tools that may be used to build and manage a Tabular model database on Azure Analysis
Services. The following sections describe some of them.

1. VISUAL STUDIO FOR AAS

Visual Studio is used for Analysis Services development. You need to install an extension called “Analysis Services
projects” to do this.

For new Azure AS projects you should choose the highest compatibility level available

It's not free. Every developer should have a license.


Visual studio has an integrated workspace database
An instance of SSAS Tabular running in-process
Much faster and more convenient than running a separate workspace database instance
Make sure you reduce your data volumes before you start to develop!

Visual Studio AAS Project Properties

Deployment Options\Processing Option


Default: processes objects that need processing


Full: always processes everything


Do Not Process: processes nothing
Deployment Options\Server: server you are deploying to
Use the Management Server Name url found on the Overview pane in the Azure Portal, not the
Server Name url!
Deployment Options\Database: the database you are deploying to

Advantages:

Fully supported and regularly updated


Integration with all Visual Studio features such as source control
Intellisense for DAX
Full Power Query experience

Disadvantages:

Need to work with small datasets while developing


May be slow

2. DEVELOPING IN TABULAR EDITOR

Tabular Editor is a community tool for Analysis Services developers

Advantages:

Works offline, so much more responsive – no workspace database


Optimised for working with large/complex models
Scripting allows for the creation of multiple measures easily
Better support for source control and multi-developer scenarios

Disadvantages:

You don’t see your data


No DAX intellisense for tables or columns
Power Query support is very basic

3. OTHER THIRD PARTY TOOLS

BISM Normalizer: Visual Studio extension for deploying and comparing Analysis Services databases
DAX Studio: a tool for writing and tuning DAX queries
Vertipaq Analyzer: helps you understand memory usage by tables and columns
Power BI tools include:
Various at https://powerbi.tips/tools/
Power BI Helper: https://powerbihelper.org/
Power BI Sentinel ($): http://www.powerbisentinel.com/
Data Vizioner ($): https://www.datavizioner.com/

Best Practices - Security

Azure Analysis Services security is based on Azure AD. It doesn’t allow using usernames/passwords. It mandates
all users to have a valid Azure AD identity in a tenant in the same subscription as the AAS instances.

1. CONFIGURE THE FIREWALL

By default Azure SSAS accepts traffic from any source


The Azure SSAS firewall controls which IP addresses clients can connect from
Access from Power BI is a built-in option
Current bug that Power BI imports don’t work, only Live connections
For everything else you need to supply an IP address range
Applications blocked by the firewall get a 401 Unauthorized error message

2. AZURE ANALYSIS SERVICES ROLES

Roles can be created in two ways:


Inside the Visual Studio project
Inside SQL Server Management Studio
Creating roles in the project is preferable because it means you don’t have to worry about roles being
deleted during deployment
Role permissions are additive – if you are a member of multiple roles, you have all the permissions from all
of the roles
Better to add AD security groups to roles rather than individual users, because this means less maintenance
needed on the roles themselves

3. ROW LEVEL SECURITY: HIGH LEVEL

All tables in AAS can have row-level security filters applied. The filter takes the form of a DAX expression that
returns a Boolean value – if the expression returns false for a row, that row is not seen by the user. Since filters
move along relationships, from the one side to the many side, filtering a dimension table also filters a fact table.

Inappropriately configured RLS can severely hurt performance.

RLS guidelines:

Avoid RLS filters directly on fact tables


Avoid performing logic in the security filter (e.g. conditionals, string manipulation)
Avoid disconnected security tables and LOOKUPVALUE() to simulate a relationship
Keep security tables as small as possible
Don’t have multiple roles for people with ‘all’ access. Can generate security tables dynamically on
each load
Avoid combining different security grains in a single table. Having multiple security related tables is
fine when used correctly
Avoid using bi-directional security filters
Easier to implement but performance suffers (not cached)
Better to use single direction and DAX expression to leverage the appropriate relationships
If you must use bi-directional security tables, try to keep them under 128k rows
If you have many different RLS filters on a single fact table coming from different dimensions, consider
collapsing security contexts into a single junk dimension

4. OBJECT-LEVEL SECURITY

Entire tables can be secured


Only if this does not break a chain of relationships between unsecured tables
Columns on tables can be secured
Measures, KPIs and Detail Rows Expressions that reference secured columns are automatically secured
Relationships only work on unsecured columns
Users cannot be members of roles that use row-level security and column-level security
Only available in AAS, not in Power BI


Best Practices - Data Model Optimization

Remember that an inefficient model can completely slow down a report, even with very small data volumes. We
should try and build the model towards these goals:

Make the model as small as possible


Schema should support the analysis
Relationships are built purposefully and thoughtfully

1. AIM FOR STAR SCHEMAS

Dimensional modelling structures data specifically for analysis as opposed to storage/retrieval. Hence it often
requires denormalizing the data which effectively minimizes joins. Here is Kimball Group’s 4 step process to do this:

1. Select the business process (e.g. sales)


2. Declare the grain (e.g. 1 entry per sales line item)
3. Identify the dimensions (e.g. date, customer, store, product)
4. Identify the facts (e.g. item quantity, item amount, line total, tax)

2. OPTIMIZE DIMENSIONS

1. Minimize the number of columns. In particular columns with high number of distinct values should be
excluded
2. Reduce cardinality (data type conversions)
a. For example, convert DateTime to Date, or reduce the precision of numeric fields to achieve better
compression ratio and reducing number of unique values.
b. Even if you need the time precision, split the DateTime column in 2 columns - Date and Time and if
possible round the time to the minute
3. Filter out unused dimension values (unless a business scenario requires them)
4. Use integer Surrogate Keys, pre-sort them
a. Azure Analysis Services compresses rows in segments of millions of rows
b. Integers use Run Length Encoding
c. Sorting will maximize compression when encoded as it reduces the range of values per segment
5. Ordered by Surrogate Key (to maximize Value encoding)
6. Hint for VALUE encoding on numeric columns
7. Hint for disabling hierarchies on Surrogate Keys

3. OPTIMIZING FACTS

1. Minimize the number of columns and exclude the ones not required for any reporting or self-service. In
particular columns with high number of distinct values should be excluded. Usually primary keys for Fact
tables can be excluded.
2. Handle early arriving facts. [Facts without corresponding dimension records]
3. Replace dimension IDs with their surrogate keys
4. Reduce cardinality (data type conversions)
a. For example, convert DateTime to Date, or reduce the precision of numeric fields to achieve better
compression ratio and reducing number of unique values.
b. Even if you need the time precision, split the DateTime column in 2 columns - Date and Time and if
possible round the time to the minute
5. Consider moving calculations in the BDL (source) layer so that results can be used in compression
evaluations
6. Order by less diverse SKs first (to maximize compression)
7. Increase Tabular sample size for deciding Encoding


4. BE CAREFUL WITH BI-DIRECTIONAL RELATIONSHIPS

Bi-directional relationships are undesired because applying filters/slicers traverses many relationships and will be
slower. Also, some filter chains are unlikely to add business value


Webhooks for AAS cube refreshes


Overview
How It Works
PL_PROCESS_CUBE
PL_CALLBACK
TMSL script

Overview

This article describes a standard approach to processing an Analysis Services database using webhooks. As per
the guidelines from the TDA, this process should be used by any project that makes use of AAS and needs to refresh
cubes. This process replaces the previously approved approaches of using a Batch Account or Function Apps for
cube refresh.

Similar to the AAS pause and resume ADF pipelines, the webhook is standard code maintained by the Landscape
team. This allows us to standardize the approach used for cube refreshes while still giving projects the ability to do
full or partial refreshes. The ADF pipeline uses an Automation account to connect to AAS and process the cube.

How It Works

This approach uses the Tabular Model Scripting Language (TMSL), which is used to manage AAS cubes. When a
new ADF is configured in a PDS environment, the Landscape team will provide two additional pipelines. Please
refer to the following sections for their use.

PL_PROCESS_CUBE

This pipeline has a single activity and a parameter. You may choose to pass the parameter from another pipeline or
you can trigger the pipeline on its own by supplying appropriate value for the parameter. The parameter is called
tmslScript and it accepts TMSL representation of the refresh command that you want to perform on the cube.

The pipeline will finish in a few seconds generating an asynchronous task. When the asynchronous task finishes, it
creates a blob in the project’s storage account in a container that has the same name as the data factory. The
webhook will create the container if it doesn’t exist and it also supports projects with multiple data factories. Name
of the blob will be the same as the pipeline run id. Details of the AAS refresh asynchronous task will be available in
the said blob. Please note, the blob appears in the container as soon as the asynchronous task is finished. Any
exceptions raised by the AAS refresh command will be contained in the blob.


ExecuteTMSL is a web activity and generates a POST request. The ‘body’ of the request is formed of a few pieces
and is generated at run-time. The dynamic content looks like this:

@concat('{"csv":"',pipeline().DataFactory,',',pipeline().RunId,',,PL_CALLBACK",','"object":',pipeline().parameters.
tmslScript,'}')


Notice that the pipeline parameter tmslScript is passed in the body. The name of the callback pipeline is also
passed in the body.

If this pipeline isn’t already available in your ADF, you can use the following JSON to create it:

{
"name": "PL_PROCESS_CUBE",
"properties": {
"description": "Process AAS Cube using TMSL refresh command.",
"activities": [
{
"name": "ExecuteTMSL",
"type": "WebActivity",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"url": "https://s2events.azure-automation.net
/webhooks?token=C2ZOe%2fMXfGJuAkmLiyOk1Or5PsRG5Tn9sqTqEPdg%2bFM%3d",
"method": "POST",
"body": {
"value": "@concat('{\"csv\":\"',pipeline().
DataFactory,',',pipeline().RunId,',,PL_CALLBACK\",','\"object\":',
pipeline().parameters.tmslScript,'}')",
"type": "Expression"
},
"linkedServices": [],
"datasets": []
}
}
],
"parameters": {
"tmslScript": {

"type": "string",
"defaultValue": {
"refresh": {
"type": "dataOnly",
"objects": [
{
"database": "Livewiree"
}
]
}
}
}
},
"annotations": []
},
"type": "Microsoft.DataFactory/factories/pipelines"
}

PL_CALLBACK

An additional pipeline is provided if you want to perform any action upon completion of the asynchronous task. The
webhook calls this pipeline and passes 2 parameters:

exitStatus – It will be a Boolean value suggesting if the pipeline has been successful
logFileName – A string value representing the name of the blob.

Use of this pipeline is optional and you may decide to use any other pipeline upon completion of
PL_PROCESS_CUBE. To do that, replace PL_CALLBACK with the name of your pipeline in the body of
ExecuteTMSL web activity. Please note, your pipeline should accept the exitStatus and logFileName parameters.

If this pipeline isn’t already available in your ADF, you can use the following JSON to create it:

{
"name": "PL_CALLBACK",
"properties": {
"activities": [
{
"name": "Set variable1",
"type": "SetVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "dummy",
"value": {
"value": "@pipeline().parameters.logFileName",
"type": "Expression"
}
}
}
],
"parameters": {

"exitStatus": {"type": "bool"},


"logFileName": {"type": "string"}
},
"variables": {
"dummy": {"type": "String"}
},
"annotations": []
},
"type": "Microsoft.DataFactory/factories/pipelines"
}

TMSL SCRIPT

While TMSL can be used for several operations on AAS cubes, this webhook code is restricted to only run ‘refresh’
commands. Any other command supplied as part of the tmslScript parameter will not be accepted and an
appropriate error message will be generated.

Here are a few examples of using TMSL:

Full Refresh: {"refresh":{"type":"full","objects":[{"database":"Livewiree"}]}}
Data Only Refresh: {"refresh":{"type":"dataOnly","objects":[{"database":"Livewiree"}]}}
Automatic Refresh: {"refresh":{"type":"automatic","objects":[{"database":"Livewiree"}]}}
Single Table Refresh: {"refresh":{"type":"automatic","objects":[{"database":"Livewiree","table":"Date"}]}}
Single Partition Refresh: {"refresh":{"type":"dataOnly","objects":[{"database":"Livewiree","table":"SalesOrder","partition":"Nov2019"}]}}
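
Where a project needs to trigger PL_PROCESS_CUBE from outside ADF (for example from a test script), a minimal sketch using the azure-mgmt-datafactory SDK could look like this; the subscription, resource group and factory names are placeholders.

import json
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# One of the TMSL refresh payloads above, passed as the tmslScript string parameter
tmsl = {"refresh": {"type": "dataOnly", "objects": [{"database": "Livewiree"}]}}

run = client.pipelines.create_run(
    resource_group_name="<resource-group>",
    factory_name="<data-factory-name>",
    pipeline_name="PL_PROCESS_CUBE",
    parameters={"tmslScript": json.dumps(tmsl)},
)
print(run.run_id)   # the callback blob will be named after this run id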

For further reading about the refresh command in TMSL, refer to https://docs.microsoft.com/en-us/bi-reference/tmsl/refresh-command-tmsl


Section 2.6 - Azure Data Warehouse/Azure Synapse Analytics


Introduction
Synapse SQL pool in Azure Synapse
Azure Synapse Analytics (formerly SQL DW) architecture
Synapse SQL MPP architecture components
Best Practices - General
1. Pause the service when not in use
2. Right-size the data warehouse instances
3. Implement optimization recommendations
Access Control
Storage options
Rowstore – Clustered Index
Rowstore – Heap
Clustered ColumnStore Index (CCI)
Distribution
Hash Distributed Tables
Round-Robin Distributed Tables
Replicated Tables
Data Distribution Guidance
Indexing
Clustered ColumnStore Index Considerations
Data skew
Other common performance issues - Statistics
Other common performance issues - CCI Health
Clustered ColumnStore Indexes
Other common performance issues - Resource Contention
Best Practices - Findings from Livewire India

Introduction

Data warehousing is about bringing massive amounts of data from diverse sources into one definitive, trusted
source for analytics and reporting. A data warehouse layer represents a single source of truth in a curated fashion
that Unilever can use to gain insights and make decisions. Azure Synapse is a SQL database that is optimized for
analytics and enforces structure, data quality and security. Azure Synapse is a PaaS service that brings together
enterprise data warehousing and Big Data analytics. It has storage and compute as two separate components, so
you don't have to pay for lots of idle compute when you don't need it.

Azure Synapse provides complete T-SQL based analytics.

Synapse SQL pool in Azure Synapse

Synapse SQL pool refers to the enterprise data warehousing features that are generally available in Azure Synapse.

SQL pool represents a collection of analytic resources that are being provisioned when using Synapse SQL. The
size of SQL pool is determined by Data Warehousing Units (DWU).

Import big data with simple PolyBase T-SQL queries, and then use the power of MPP to run high-performance
analytics. As you integrate and analyze, Synapse SQL pool will become the single version of truth your business
can count on for faster and more robust insights.

Data warehouse is a key component of a cloud-based, end-to-end big data solution.


In a cloud data solution, data is ingested into big data stores from a variety of sources. Once in a big data store,
Hadoop, Spark, and machine learning algorithms prepare and train the data. When the data is ready for complex
analysis, Synapse SQL pool uses PolyBase to query the big data stores. PolyBase uses standard T-SQL queries to
bring the data into Synapse SQL pool tables.

Synapse SQL pool stores data in relational tables with columnar storage. This format significantly reduces the data
storage costs, and improves query performance. Once data is stored, you can run analytics at massive scale.
Compared to traditional database systems, analysis queries finish in seconds instead of minutes, or hours instead
of days.

The analysis results can go to worldwide reporting databases or applications. Business analysts can then gain
insights to make well-informed business decisions.

Azure Synapse Analytics (formerly SQL DW) architecture


Synapse SQL MPP architecture components

Synapse SQL leverages a scale-out architecture to distribute computational processing of data across multiple
nodes. The unit of scale is an abstraction of compute power that is known as a data warehouse unit. Compute is
separate from storage, which enables you to scale compute independently of the data in your system.


SQL Analytics uses a node-based architecture. Applications connect and issue T-SQL commands to a Control
node, which is the single point of entry for SQL Analytics. The Control node runs the MPP engine, which optimizes
queries for parallel processing, and then passes operations to Compute nodes to do their work in parallel.

The Compute nodes store all user data in Azure Storage and run the parallel queries. The Data Movement Service
(DMS) is a system-level internal service that moves data across the nodes as necessary to run queries in parallel
and return accurate results.

With decoupled storage and compute, when using Synapse SQL pool one can:

Independently size compute power irrespective of your storage needs.


Grow or shrink compute power, within a SQL pool (data warehouse), without moving data.
Pause compute capacity while leaving data intact, so you only pay for storage.
Resume compute capacity during operational hours.

Best Practices - General

Here are the best practice guidelines for Azure SQL DW.

1. Pause the service when not in use

Data Warehouse is one of the most expensive components on our Azure stack, and pausing it when not in use can
save a massive amount of cost.
The ADF instance in PDS environments comes with ready-made pipelines that can be used for suspending the
service. These pipelines can be scheduled or triggered manually.

Please note that regular audits are in place to ascertain that the service is paused by projects.

2. Right-size the data warehouse instances

Make sure that the data warehouse is appropriately sized. If the service tier is too low for your project requirements
the processes will take longer and if multiple processes are trying to access the service, some of them might even
fail.

On the other hand, if the service tier is too high then a lot of capacity will be wasted and the project will incur huge
costs.

3. Implement optimization recommendations

Follow the recommendations in each of the following sections to minimise spend as well as achieve better
performance from SQL Data Warehouse.

ACCESS CONTROL

Developers do NOT have access to SQL DW in any environment except DEV environment.
Developers can access SQL DW via SSMS using their Unilever credentials.
To connect to ADLS gen1 for polybase, SPN MUST be used.

STORAGE OPTIONS

Rowstore – Clustered Index

Keeps the data in an indexed/ordered fashion, which is great for reading. These indexes, however, get fragmented
over time.


Performance

When you insert data which is ordered on Cluster Key, new rows are appended to the end of the table
resulting in fast load performance. If the data is not ordered on Cluster Key, new rows inserted into existing
pages results in Page Splits – poor load performance
Index maintenance is an overhead on data load
Good lookup performance
Ideal for limited range scans & singleton selects (Seeks)
Slower for table scans / partition scans / loading

Rowstore – Heap

This form of storage has no clustered index. It is basically unordered data.

Performance

New rows appended to the end of the table – fast load performance
Whole table is / may be read for lookups (Seeks)
Whole table is read for Scans
Bad read performance

Clustered ColumnStore Index (CCI)

This is a highly compressed, IO efficient form of storing data.

Performance

Compression up to 15x (vs. RowStore up to 3.5x)


Load performance dependent on Batch Size
Lookup (seek) queries perform badly
Scan queries – optimized!

The Query performance depends on CCI quality / health

DISTRIBUTION

Pick the right distribution for your tables and select the proper table distribution type (hash, round-robin, or replicated); a sketch creating each type follows the descriptions below.

Select the right hash distribution column and minimize data movement
Replicate dimension tables to reduce data movement

Hash Distributed Tables

Data is distributed based on the hash of the distribution key value


60 Distributions
Use for Fact Tables


Round-Robin Distributed Tables

Data is distributed evenly across all distributions


60 Distributions
Use for Staging Tables

Replicated Tables

Copied on each compute node


Use for Dimension tables < 2GB
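
To make the three options concrete, here is a hedged sketch that creates one table of each distribution style by submitting T-SQL through pyodbc; the connection string, table and column names are illustrative placeholders only.

import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<synapse-server>.database.windows.net,1433;"
    "Database=<sql-pool>;"
    "Authentication=ActiveDirectoryInteractive;UID=<user>@unilever.com",   # AAD sign-in is an assumption
    autocommit=True,
)
cursor = conn.cursor()

# Fact table: hash distributed on a join key, clustered columnstore
cursor.execute("""
CREATE TABLE dbo.FactSales
( SalesOrderKey BIGINT NOT NULL, CustomerKey INT NOT NULL, SalesAmount DECIMAL(18,2) )
WITH ( DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX );
""")

# Staging table: round-robin heap for fast loads
cursor.execute("""
CREATE TABLE dbo.SalesLanding_Stg
( SalesOrderKey BIGINT, CustomerKey INT, SalesAmount DECIMAL(18,2) )
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );
""")

# Small dimension table (< 2GB): replicated to every compute node
cursor.execute("""
CREATE TABLE dbo.DimCustomer
( CustomerKey INT NOT NULL, CustomerName NVARCHAR(200) )
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED INDEX (CustomerKey) );
""")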

DATA DISTRIBUTION GUIDANCE

A distribution is the basic unit of storage and processing for parallel queries that run on distributed data. When SQL
Data Warehouse runs a query, the work is divided into 60 smaller queries that run in parallel. Each of the 60 smaller
queries runs on one of the data distributions. Each Compute node manages one or more of the 60 distributions. A
data warehouse with maximum compute resources has one distribution per Compute node. A data warehouse with
minimum compute resources has all the distributions on one compute node.

When creating a table in SQL DW, you need to decide if the table will be hash distributed or round-robin distributed.
This decision has implications for query performance. Each of these distributed tables may require data movement
during query processing when joined together. Data movement in MPP RDBMS system is an expensive but
sometimes unavoidable step. The objective of a good data warehouse design in SQL DW is to minimize data
movement.

Start with Round Robin but aspire to a hash-distribution strategy to take advantage of a massively parallel
architecture.
Make sure that common hash keys have the same data format.
Don’t distribute on varchar format.
Dimension tables with a common hash key to a fact table with frequent join operations can be hash
distributed.
Use sys.dm_pdw_nodes_db_partition_stats to analyze any skewness in the data.
Use sys.dm_pdw_request_steps to analyze data movements behind queries, monitor the time broadcast,
and shuffle operations take. This is helpful to review your distribution strategy.

INDEXING

Azure SQL Data Warehouse supports following type of indexes:

Clustered ColumnStore Index


Clustered Index
Non-Clustered Index
No Index – Heap

Clustered ColumnStore Index Considerations

By default, SQL Data Warehouse creates a clustered ColumnStore index when no index options are specified on a
table. Clustered ColumnStore tables offer both the highest level of data compression as well as the best overall
query performance. Clustered ColumnStore is the best place to start when you are unsure of how to index your
table.

The clustered ColumnStore index is more than an index, it is the primary table storage. It achieves high data
compression and a significant improvement in query performance on large data warehousing fact and dimension
tables. Clustered ColumnStore indexes are best suited for analytics queries rather than transactional queries, since
analytics queries tend to perform operations on large ranges of values rather than looking up specific values.


Heap Index

This is no index at all; it is recommended only for temporary landing of data, e.g. a staging-layer data load

Clustered Index

Clustered indexes may outperform clustered ColumnStore tables when a single row needs to be quickly retrieved.
For queries where a single or very few row lookups must perform with extreme speed, consider a clustered index
or a non-clustered secondary index.

The disadvantage of using a clustered index is that the only queries that benefit are the ones that use a highly selective
filter on the clustered index column. To improve filtering on other columns, a non-clustered index can be added to
them. However, each index which is added to a table adds both space and processing time to loads.

DATA SKEW

Data skew primarily refers to a non-uniform distribution in a dataset. The data warehouse spreads data across 60
distributions; if some distributions hold more data than others, the workers with more data have to work harder and
longer, and need more resources and time to complete their jobs. Detect data skew (a sketch for doing so follows
this subsection) and:

Reconsider the hash key causing physical data skew

A query is only as fast as your slowest distribution

Causes of Data Skew

Natural Skew
NULL hash key values
Default hash key value
Bad hash key choice

Resolution

Pick a different hash key


Split default values into a secondary table
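
As a hedged sketch of detecting skew, the following runs DBCC PDW_SHOWSPACEUSED for a suspected table and compares the per-distribution row counts (the table name and connection string are placeholders; the assumption is that the row count is the first column of the result set).

import pyodbc

conn = pyodbc.connect("<connection string as in the earlier sketch>")   # placeholder
cursor = conn.cursor()

# One result row per distribution (60 in total); a healthy hash key gives similar counts
cursor.execute('DBCC PDW_SHOWSPACEUSED("dbo.FactSales");')
rows = cursor.fetchall()

row_counts = [int(r[0]) for r in rows]
skew = (max(row_counts) - min(row_counts)) / max(max(row_counts), 1)
print(f"Distribution skew: {skew:.0%}")   # rule of thumb: investigate if more than ~10%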

Other common performance issues - Statistics

The MPP query optimizer relies heavily on statistics to evaluate plans. Out-of-date or non-existent statistics are the
most common reason for MPP performance issues. Avoid issues with statistics by creating them on all
recommended columns and updating them after every load (a sketch follows at the end of this subsection).

Create and update statistics


Enable auto-create stats (auto sample stats)
Auto update statistics
Add multi-column statistics for joins on multiple columns or predicates
Optimal plan selection

It is recommended Statistics are created on all columns used in:

Joins
Predicates
Aggregations
Group By’s
Order By’s
Computations

Don’t forget about multi-column statistics …


Good news is that Azure SQL DW now supports automatic creation of column level statistics however

Auto Update not supported


Multi-column stats not auto created
Stats Creation is Synchronous
Stats Creation is triggered by:
SELECT
INSERT-SELECT
CTAS
UPDATE
DELETE
EXPLAIN
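
A hedged sketch of the recommendations above: create multi-column statistics on a frequent join/predicate combination and refresh statistics after each load (the table, column and statistics names are placeholders).

import pyodbc

conn = pyodbc.connect("<connection string as in the earlier sketch>", autocommit=True)   # placeholder
cursor = conn.cursor()

# Multi-column statistics are not auto-created, so create them explicitly
cursor.execute(
    "CREATE STATISTICS stat_factsales_cust_date "
    "ON dbo.FactSales (CustomerKey, OrderDateKey) WITH SAMPLE 20 PERCENT;"
)

# Auto-update is not supported, so refresh statistics after every load
cursor.execute("UPDATE STATISTICS dbo.FactSales;")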

Other common performance issues - CCI Health

CLUSTERED COLUMNSTORE INDEXES

Better Performance with > 100K rows / Compressed Row Group


Best Performance with 1,048,576 rows / Compressed Row Group
Deleted rows impact performance
Open Row Groups (Delta Store) – HEAPs
Loading Batches –
Greater than 100K rows / distribution – direct to Compressed format
Small Resource Class – memory pressure can limit Compressed RGs size
Compressing a Row Group requires Memory:
72MB + (r * c * 8) + (r * short str col * 32) + (long str col * 16MB)
Distributed tables have 60 sets of Row Groups
Recommended 60M rows (1M / distribution)
Each distribution has its own Delta Store
Partitions add CCIs / distribution

Other common performance issues - Resource Contention

Queries occupy concurrency slots based on their resource class. The number of concurrent queries depends on the
DWU service objective. The RAM allocated per query depends on the resource class and DWU.

Provision additional adaptive cache capacity


Increase cache hit percentage
Reduce tempdb contention
Scale or update resource class

Best Practices - Findings from Livewire India

Large fact tables or historical transaction tables should be stored as hash distributed tables
Partitioning large fact tables benefits data maintenance and query performance
Missing statistics; recommendations for updating statistics:
Frequency of stats updates: Conservative: Daily
After loading or transforming your data: if the data is less than 1 billion rows, use default sampling (20
percent). With more than 1 billion rows, statistics on a 2 percent range is good
Temporarily landing data on SQL Data Warehouse in a heap table makes the overall process faster
Minimal logging with bulk load (INSERT from SELECT) to avoid memory errors
Batch size >= 102,400 per partition aligned distribution


Scale up and down depending on the ELT process and AAS cubes generation.
No business reports/dashboards connect to Azure SQL DW.
It cannot be paused because operational data incidents must be analyzed using SQL statements. Data
lineage is needed for investigation.
Reserved capacity recommended.
Azure Data Factory & Databricks for the ETL process
Split user reporting workloads:
Direct Query (Azure SQLDW) – Fact Detail
Import (Power BI Datasets) – Fact Aggregates & Dimensions
PowerBI composite models & aggregated tables
Azure SQL Datawarehouse Materialized Views and Result set cache
Improve performance and query concurrency for queries at lowest granular level
Reduce the number of pre-calculated combinations at the cube level
Need to find the right aggregation level for a scalable architecture
Incremental refresh vs Full refresh

Good Reads

https://techcommunity.microsoft.com/t5/DataCAT/Azure-SQL-Data-Warehouse-loading-patterns-and-strategies/ba-p/305456
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-overview#unsupported-table-features


Section 2.7 - Azure Machine Learning


Overview of Azure Machine Learning
Use case
Limitation
Industrialization
Azure Machine Learning components
Machine Learning Workspace
Storage Account
Application Insights
Key Vault
Automated Machine Learning
Designer (preview)
Azure Notebooks (Free)
Machine Learning Studio
Cloud based Notebook VM - JupyterLab
Local environment
Data Science Virtual Machine (DSVM)
Azure Machine Learning Pipelines
Azure Notebook VM
MLOps

Overview of Azure Machine Learning

Azure Machine Learning is a cloud service that you use to train, deploy, automate, and manage machine learning
models, all at the broad scale that the cloud provides.
The following table shows various development environments supported by Azure Machine learning, along with
their pros and cons.

Environment: Automated ML
Pros: Automated machine learning automates the process of selecting the best algorithm to use for your specific data, so you can generate a machine learning model quickly.
Cons: Less control on data preparation and hyper-parameter tuning. Available in Enterprise Edition only.

Environment: Designer (preview)
Pros: Azure Machine Learning designer lets you visually connect datasets and modules on an interactive canvas to create machine learning models. It enables you to prep data, train, test, deploy, manage, and track machine learning models without writing code. Provides a UI-based, low-code experience.
Cons: Still in preview. Only has an initial set of popular modules.

Environment: Azure Notebooks
Pros: Free. Supports more languages than any other platform including Python 2, Python 3, R, and F#.
Cons: Each project is limited to 4GB memory and 1GB data to prevent abuse.

Environment: Machine Learning Studio (Classic)
Pros: Easiest way to get started; includes hundreds of built-in packages and support for custom code. Supports data manipulation, model training and deployment.
Cons: Scale (10GB training data limit). Supports proprietary compute target, CPU only. ML Pipeline is not supported.

Environment: Local environment
Pros: Full control of your development environment and dependencies. Run with any build tool, environment, or IDE of your choice.
Cons: Takes longer to get started. Necessary SDK packages must be installed, and an environment must also be installed if you don't already have one.

Environment: Azure Databricks
Pros: Ideal for running large-scale intensive machine learning workflows on the scalable Apache Spark platform.
Cons: Overkill for experimental machine learning or smaller-scale experiments and workflows. Additional cost incurred for Azure Databricks; see pricing details.

Environment: The Data Science Virtual Machine (DSVM)
Pros: Similar to the cloud-based notebook VM (Python and the SDK are pre-installed), but with additional popular data science and machine learning tools pre-installed. Easy to scale and combine with other custom tools and workflows.
Cons: A slower getting-started experience compared to the cloud-based notebook VM.

Use case

Good for experimentation


Drag and drop UI Interface
Data sits in Azure environment.
Azure ML uses computes like CPU, GPU, HDI and Databricks
Supports model building in Python
Supports containerization and publishing models as web services
Supports azure notebooks
Supports library publishing within a workspace

Limitation

Limited support for R

Industrialization

Azure ML jobs need to be published as web services, which can be scheduled from Azure

Azure Machine Learning components

When you create a Machine Learning component you get the following resources:

Machine Learning Workspace

A workspace is a centralized place to work on all Machine Learning aspects. It is a logical container for your machine
learning experiments, compute targets, data stores, machine learning models, Docker images, deployed services
etc., keeping them all together. It allows you to prepare data for experimentation, train models, compare experimentation
results, deploy models and monitor them. The workspace keeps a history of all training runs, including logs, metrics,
output, and a snapshot of your scripts. This is where you create experiments and maintain models.

Storage Account

It is used as the default datastore for the workspace. Jupyter notebooks that are used with your notebook VMs are
stored here as well.

Application Insights


Stores monitoring information about your models.

Key Vault

Stores secrets that are used by compute targets and other sensitive information that's needed by the workspace.

Automated Machine Learning

In normal machine learning scenarios, you bring data together from disparate places and prepare it for model training. You then train the model and, once you're happy with its output, you deploy it.
This whole process is a series of steps and interconnected decisions that you make to get the model accuracy that you're looking for. For example, when you're preparing data you may ask yourself how to handle nulls, and whether you have the right number of features in the dataset to reach the accuracy and score you're looking for. And when you're building and training, which algorithm should you select? Should it be a linear regression or a decision tree, and what are the hyper-parameters you need to choose for those specific algorithms?
So, as you can see, there are a number of questions that you may be asking yourself through this process. Without a tool that can automatically do this for you, you might iteratively try combinations of the above to achieve the score you wanted. This process can be very costly and time consuming, especially if you don't really know the data.

Automated machine learning is a way to automate this process. With automated machine learning you enter your data, you define your goals, and you apply your constraints. Automated machine learning builds an end-to-end pipeline that allows you to build a model and reach an accuracy quickly and effectively. You can get an optimized model with far fewer iterations and far fewer steps, saving you time, money and resources.
Automated machine learning essentially simplifies the process of generating models tuned to the goals and constraints you defined for your experiment, such as the time for the experiment to run, which models to allow or deny, how many iterations to run, or an exit score that you may have defined.

“Automated machine learning isn't a brute-force attack, nor is it doing a hyper-parameter optimization for you. Automated machine learning is using machine learning to help you build machine learning models.”

Automated machine learning examines the dataset that you have and its characteristics and recommends new pipelines to use to build your machine learning models. This encompasses:

preprocessing steps
feature extraction
feature generation
model selection
hyper-parameter tuning

It also learns from the metadata of your previous iterations to recommend new pipelines that reach your score and exit criteria much more quickly. This helps accelerate your machine learning processes with more efficiency and ease.
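
As a hedged illustration of how goals and constraints are expressed, here is a minimal sketch using the Azure ML Python SDK's AutoMLConfig; the dataset name, label column, compute target and primary metric are assumptions for the example, not prescribed values:

from azureml.core import Workspace, Dataset, Experiment
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()
training_data = Dataset.get_by_name(ws, name="sales-training")  # hypothetical registered dataset

automl_config = AutoMLConfig(
    task="classification",
    training_data=training_data,
    label_column_name="churned",       # hypothetical label column
    primary_metric="AUC_weighted",     # the goal to optimize
    experiment_timeout_minutes=30,     # constraint: total experiment time
    max_concurrent_iterations=4,
    compute_target="cpu-cluster",      # hypothetical compute cluster in the workspace
)

run = Experiment(ws, "automl-demo").submit(automl_config)
run.wait_for_completion(show_output=True)
best_run, fitted_model = run.get_output()  # best pipeline found within the constraints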

Designer (preview)

The visual interface uses your Azure Machine Learning workspace to:

Create, edit, and run pipelines in the workspace.


Access datasets.
Use the compute resources in the workspace to run the pipeline.
Register models.
Publish pipelines as REST endpoints.
Deploy models as pipeline endpoints (for batch inference) or real-time endpoints on compute resources in
the workspace.

There is no programming required; you visually connect datasets and modules to construct your model.

Azure Notebooks (Free)

Azure Notebooks is a free hosted service to develop and run Jupyter notebooks in the cloud with no installation.
Please note however, each project is limited to 4GB memory and 1GB data to prevent abuse. Legitimate users that
exceed these limits see a Captcha challenge to continue running notebooks. Azure Notebooks helps you to get
started quickly on prototyping, data science, academic research, or learning to program Python giving instant
access to a full Anaconda environment with no installation.
To learn more, check out the documentation .

Machine Learning Studio

Azure Machine Learning Studio gives you an interactive, visual workspace to easily build, test, and iterate on a
predictive analysis model. You drag-and-drop datasets and analysis modules onto an interactive canvas,
connecting them together to form an experiment, which you run in Machine Learning Studio. To iterate on your
model design, you edit the experiment, save a copy if desired, and run it again. When you're ready, you can convert
your training experiment to a predictive experiment, and then publish it as a web service so that your model
can be accessed by others.

There is no programming required, just visually connecting datasets and modules to construct your predictive
analysis model.


Cloud based Notebook VM - JupyterLab

The Azure ML Notebook VM is a cloud-based workstation created specifically for data scientists. Developers and
data scientists can perform every operation supported by the Azure Machine Learning Python SDK using a familiar
Jupyter notebook in a secure, enterprise-ready environment. Notebook VM is secure and easy-to-use,
preconfigured for machine learning, and fully customizable.

Key features of Azure Machine Learning service Notebook VMs are:

Secure – provides AAD login integrated with the AML Workspace, provides access to files stored in the
workspace, implicitly configured for the workspace.
Preconfigured – with Jupyter, JupyterLab, up-to-date AML Python Environment, GPU drivers, Tensorflow,
Pytorch, Scikit learn, etc. (uses DSVM under the hood)
Simple set up – created with a few clicks in the AML workspace portal, managed from within the AML
workspace portal.
Customizable – use CPU or GPU machine types, install your own tools (or drivers), ssh to the machine,
changes persist across restarts.


Please note that there are some disadvantages:

Lack of control over your development environment and dependencies.
Additional cost incurred for the Linux VM (the VM can be stopped when not in use to avoid charges).
Exposure to potential misuse.

Local environment

Azure Machine Learning enables you to locally create and run machine learning experiments, create and train models, and much more. This requires installing various tools such as Visual Studio Code as the development environment and the Azure ML SDK. Using azureml.core.compute.ComputeTarget you can select the local machine as a compute target. For a comprehensive guide on setting up and managing compute targets, see the how-to.
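
A minimal, hedged sketch of running a training script on the local machine with the Azure ML Python SDK (v1); the script name and experiment name are assumptions:

from azureml.core import Workspace, Experiment, ScriptRunConfig

ws = Workspace.from_config()

# Point the run configuration at the local machine instead of a remote compute target
src = ScriptRunConfig(
    source_directory=".",
    script="train.py",       # hypothetical training script in the current directory
    compute_target="local",
)

run = Experiment(ws, "local-experiment").submit(src)
run.wait_for_completion(show_output=True)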

Here's a video to help you get started using Visual Studio code for Machine Learning

https://www.youtube.com/embed/8EGJP7RPe1A

Data Science Virtual Machine (DSVM)

DSVMs are Azure Virtual Machine images, pre-installed, configured and tested with several popular tools that are
commonly used for data analytics, machine learning and AI training. They are Pre-configured environments in the
cloud for Data Science and AI Development.

“DSVMs offer the most flexibility to develop ML models, but they are hard to govern and hence Unilever doesn't promote use of DSVM unless there's a specific business requirement that can't be met by the standard PaaS services described above.”

For further reading, refer to the documentation .

Azure Machine Learning Pipelines

An Azure ML pipeline performs a complete logical workflow with an ordered sequence of steps. Each step is a
discrete processing action. Pipelines run in the context of an Azure Machine Learning Experiment .

Pipelines should focus on machine learning tasks such as:


Data preparation including importing, validating and cleaning, munging and transformation, normalization,
and staging
Training configuration including parameterizing arguments, filepaths, and logging / reporting configurations
Training and validating efficiently and repeatably, which might include specifying specific data subsets,
different hardware compute resources, distributed processing, and progress monitoring
Deployment, including versioning, scaling, provisioning, and access control
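
Building on the task list above, here is a hedged sketch of a two-step pipeline using the Azure ML Python SDK; the step scripts, source directory and compute target are hypothetical:

from azureml.core import Workspace, Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

prep_step = PythonScriptStep(
    name="prepare-data",
    script_name="prep.py",          # hypothetical data preparation script
    source_directory="./steps",
    compute_target="cpu-cluster",   # hypothetical compute cluster registered in the workspace
)

train_step = PythonScriptStep(
    name="train-model",
    script_name="train.py",         # hypothetical training script
    source_directory="./steps",
    compute_target="cpu-cluster",
)
train_step.run_after(prep_step)     # enforce the ordered sequence of steps

pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
pipeline_run = Experiment(ws, "sample-pipeline").submit(pipeline)

# Optionally publish the pipeline as a REST endpoint for scheduled or external triggering
published = pipeline.publish(name="sample-pipeline", description="demo", version="1.0")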

Azure Notebook VM

Azure Notebook VMs come with the entire ML SDK already installed in your workspace VM, and notebook tutorials are pre-cloned and ready to run. While it is the easiest way to get started running ML models, there are some disadvantages: lack of control over your development environment and dependencies, and additional cost incurred for the Linux VM (the VM can be stopped when not in use to avoid charges).

MLOps

Machine Learning Operations (MLOps) is based on DevOps principles and practices that increase the efficiency of
workflows. For example, continuous integration, delivery, and deployment.

Azure Machine Learning provides the following MLOps capabilities:

Create reproducible ML pipelines. Machine Learning pipelines allow you to define repeatable and reusable
steps for your data preparation, training, and scoring processes.
Create reusable software environments for training and deploying models.
Register, package, and deploy models from anywhere. You can also track associated metadata required
to use the model.
Capture the governance data for the end-to-end ML lifecycle. The logged information can include who is
publishing models, why changes were made, and when models were deployed or used in production.
Notify and alert on events in the ML lifecycle. For example, experiment completion, model registration,
model deployment, and data drift detection.
Monitor ML applications for operational and ML-related issues. Compare model inputs between training
and inference, explore model-specific metrics, and provide monitoring and alerts on your ML infrastructure.
Automate the end-to-end ML lifecycle with Azure Machine Learning and Azure Pipelines. Using
pipelines allows you to frequently update models, test new models, and continuously roll out new ML models
alongside your other applications and services.
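
As a hedged sketch of the "register, package, and deploy" capability using the Azure ML Python SDK; the model path, scoring script, curated environment name and service name are assumptions:

from azureml.core import Workspace, Environment
from azureml.core.model import Model, InferenceConfig
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()

# Register a trained model artifact together with tracking metadata
model = Model.register(
    workspace=ws,
    model_path="outputs/model.pkl",   # hypothetical artifact produced by a training run
    model_name="sample-model",
    tags={"stage": "dev"},
)

# Package the model with a scoring script and environment, then deploy as a web service
env = Environment.get(workspace=ws, name="AzureML-Minimal")  # curated environment, assumed available
inference_config = InferenceConfig(entry_script="score.py", environment=env)  # hypothetical scoring script
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(ws, "sample-service", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)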


Section 2.8 - Azure Monitor & Log Analytics


Monitoring is the act of collecting and analyzing data to determine the performance, health, and availability of your
business application and the resources that it depends on.

What is Azure Monitor?

Azure Monitor provides a single management point for infrastructure-level logs and monitoring for most of your
Azure services. Azure Monitor maximizes the availability and performance of your applications by delivering a
comprehensive solution for collecting, analyzing, and acting on telemetry from your cloud and on-premises
environments.

The following diagram depicts a high-level view of Azure Monitor. At the center of the diagram are the data stores
for metrics and logs, which are the two fundamental types of data that Azure Monitor uses. On the left side are the
sources of monitoring data that populate these data stores. On the right side are the different functions that Azure
Monitor performs with this collected data such as analysis, alerting, and streaming to external systems.

Azure Monitor can collect data from a variety of sources. You can think of monitoring data for your applications as
occurring in tiers that range from your application to any OS and the services it relies on to the platform itself. Azure
Monitor collects data from each of the following tiers:

Application monitoring data - Data about the performance and functionality of the code you have written,
regardless of its platform.
Guest OS monitoring data - Data about the OS on which your application is running. It might be running in
Azure, in another cloud, or on-premises.
Azure resource monitoring data - Data about the operation of an Azure resource.
Azure subscription monitoring data - Data about the operation and management of an Azure subscription
and data about the health and operation of Azure itself.
Azure tenant monitoring data - Data about the operation of tenant-level Azure services, such as Azure
Active Directory (Azure AD).


As soon as you create an Azure subscription and start adding resources, such as VMs and web apps, Azure
Monitor starts collecting data. Activity logs record when resources are created or modified and metrics tell you how
the resource is performing and the resources that it's consuming. You can also extend the data you're collecting by
enabling diagnostics in your apps and adding agents to collect telemetry data from Linux and Windows or
Application Insights.

Azure Monitor is the place to start for all your near real-time resource metric insights. Many Azure resources will
start outputting metrics automatically once deployed. For example, Azure Web App instances will output compute
and application request metrics. Metrics from Application Insights are also collated here in addition to VM host
diagnostic metrics.

Log Analytics

Centralized logging can help you uncover hidden issues that may be difficult to track down. With Log Analytics you
can query and aggregate data across logs. This cross-source correlation can help you identify issues or
performance problems that may not be evident when looking at logs or metrics individually. The following illustration
shows how Log Analytics acts as a central hub for monitoring data. Log Analytics receives monitoring data from
your Azure resources and makes it available to consumers for analysis or visualization.

You can collate a wide range of data sources, security logs, Azure activity logs, server, network, and application
logs. You can also push on-premises System Center Operations Manager data to Log Analytics in hybrid
deployment scenarios and have Azure SQL Database send diagnostic information directly into Log Analytics for
detailed performance monitoring.

When designing a monitoring strategy, it's important to include every component in the application chain, so you
can correlate events across services and resources. For services that support Azure Monitor, they can be easily
configured to send their data to a Log Analytics workspace. You can also submit custom data to Log Analytics
through the Log Analytics API.

Log Analytics allows us to collect, search and visualize machine data from cloud and on-premises sources.

1. It is a hyper-scale machine data analytics platform

2. Provides out-of-the-box application and workload insights


Sample list of logs/metrics that Log Analytics collects:

Custom Application/Infra logs


Azure Platform telemetry
Windows event logs
Windows performance counters
Security Event Logs
IIS Logs

With this data in Log Analytics, you can query the raw data for troubleshooting, root cause identification, and
auditing purposes. Here are some examples:

Track the performance of your resource (such as a VM, website, or logic app) by plotting its metrics on a
portal chart and pinning that chart to a dashboard.
Get notified of an issue that impacts the performance of your resource when a metric crosses a certain
threshold.
Configure automated actions, such as autoscaling a resource or firing a runbook when a metric crosses a
certain threshold.
Perform advanced analytics or reporting on performance or usage trends of your resource.
Archive the performance or health history of your resource for compliance or auditing purposes
Query the logs
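
As a hedged sketch of querying a workspace programmatically, here is an example using the azure-monitor-query Python package; the workspace ID is a placeholder and the KQL assumes Data Factory diagnostic logs are being routed to the workspace:

from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

workspace_id = "<log-analytics-workspace-id>"  # placeholder

# Count failed ADF pipeline runs per hour over the last day (assumes diagnostics are enabled)
query = """
AzureDiagnostics
| where Category == "PipelineRuns"
| summarize failures = countif(status_s == "Failed") by bin(TimeGenerated, 1h)
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(workspace_id, query, timespan=timedelta(days=1))

for table in response.tables:
    for row in table.rows:
        print(row)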

For several known services (SQL Server, Windows Server Active Directory), there are management solutions
readily available that visualize monitoring data and uncover compliance with best practices.


Generate dashboards for different services

Log Analytics allows you to create queries and interact with other systems based on those queries. The most
common example is an alert. Maybe you want to receive an email when a system runs out of disk space or a best
practice on SQL Server is no longer followed. Log Analytics can send alerts, kick off automation, and even hook
into custom APIs for things like integration with IT service management (ITSM).

ADF Dashboard Example

AAS Processing Dashboard Example

Azure Databricks Dashboard Example


Section 2.9 - Azure App Service


Web App recommendations:

Only the PaaS Azure Web App component is approved as part of the I&A landscape.
Only Windows-based web apps are approved as a PaaS component.
Web App should use AAD authentication and MFA for user access control.
WAF (Web Application Firewall) is mandatory for Web App as per the security right practices.
Web App can use SQL Database to keep the configuration and user profiling information. For any data requirement please work with the Architect to define the right database for storing the data.
MySQL database is not approved to store the data.
Web App connection to SQL DB/DW should be only through the MSI method.
ASP.NET is the suggested language.
Only 2 App Service plans are suggested: one for Prod and another for Non-Prod.
Deployment of Web App components should follow standard Azure DevOps deployment. Docker is not suggested.
Web App can be used to embed Power BI dashboards.

Below needs to be verified:

TDA needs to add guidance on embedding PBI in the web app.
Blazor component usage for Web App?

Blazor

Blazor allows you to build full-stack web apps using just .NET Core 3.0 and C#.
Until now, .NET has been used to generate server-rendered web apps. The server runs .NET and generates HTML or JSON code in response to a browser request. If you wanted to do anything on the client (browser), you could use JavaScript.

With Blazor, you can:

Build client-side web UI with .NET instead of JavaScript


Write reusable web UI components with C# and Razor
Share .NET code with both the Client and the Server
Call into JavaScript libraries & browser APIs as needed

Blazor employs a component model. So, unlike MVC, where each “View” is essentially an entire page, which you
get back from the server when you make a request, Blazor deals in components.

A component can represent an entire “page”, or you can render components inside other components.

The component model approach gives you a convenient way to take a mockup or design, break it down into smaller
parts, and build each part separately (as components) before composing them back together. Components also
carry the benefit of being easily re-used.

Hence it is a very productive way of writing and maintaining web app code. It's available with VS 2019 and also in VS Code with the C# extension.

Running Blazor on Client or Server

Blazor WebAssembly

Pro:
True SPA, full interactivity
Utilize client resources
Supports offline, static sites, PWA scenarios

Con:
Larger download size
Requires WebAssembly
Still in preview
Not supported by old browsers like IE 11

Blazor Server

Pro:
Smaller download size, faster load time
Running on fully featured .NET runtime
Code never leaves the server
Simplified architecture

Con:
Latency
No offline support
Consumes server resources

Concerns

Blazor server does most of its processing on the server, making your server (and network resources) the primary
point of failure, and bottleneck when it comes to speed. Every button click, or DOM triggered event gets sent over a
socket connection to the server, which processes the event, figures out what should change in the browser, then
sends a small “diff” up to the browser (via the socket connection).

This article from Microsoft is a useful guide to the kind of performance you can expect.

“In our tests, a single Standard_D1_v2 instance on Azure (1 vCPU, 3.5 GB memory) could handle over 5,000
concurrent users without any degradation in latency”

Based on this, for an average web application it seems the server resources wouldn’t be a major concern, but the
network latency might be a factor.

Blazor WebAssembly will be able to remove this constraint, but it is still in preview and is expected to be generally available by May 2020.


Web app connection to Azure SQL DB using MSI


Overview
Setup
Enable the MSI
AD Groups
References

Overview

A web app can connect to an Azure SQL database using Managed Service Identity (MSI), and it is a safer way to connect as there is no need for a username or password in the connection string. This document describes the process of achieving that.

It is supported for both ASP.NET app as well as ASP.NET Core app.

Note: the PoC was performed for projects running on .Net Framework 4.7.2 but this method should be supported by
.Net Frameworks 4.5.2 and 4.6.1 as well.

Setup

A few things need to be set up for this to work. Most of these will be done by the landscape team, so projects don't have to worry about them, but to describe the complete process this document includes them.

ENABLE THE MSI

We need to enable MSI for the web app. To do that, navigate to your web app and under settings, click on ‘identity’.
Under the System assigned tab, change the status to On. It will generate the Object ID.

AD GROUPS

Three AD groups in each environment will be in place and would have admin, write and read accesses on the
Azure SQL Database. Please note that landscape would have already enabled the AD admin on the SQL Server
setting the first AD group as admin for SQL Server.

Depending on your requirement, you can add the MSI as a member of either the reader or the writer AD group. Please reach out to the DevOps team or the landscape team to get it set up.


Modify ASP.NET application code

If your application uses ASP.NET then follow these steps. Steps in this section should be done by the project teams.

Install the NuGet package Microsoft.Azure.Services.AppAuthentication to your application.


In Web.config, under <configSections>, add the following line:

<section name="SqlAuthenticationProviders" type="System.Data.SqlClient.SqlAuthenticationProviderConfigurationSection, System.Data, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" />

Add another section after <configSections> using the following code:

<SqlAuthenticationProviders>
  <providers>
    <add name="Active Directory Interactive" type="Microsoft.Azure.Services.AppAuthentication.SqlAppAuthenticationProvider, Microsoft.Azure.Services.AppAuthentication" />
  </providers>
</SqlAuthenticationProviders>

Change your connection string to:

"server=tcp:<server-name>.database.windows.net;database=<db-name>;
UID=AnyString;Authentication=Active Directory Interactive"

Replace <server-name> and <db-name> with your server name and database name.

Publish your app

References

For full Microsoft documentation please refer to the following links.

https://docs.microsoft.com/en-us/azure/app-service/app-service-web-tutorial-dotnet-sqldatabase

https://docs.microsoft.com/en-us/azure/app-service/app-service-web-tutorial-connect-msi


Section 2.10 - Azure Logic App


Introduction
Approved connectors
Approved connection patterns
Logic App to Azure SQL DW
Logic App to Azure SQL DB
Logic App to Azure Data Lake Storage Gen1
Logic App to Sharepoint
Logic App to O365 email

Introduction

Azure Logic App helps you schedule, automate, and orchestrate tasks, business processes, and workflows.

Logic App workflows are initiated based on a trigger. Logic App works on a trigger-action model.

Triggers: A trigger is an event that meets specified conditions, for example receiving an email or a new file/blob being created in a storage account. A recurrence trigger can be used to start a logic app workflow.
Actions: Actions are steps that are executed as a result of a trigger. Each action usually maps to an operation that's defined by a managed connector, custom API, or custom connector.

Approved connectors

Logic App supports multiple connectors for triggers and actions. The following connectors are approved for usage within Unilever environments:

1. Azure SQL DB
2. Azure SQL DW
3. Azure Data Lake Storage Gen1
4. Sharepoint
5. O365 email

Any connectors not listed above MUST be approved via I&A TDA.

Approved connection patterns


Logic App to Azure SQL DW


In this pattern, logic app and logic app API connector is deployed by I&A landscape in DEV environment.
For higher environments, this MUST be deployed via azure devops CI/CD pipeline.
Azure SQL data warehouse logic app connector SHOULD be deployed using SQL database credentials.
Data warehouse credentials SHOULD have limited privileges required to execute workflow. (e.g. read or
read-write access to specific tables.)
Azure SQL data warehouse credentials MUST be stored in product specific azure keyvault.
Product teams WILL NOT have permissions to read secrets, credentials from product specific azure keyvault.
Unilever I&A landscape will deploy logic app using following workflow:

Project team WILL be provided with required templates to run these deployments from azure devops.

At the time of publishing of this document logic app does not support SPN or MSI authentication with Azure
SQL. Project MUST change the connectivity type, when SPN or MSI authentication is available with Azure
SQL connector.

Logic App to Azure SQL DB


In this pattern, logic app and logic app API connector is deployed by I&A landscape in DEV environment.
For higher environments, this MUST be deployed via azure devops CI/CD pipeline.
Azure SQL connector MUST be deployed using SQL database credentials.
Database credentials SHOULD have limited privileges which are required to execute workflow. (e.g. read or
read-write access to specific tables.)
Azure SQL database credentials MUST be stored in product specific azure keyvault.
Product teams WILL NOT have permissions to read secrets, credentials from product specific azure keyvault.
Unilever I&A landscape will deploy logic app using following workflow:

Project team WILL be provided with required templates to run these deployments from azure devops.

At the time of publishing of this document logic app does not support SPN or MSI authentication with Azure
SQL. Project MUST change the connectivity type, when SPN or MSI authentication is available with Azure
SQL connector.

Logic App to Azure Data Lake Storage Gen1

In this pattern, logic app and logic app API connector is deployed by I&A landscape in DEV environment.
For higher environments, this MUST be deployed via azure devops CI/CD pipeline.
Azure Data Lake storage gen1 api connector MUST be deployed using SPN credentials.


SPN credentials MUST be stored in product specific azure keyvault.


Product teams WILL NOT have permissions to read secrets, credentials from product specific azure keyvault.
Unilever I&A landscape will deploy logic app using following workflow:

Project team WILL be provided with required templates to run these deployments from azure devops.

Logic App to Sharepoint

In this pattern, logic app and logic app API connector is deployed by I&A landscape in DEV environment.
For higher environments, this MUST be deployed via azure devops CI/CD pipeline.
Sharepoint api connector SHOULD be deployed using AD user credentials which have required permissions
on sharepoint.
I&A Landscape WILL create sharepoint api connector without credentials.
AD user credentials SHOULD be entered manually by project team.
Project team MUST manage AD user password expiry.
Project team is responsible to document and implement process to manage AD user credential expiry.
Unilever I&A landscape will deploy logic app using following workflow:


Project team WILL be provided with required templates to run these deployments from azure devops.

Logic App to O365 email

In this pattern, logic app and logic app API connector is deployed by I&A landscape in DEV environment.
For higher environments, this MUST be deployed via azure devops CI/CD pipeline.
O365 email api connector SHOULD be deployed using AD user credentials which has permissions to send
email.
I&A Landscape WILL create O365 email api connector without credentials.
O365 user credentials SHOULD be entered manually by project team.
Project team MUST manage O365 user password expiry.
Project team is responsible to document and implement process to manage O365 email user credential
expiry.
Unilever I&A landscape will deploy logic app using following workflow:


Project team WILL be provided with required templates to run these deployments from azure devops.


Section 2.11 - Microsoft PowerApp


Microsoft PowerApp allows building apps in a short time without having to write code. It allows the creation of workflows that implement business processes.

PowerApp is allowed in the Unilever I&A landscape in the following patterns:

The PowerApp visual for Power BI is approved for an interactive Power BI interface. This plug-in allows users to take action on business insights from within a Power BI report and observe the impact in the same report.

PowerApp to SQL connectivity: (To read data from SQL)

1. PowerApp to SQL connectivity MUST use Azure AD auth via “set admin” option.
2. Landscape WILL configure azure AD admin using “set admin” configuration.
3. Project SHOULD create new azure AD MFA enabled group that consists of all end-users of the application
along with application specific azure AD login (referred to as service account).
4. AD group created in step 3 SHOULD be provided READ only access on the required tables only. (This
should be added as contained group.)
5. Project SHOULD use service account when creating power app.
6. When users connect to the PowerApp using their own AD account, access to backend data will also be based on the user's own AD account.

PowerApp to SQL connectivity: (To write data to SQL)

1. Write back to SQL from powerapp SHOULD be configured via azure logic app.
2. For details on logic app to SQL connectivity please refer to Section 2.10 - Azure Logic App.
3. PowerApp automation SHOULD be configured to call logic app HTTP request endpoint.
4. Logic app HTTP endpoint SHOULD be configured to use Shared Access Signature (SAS) in the endpoint's
URI.
5. An HTTP request to the logic app follows this format: https://<request-endpoint-URI>?sp=<permissions>&sv=<SAS-version>&sig=<signature>
6. Each URI contains the sp, sv, and sig query parameters, as described below:
a. sp: Specifies permissions for the permitted HTTP methods to use.
b. sv: Specifies the SAS version to use for generating the signature.


c. sig: Specifies the signature to use for authenticating access to the trigger. This signature is generated
by using the SHA256 algorithm with a secret access key on all the URL paths and properties. Never
exposed or published, this key is kept encrypted and stored with the logic app. Your logic app
authorizes only those triggers that contain a valid signature created with the secret key.
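
A minimal, hedged sketch of a caller invoking such an HTTP request trigger from Python; the endpoint URL (including its sp, sv and sig SAS parameters) and the payload fields are placeholders copied from a hypothetical logic app:

import requests

# Placeholder callback URL from the logic app's "When a HTTP request is received" trigger;
# the sp, sv and sig query parameters carry the SAS described above.
endpoint = (
    "https://<region>.logic.azure.com/workflows/<workflow-id>/triggers/manual/paths/invoke"
    "?api-version=2016-10-01&sp=<permissions>&sv=<SAS-version>&sig=<signature>"
)

payload = {"recordId": 42, "comment": "approved"}  # hypothetical write-back payload

response = requests.post(endpoint, json=payload, timeout=30)
response.raise_for_status()  # the logic app authorizes the call by validating the signature
print(response.status_code)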

PowerApp Licensing:

Please work with Technology services - Collaboration services team for power app and power app connector
licensing.


Section 2.12 - Azure Cache for Redis


Overview

Azure Cache for Redis provides an in-memory data store based on the open-source software Redis. Redis is an in-memory data structure store that can be used as a database, cache and message broker.

When used as a cache, Redis improves the performance and scalability of systems that rely heavily on backend
data-stores. Performance is improved by copying frequently accessed data to fast storage located close to the
application. With Azure Cache for Redis, this fast storage is located in-memory instead of being loaded from disk by
a database.

Azure Cache for Redis offers access to a secure, dedicated Redis cache. It is managed by Microsoft, hosted on
Azure, and accessible to any application within or outside of Azure.

CACHE FOR REDIS TIERS

Feature | Basic | Standard | Premium
Cache Nodes | 1 node | 2 nodes | 2 nodes
Cache Size | 250 MB – 53 GB | 250 MB – 53 GB | 6 GB – 120 GB
SLA | No SLA | 99.9% | 99.9%
Geo-replication | No | No | Yes
VNET integration | No | No | Yes

Azure cache for redis approved pattern:

1. Azure Cache for Redis is approved for usage within the Unilever I&A environment in the cache-aside pattern.
2. It is NOT part of the standard architecture but is approved for usage on a case-by-case basis.

CACHE-ASIDE PATTERN ARCHITECTURE:

CACHE ASIDE PATTERN DATA FLOW:
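
A minimal, hedged Python sketch of the cache-aside flow; the cache host name, key naming, TTL and the load_product_from_database helper are assumptions for illustration (the access key itself should come from Key Vault, per the security guidelines below):

import json
import redis

# Placeholder connection details; in practice the access key is read from Key Vault at runtime.
cache = redis.StrictRedis(
    host="<cache-name>.redis.cache.windows.net",
    port=6380,
    password="<access-key-from-key-vault>",
    ssl=True,
)

def get_product(product_id: int) -> dict:
    """Cache-aside: try the cache first, fall back to the backend store on a miss."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                      # cache hit
    product = load_product_from_database(product_id)   # hypothetical data-access helper
    cache.setex(key, 3600, json.dumps(product))        # populate the cache with a 1-hour TTL
    return product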


SECURITY GUIDELINES

1. Azure Cache for redis credentials MUST be stored in key vault.


2. Azure cache for redis credentials MUST NOT be hardcoded in application.
3. App service MSI WILL be assigned READ permission on key vault.
4. For production environment, project MUST use standard or premium cache.
5. Basic cache is NOT approved for production environments.
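
In line with the guidelines above, a hedged sketch of reading the cache access key from Key Vault using the app's managed identity; the vault URL and secret name are placeholders:

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# DefaultAzureCredential picks up the App Service managed identity at runtime.
client = SecretClient(
    vault_url="https://<product-keyvault>.vault.azure.net/",  # placeholder vault URL
    credential=DefaultAzureCredential(),
)
redis_access_key = client.get_secret("redis-access-key").value  # placeholder secret name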

REFERENCES

Refer to MS best practices on usage of Azure Cache for Redis: https://docs.microsoft.com/en-us/azure/azure-cache-for-redis/cache-best-practices


Section 2.13 - Power BI


Overview
Power BI Service
Power BI Architecture
Power BI Data Sources
Within Unilever I&A Tech, following sources are allowed
Security & Authentication
Power BI Pro Licensing
Power BI Premium Offering
What is Power Bi Premium
Premium offering in Unilever
Access to Power BI
When to use custom development VS Power BI

Overview

Power BI is a cloud-based business analytics service that enables:

fast and easy access to data


a live 360º view of the business
data discovery and exploration
insights from any device
collaboration across the organization
anyone to visualize and analyse the data

Power BI Service

Live Dashboards
Interactive Reports
Data Visualizations
Mobile Applications
Natural Language query


Type questions in plain language – Power BI Q&A will provide the answers
Q&A intelligently filters, sorts, aggregates, groups and displays data based on the key words in the question
Sharing with Others

Power BI Architecture

Power BI Data Sources

Power BI connects to most of the popular databases as explained in the point above.
First, you need to determine if the database can be connected to Power BI. Some databases may need the
corresponding ODBC drivers to be installed.
Any database which is hosted on a server within the company's network is considered an On Premise Data Source.
If you want to use any On Premises data source and have a scheduled data refresh then it needs a bridge called
'On Premise Data Gateway'.
Such solutions need to be discussed with the TDA team to get the necessary approvals for setting up the Data
Gateway.

Within Unilever I&A Tech, following sources are allowed

Live connectivity with SQL Server Analysis Services


Integration with Azure Services
Azure Analysis Services
Azure Data Warehouse
Azure Data Lake
Excel and Other Power Bi Desktop Files.

Security & Authentication

When sharing data, you always need to assess who should access it and restrict the access accordingly. You can do this by restricting access to reports using Active Directory groups, by restricting access to specific values in the dataset, or, if required, via both.
Visit these links and post your questions in chatter for more information.
https://docs.microsoft.com/en-us/power-bi/service-admin-rls
https://docs.microsoft.com/en-us/power-bi/developer/embedded-row-level-security
Also, there might be a level of security implemented in your data source, if you are using managed databases of
any kind as backend. However, when you extract data in Power BI and publish it, it will be embedded in your report
(sort of) and you'll be solely responsible for the access restrictions.

Hosting Restricted Data on Power BI


Yes, we have got the information security clearance for hosting Restricted Data on Power BI. However, if the data
involves Personal Sensitive information, it needs to be specifically cleared with the Information Security Team.
The responsibility of getting the necessary clearance, hosting the data and also the accountability in case of any
breach is purely with the respective Project Teams and not with the FE team.

User authentication
Data source security
Authentication methods
Data source authentications in Power BI Desktop

Data refresh

Real-time visibility
using the Power BI REST API or with built-in Azure Stream Analytics integration


Live connectivity
data is updated as user interacts with dashboard or report
to existing on-premise sources, e.g. Analysis Services, with auto-refresh
to Azure SQL Database with auto refresh
Automatic and scheduled refresh
regardless of where data lives
SaaS data sources (automatic)
Schedule refreshes for on-premise data sources with Personal Gateway
Scheduled refresh using Power BI Personal Gateway (on-premises sources)
Personal Gateway empowers the business analyst to securely and easily refresh on-premise data
No help from IT required to setup Personal Gateway (on local machine) or schedule refreshes
With the Power BI Desktop or Excel and the Personal Gateway, data from a wide range of on-premises sources can be imported and kept up-to-date
The Gateway installs and runs as a service on your computer, using a Windows account you specify.
Data transfer between Power BI and the Gateway is secured through Azure Service Bus

Power BI Pro Licensing

While it's free to publish a report in the Power BI Service, one needs a Pro license to share the report with other users.
Any user who accesses these reports also needs a Pro license.

Publishing a report on Power BI Premium has a different licensing policy which necessitates only the publisher to
have a Pro license and the users who access the shared reports can be free users.

Acquiring a Pro License in Unilever


Power BI Pro licenses need to be purchased by the individual project teams. Unilever has centrally negotiated the
price for these licenses with our vendors. Here is the procedure for procuring it:

Write to [email protected] with a copy to [email protected] and Tanvi.


[email protected] giving them the details of the no of licenses and the company, country under which it
will be bought
The SAM demand management team will send you the Quote and the Vendor details
Raise a DO on the vendor for the required quantity. After raising the DO you need to raise an IT request to
get these licenses allocated to the individual users. Please note, you need to raise just one request for the
group of users. No need to raise individual request for each user.
Under IT Request – choose 'software and access' – choose 'business application access' – choose Power BI and provide the details (ref. the screenshot given below)
Just attach the DO and the list of user ids
Also provide the business justification e.g. Required for so and so project etc.
For these kinds of requests it is mandatory to get it approved by the LM. So, this request will go to your LM
for approval. Get this approved.

Power BI Premium Offering


What is Power Bi Premium

Power BI Premium is slightly different from the normal Power BI Service in terms of performance and scalability.
While Power BI Service is a capacity shared by all customers of Microsoft who have bought that service,
Power BI Premium is a dedicated capacity bought by an enterprise for its users.
Since it is a dedicated capacity, it can provide better performance and few other advantages such as –
Easier sharing of reports with other users where the users accessing the reports need not have Pro
license


Better data refresh frequency


You can look at Power BI Premium as a separate logical folder belonging to a particular enterprise (like Unilever) on the shared Power BI service (powerbi.com). Users still need to log into powerbi.com to access any reports hosted on Power BI Premium.
The user sees all the Dashboards shared with him/her in the section "Shared with me" irrespective of
if it is hosted on Premium or in shared capacity. The user cannot make out whether it is hosted on
Premium or not.

A few points that apply to Premium capacity:

Total size of all models can be much more than the available memory of the capacity
It manages operations by priority to make the most use of available resources. Low memory can result in
eviction
The system makes decisions based on current resource usage
Since Premium capacity is fixed, performance can be significantly influenced by resource scarcity
Through improper management it's possible to get inconsistently bad performance in Premium even with well
optimized models

Premium offering in Unilever

Unilever has acquired Power BI Premium capacity which is being shared with a number of projects. To enable
users to utilize the full benefits of Premium we are providing end users and project teams the opportunity to host
their solutions on the Premium Capacity as part of a temporary arrangement until we go live. We are calling this interim phase the 'Soft Launch'.
For a project that wants to host their Workspaces on premium, following information is required:

1. Number of expected users under below two categories:


a. Total Number of users who will have access
b. Estimated number of Distinct users in a week
2. Filled up Agreement Document
3. Email Approval from respective L3 (for the purpose of cross charging the cost as per the cost model
mentioned in the document)

Access to Power BI

Any user with a valid Unilever Id can have access to Power BI. Power BI has two components – Power BI Desktop
and Power BI Service.

Power BI Desktop: This is a free utility which can be downloaded and installed by users on their local desktop/laptop. It is primarily used for developing Power BI reports which can later be published and shared on the Power BI Service

Power BI Service: This is a cloud based service offered by Microsoft which allows free access to all users with a
valid Unilever id. Every Unilever Id has a free account on this service where the reports can be published and
shared. This service is accessible through the link https://powerbi.microsoft.com. If a user has a Pro License or is a
member of a premium workspace, they would be able to avail all features of Power BI Service.

Power BI Mobile: Users can connect to your on-premises and cloud data from the Power BI mobile apps. Try
viewing and interacting with your Power BI dashboards and reports on your mobile device — be it iOS (iPad,
iPhone, iPod Touch, or Apple Watch), Android phone or tablet, or Windows 10 device.

When to use custom development VS Power BI

Power BI is the preferred tool for reporting as per the Ecosystem 3.0 guidelines. It caters to a multitude of scenarios
and requirements across the business.
However, there are use cases where Power BI currently is not the tool of choice as it does not satisfactorily cater to

the requirements. Most of these requirements fall under Data Science and advanced analysis.
The preferred way forward for these requirements is custom development using managed code (.NET etc.).
Listed below are scenarios where Power BI will not be the preferred tool of choice:

Multiple Scenario What If Analysis


Descriptive Statistics on a data set
Correlations
Two-Sample T-Test
Paired t-test
Crosstabulation and Chi-Squared Test of Independence
Change Tolerance/Threshold levels

The main reasons are that huge datasets are involved and these require a lot of processing. R visuals can be used, but so far the R implementation in Power BI is limited, as called out in the points below:

Data size limitations – data used by the R visual for plotting is limited to 150,000 rows. If more than 150,000
rows are selected, only the top 150,000 rows are used and a message is displayed on the image.
Calculation time limitation – if an R visual calculation exceeds five minutes the execution times out, resulting
in an error.
Relationships – as with other Power BI Desktop visuals, if data fields from different tables with no defined
relationship between them are selected, an error occurs.
R visuals are refreshed upon data updates, filtering, and highlighting. However, the image itself is not
interactive and cannot be the source of cross-filtering.
R visuals respond to highlighting other visuals, but you cannot click on elements in the R visual in order to
cross filter other elements.
Only plots that are plotted to the R default display device are displayed correctly on the canvas. Avoid
explicitly using a different R display device.


Power BI performance
Introduction
Which part is slow?
Tuning the data refresh
Verify that query folding is working
Minimize the data you are loading
Consider performing joins in DAX, not in M
Review your applied steps
Make use of SQL indexes
Tuning the model
Use the Power BI Performance Analyzer
Remove data you don’t need
Avoid iterator functions
Use a star schema
Visualization Rendering
Lean towards aggregation
Filter what is being shown
Testing Performance of Power BI reports
Tools for performance testing

Introduction

Performance tuning Power BI reports requires identifying the bottleneck and using a handful of external
applications. This section covers how to narrow down the performance problem, as well as general best practices.

Which part is slow?

There are three main areas where there might be a slowdown:

Data refresh
Model calculations
Visualization rendering

Identifying which one of these is the problem is the first step to improving performance. In most cases, if a report is
slow it’s an issue with step 2, your data model.

Tuning the data refresh

Usually you are going to see a slow refresh when you are authoring the report or if a scheduled refresh fails. It's important to tune your data refresh to avoid timeouts and minimize how much data you are loading.

VERIFY THAT QUERY FOLDING IS WORKING

If you are querying a relational database, especially SQL Server or Data Warehouse, then you want to make sure
that query folding is being applied. Query folding is when M code in PowerQuery is pushed down to the source
system, often via a SQL query. One simple way to confirm that query folding is working is to right-click on a step and select View Native Query. This will show you the SQL query that will be run against the database. If you have
admin privileges on the server, you can also use extended events to monitor the server for queries.

Some transformation steps can break query folding, making any steps after them unfoldable. Finding out which
steps break folding is a matter of trial and error. But simple transformations, such as filtering rows and removing
columns, should be applied early.

MINIMIZE THE DATA YOU ARE LOADING


If you don’t need certain columns, then remove them. If you don’t need certain rows of data, then filter them out.
This can improve performance when refreshing the data.

If your Power BI file is more than 100MB, there is a good chance you are going to see a slowdown due to the data
size. Once it gets bigger than that it is important to either work on your DAX code, or look into an alternative
querying/hosting method such as DirectQuery or Power BI Premium.

CONSIDER PERFORMING JOINS IN DAX, NOT IN M

If you need to establish a relationship purely for filtering reasons, such as using a dimension table, then consider
creating the relationship in DAX instead of in PowerQuery. DAX is blazing fast at applying filters, whereas Power
Query can be very slow at applying a join, especially if that join is not being folded down to the SQL level.

REVIEW YOUR APPLIED STEPS

Because Power Query is a graphical tool, it can be easy to make changes and then forget about them. For
example, sometimes people sort the data during design but that step is costly and often is not required. Make sure
such extra steps are not left behind; leaving them in can be terrible for data loading performance.

MAKE USE OF SQL INDEXES

If your data is in a relational database, then you want to make sure there are indexes to support your queries. If you
are using just a few columns, it may be possible to create a covering index that covers all of the columns you need.

Tuning the model

When someone says that a Power BI report is slow, it is usually an issue with the DAX modelling. Unfortunately,
that fact isn’t obvious to the user and it can look like the visuals themselves are slow. There is a tool to identify the
difference: the Power BI Performance Analyzer .

USE THE POWER BI PERFORMANCE ANALYZER

If your report is slow, the very first thing you should do is run the Power BI performance analyzer. This will give you
detailed measurements of which visuals are slow as well as how much of that time is spent running DAX and how
much is spent rendering the visual. Additionally, this tool gives you the actual DAX code being run behind the scenes, which you can run manually with DAX Studio.

REMOVE DATA YOU DON’T NEED

Because of the way the data is stored in Power BI, the more columns you have the worse compression and
performance you have. Additionally, unnecessary rows can slow things down as well. Two years of data is almost
always going to be faster than 10 years of the same data.

Additionally, avoid columns with a lot of unique values such as primary keys. The more repeated values in a
column, the better the compression because of run-length encoding. Unique columns can actually take up more
space when encoded for Power BI than the source data did.

AVOID ITERATOR FUNCTIONS

Iterator functions will calculate a result row by agonizing row, which is not ideal for a columnar data store like DAX.
There are two ways to identify iterator functions. The aggregation functions generally end in an X: SUMX, MAXX,
CONCATENATEX, etc. Additionally, many iterators take in a table as the first parameter and then an expression as
the second parameter. Iterators with simple logic are generally fine, and sometimes are secretly converted to more
efficient forms.

USE A STAR SCHEMA


Using a star schema, a transaction table in the center surrounded by lookup tables, has a number of benefits. It
encourages filtering based on the lookup tables and aggregating based on the transaction table. The two things
DAX is best at is filtering and aggregating. A star schema also keeps the relationships simple and easy to
understand.

Visualization Rendering

Sometimes the issue isn’t necessarily the data model but the visuals. I’ve seen this when a user tries to put >20
different things on a page or has a table with thousands of rows.

LEAN TOWARDS AGGREGATION

The DAX engine, Vertipaq, is really good at two things: filtering and aggregations. This means it’s ideal for high
level reporting like KPIs and traditional dashboards. This also means it is not good at very detail-heavy and granular
reporting. If you have a table with 10,000 rows and complex measures being calculated for each row, it’s going to
be really slow. If you need to be able to show detailed information, take advantage of drill-through pages or report
tooltips to pre-filter the data that is being shown.

FILTER WHAT IS BEING SHOWN

Unless you are using simple aggregations, it’s not advisable to show all of the data at once. One way to deal with
this is to apply report or page filters to limit how many rows are being rendered at any one time. Another option is to use drill-through pages and report tooltips to implicitly filter the data being shown.

Limit how many visualizations are on screen. The part of Power BI that renders the visualization is single-threaded
and can be slow sometimes. Whenever possible, try to not have more than 20 elements on screen. Even simple
lines and boxes can slow down rendering a little bit.

Testing Performance of Power BI reports

Here are the recommended steps that should be done in order to identify and benchmark the performance of Power
BI reports.

Run the Azure speed test


Identify response times when browsing reports from Unilever network in the concerned market
Identify response times when browsing reports from Unilever network in Bangalore
Identify response times when browsing reports from outside the Unilever network
Combine the results with acceptable numbers
Minimum 3 tests need to be done

Tools for performance testing

There are few tools that allow performance testing on Power BI.
One of them is the ‘performance analyser ’ that is part of Power BI desktop.
There’s another PowerShell based tool that runs performance tests on Power BI reports by passing dynamic
parameters. You can define how many reports to run in parallel and how many instances of each report. A nice
video describing the usage is here .


Livewire Learnings - Power BI

There was a joint exercise carried out with Microsoft to improve the efficiency of designing Power BI reports. Key takeaways were:

POWER BI MODELLING

It is important to trace queries at development time to better understand the efficiency of the report and whether there is any scope for improvement. Use Performance Analyser, DAX Studio or SQL Profiler.
Formulas should consider only filtered data rather than calculating over the whole data set.

POWER BI UX

Avoid sub-totals and conditional formatting in the same matrix visual.
Avoid using Table/Matrix visuals to create card insights/single-value insights.
Avoid overlapping elements such as empty textboxes and rectangle shapes, or filters placed on top of a matrix
/table, which cause problems with interaction.
Avoid overlapping shapes as much as possible, especially on interactive elements, to prevent visuals from
disappearing when the user clicks on a blank area which happens to be a shape.
Configure interactions to cause only the minimal required effect on other visuals.
Only use multi-select for slicers where needed and agreed with the business.
Prefer the filtering capability of the filter pane, and keep only the necessary slicers in the visual pane.
Assign default values to slicers and avoid retaining 'All', so that the amount of data needed to render a
page is reduced.
Use ample white space/empty space to group related sections together, or use a line to separate the different
sections instead of relying on adding shapes in the background.


Section 2.14 - Azure Search


Overview

Azure Search is a search-as-a-service cloud solution that gives developers APIs and tools for adding a rich search
experience over private, heterogeneous content in web, mobile, and enterprise applications.

Azure search tiers

Service tier        Free      Basic       Standard S1
Availability SLA    No        Yes         Yes
Max documents       10,000    1 million   180 million (15 million per partition)
Max partitions      N/A       1           12
Max replicas        N/A       3           12
Max storage         50 MB     2 GB        300 GB (higher SKUs in the Standard tier support more than 300 GB of storage)

Azure search approved pattern:

1. Azure Search is approved for usage within the Unilever I&A environment to create indexes over large undifferentiated
text, image files, or application files such as Office content types held on an Azure data source such as Azure
Blob storage.
2. It is NOT part of the standard architecture but is approved for usage on a case-by-case basis.

Azure search sample architecture


Security guidelines

1. By default, Azure Search listens on HTTPS port 443. Across the platform, connections to Azure services are
encrypted.
2. Azure Search data MUST be encrypted at rest using Azure Storage service encryption.
3. It is recommended to use Microsoft-managed keys for storage service encryption.
4. Azure Search access keys MUST be stored in Key Vault.
5. Web applications MUST NOT hardcode Azure Search access keys.
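
A minimal sketch of guidelines 4 and 5, assuming a Python application: the query key is read from Key Vault at run time and used to call the Azure Search REST API over HTTPS. The vault name, search service name, index name and secret name are placeholders, not prescribed values.

```python
# Minimal sketch: read the Azure Search query key from Key Vault at run time
# (never hardcoded) and issue a search request over HTTPS. Names are placeholders.
import requests
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

KEY_VAULT_URL = "https://<key-vault-name>.vault.azure.net"        # hypothetical
SEARCH_ENDPOINT = "https://<search-service>.search.windows.net"   # hypothetical
INDEX_NAME = "documents"                                          # hypothetical

# The application identity (SPN/MSI) must have 'get' permission on secrets.
secret_client = SecretClient(vault_url=KEY_VAULT_URL, credential=DefaultAzureCredential())
query_key = secret_client.get_secret("search-query-key").value    # key stored in AKV

# Query the index over HTTPS (port 443) using the REST API.
response = requests.get(
    f"{SEARCH_ENDPOINT}/indexes/{INDEX_NAME}/docs",
    params={"api-version": "2020-06-30", "search": "invoice"},
    headers={"api-key": query_key},
    timeout=30,
)
response.raise_for_status()
for doc in response.json().get("value", []):
    print(doc)
```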

References

Azure Search documentation: https://docs.microsoft.com/en-in/azure/search/

Azure Search security documentation: https://docs.microsoft.com/en-in/azure/search/search-security-overview


Section 3 - Design Patterns


The design patterns section covers the different architectural patterns approved in I&A and the types of connections
allowed between the different components.

Approved Connection Types

Each entry below lists the two components, the connection type, whether the connection is established at deploy time or run time, whether Key Vault is read, whether the pattern is allowed, and any comments on connection specifics.

ADF -> ADLS: SPN, Deploy Time, Key Vault read: No. Allowed: Yes.

ADF -> SQL DW: SPN, Deploy Time, Key Vault read: No. Allowed: Yes.

ADF -> Databricks: SPN, Deploy Time, Key Vault read: No. Allowed: Yes.

Databricks -> ADLS: SPN, Deploy Time, Key Vault read: No. Allowed: Yes.
A Key Vault-backed secret scope is used. Credentials are removed so that users cannot access them; users have no access to the credentials.

SQL DW -> ADLS: SPN, Deploy Time, Key Vault read: No. Allowed: Yes.

Databricks -> SQL DW: SQL, Run Time, Key Vault read: Yes. Pattern not suggested.
Databricks requires either SQL Server credentials (which cannot be shared for security reasons) or access to Key Vault to read the SQL credentials. Anyone with notebook execution access in production would be able to read the credentials, which is a security risk, hence the pattern is not continued.

AAS -> SQL DW: SQL, Deploy Time, Key Vault read: Yes. Allowed: Yes.
CI/CD looks up the SQL credential. Suggested using Runbooks and webhooks only; Runbooks are maintained centrally by landscape.

Azure Functions -> AAS: SPN, Run Time, Key Vault read: Yes. Pattern not suggested.
Requires access to Key Vault, and read access on the Key Vault provides access to all credentials in the Key Vault, including SQL DW.

Databricks -> Log Analytics: Key, Run Time, Key Vault read: Yes. Allowed: Yes.
A Key Vault-backed secret scope was added to store the Log Analytics credentials.

ADF -> Log Analytics: Key, Run Time, Key Vault read: Yes. Not suggested until this is sorted with the ADF MS team.
Testing failed; unable to make it work without hardcoding or using another compute to create the auth token.

Batch -> AAS: Certificate, Run Time, Key Vault read: Yes. Pattern not suggested.
The default turnkey in Amsterdam does not grant Key Vault access to the SPN and requires manual intervention. This is mainly used in the Dublin setup but is no longer suggested for new projects.

Azure Web App -> AAS: MSI, Deploy Time, Key Vault read: No. Not allowed; pattern still to be tested.

Azure Web App -> SQL DW / DB: MSI, Deploy Time, Key Vault read: No. Allowed: Yes.
The Web App makes use of the Microsoft.Azure.Services.AppAuthentication NuGet package to authenticate with the database and hence does not need to supply credentials at run time.

Batch -> ADF: Certificate, Key Vault read: No. Pattern not suggested any more; not in scope.

Power App -> SQL: Azure AD auth (SSO), Run Time, Key Vault read: No. Allowed: Yes.
Users need to be added to an MFA-enabled AD group which is given access on the required SQL tables. Power Apps connects to the underlying system using single sign-on, i.e. the credentials of the logged-in user.

Logic App -> ADLS: SPN, Deploy Time, Key Vault read: Yes. Allowed: Yes.
Dev deployment is performed by landscape using an ARM template which references credentials from AKV. Higher-environment deployment is to be performed by the product team using CI/CD. (Landscape automation development is in progress; for immediate requirements deployment will be manual.)

Logic App -> SQL: SQL credentials, Deploy Time, Key Vault read: Yes. Allowed: Yes.
Same deployment model as Logic App -> ADLS above.

Logic App -> Log Analytics: SPN, Deploy Time, Key Vault read: Yes. Allowed: Yes.
Same deployment model as Logic App -> ADLS above.

Logic App -> SharePoint: SharePoint API credentials, Deploy Time, Key Vault read: No. Allowed: Yes.
Landscape creates the Logic App and API connector; the project team is responsible for adding the SharePoint credentials.

Logic App -> O365: O365 credentials, Deploy Time, Key Vault read: No. Allowed: Yes.
Landscape creates the Logic App and API connector; the project team is responsible for adding the O365 credentials.

WebApp -> Azure Cache for Redis: cache access keys, Run Time, Key Vault read: Yes. Allowed: Yes.
Application code fetches the keys from Key Vault and uses them to access the cache.

WebApp -> Azure Search: search keys, Run Time, Key Vault read: Yes. Allowed: Yes.
Application code fetches the keys from Key Vault and uses them to access the Search API.

Section 3.1 - Streaming Analytics


Streaming Ingestion Technologies

ADF together with Apache Kafka topics can be used to get the required data into the data lakes.
HDInsight Kafka can be used to stream data into the data lake and expose it for further processing by
Databricks.
Stream Analytics starts with a source of streaming data. The data can be ingested into Azure from a device
using an Azure Event Hub or IoT Hub. This is the preferred pattern for streaming from IoT sources or event
streams from connected devices, services and applications.
Internal data can be streamed using IoT Hub and Stream Analytics.
External data can be streamed using Event Hubs and Stream Analytics (a minimal producer sketch follows this list).
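
A minimal sketch of the Event Hubs ingestion path above, assuming the azure-eventhub Python SDK; the connection string, hub name and payload are placeholders, and in practice the connection string would be kept in Key Vault.

```python
# Minimal sketch: a producer sends device/application events into an event hub,
# from where Stream Analytics (or Databricks) picks them up. Names are placeholders.
import json
from azure.eventhub import EventHubProducerClient, EventData

CONNECTION_STR = "<event-hub-namespace-connection-string>"  # hypothetical, keep in Key Vault
EVENT_HUB_NAME = "telemetry"                                # hypothetical

producer = EventHubProducerClient.from_connection_string(CONNECTION_STR, eventhub_name=EVENT_HUB_NAME)
with producer:
    batch = producer.create_batch()
    for reading in [{"device": "line-01", "temp_c": 21.4}, {"device": "line-02", "temp_c": 22.9}]:
        batch.add(EventData(json.dumps(reading)))  # additive delta events, as recommended above
    producer.send_batch(batch)
```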

Patterns for Streaming

Suitability of the data for Streaming

Master data streaming


Only additive deltas can be streamed; other, more complex delta mechanisms may not be suitable for streaming.
Unpredictable incoming streaming patterns, such as out-of-sequence and late events, can be streamed but only
with complex logic when processing them further.
High-scale and continuous streams of data are suitable.
Skewed data with unpredictable streams cannot be streamed, as managing latency and throughput will be
difficult.
External lookups while streaming are memory-consuming operations and hence need additional mechanisms
such as cached reference data, additional memory allocation, etc. to enable effective streaming.

Architectural Patterns

NRT Streaming – every 15 minutes/one hour and processed immediately – only where needed
Lambda – data is fed into both the batch layer and the speed layer. The speed layer computes real-time views while
the batch layer computes batch views at regular intervals. The combination of both covers all needs.


Kappa – data is processed as a stream and the processed data is then served for queries

Streaming outside the lakes (all kinds of streaming sources) directly into products (Kappa)

No historical data snapshots


Analyze streaming data directly at the source or in standalone Azure components like SQL DB/SQL DW


Section 3.2 - Data Science


Analytics has evolved from a simple number-crunching exercise used to solve problems into an aid to decision
making. Business analysts and data scientists play a major role in extracting meaningful insights to define
innovative strategies across organizations.

The important question to address today is: how easy is the data science lifecycle? Getting an environment, the right data
sets, knowledge of scalable tools and technologies, and industrialization for big data all leave our data
scientists dependent on many other supporting teams before the actual work can even start. In a world where a fail-fast
strategy works well, it is not acceptable for strategic decisions to wait long for answers.

This article covers the measures taken to make the life of a data scientist easier in the Unilever I&A Azure
environment.

Main challenges of Data Scientist

Domain expertise is the knowledge that the data scientist brings; the assumption here is that the data scientist already
knows the problem statement and the data required to solve it.

Easy or Quick access to data

Easy access to raw and cleansed data for problem solving. The Universal Data Lake (UDL) and Business Data
Lake (BDL) have made data availability easier and faster, as all data is available in one place.

Though the UDL and BDL hold extensive data, there could be cases where some of the data sets required by a
data scientist are not in the UDL or BDL. In those cases data scientists can bring their own data into the environment
for a quick pilot while the same data is prioritized for ingestion into the UDL and BDL.

All data at one place with right cataloging, as a sharable asset

UDL – Most granular data for consumption

BDL – Sharable/Global business logic applied data

Cataloguing of data lineage and business logic details

Data in cleansed format for as-is usage

Quick access and the right governance on the lake

Bring your own data wherever required

Data scientists need to view the catalogue, identify the data and get access to it with the right approvals.


Hardware and Software (Tools and Technologies)

The main challenge a data scientist encounters today is knowing which tools and technologies to use to derive
the outcome. To overcome this, the Unilever I&A Technology team has worked with Microsoft and the data
science COE team to come up with a list of experimentation scenarios and tools which can be used by data scientists.

Time-bound experimentation environments for quick pilots – look for the scenario that the use case falls into

Scenario 1 : Self Service tools from user laptop


Mainly used by citizen data scientists who do not want to code. Data is made available for quick
access.
Works well when the data size is small (MBs)

Scenario 2 : Data science workstations on Cloud with GPU scale capacity


Used by both citizen and hardcore data scientists, who can code using R or Python.
Works well for GBs of data, but not very large volumes (30-50 GB)

Scenario 3 : Azure PAAS tools


Mainly for hardcore data scientists who have knowledge of Spark, R and Python and write their
own code to arrive at an outcome.
Works well with big data and when the solution needs to be industrialized immediately after the pilot.

Quick provisioning of environment : Takes 15 minutes to provision an environment

Cost : Pay only for what you use (number of hours of usage). Pause when not in use.

Code Repo availability to manage the code at one central place

SCENARIO 1:

User responsibilities
Tools with MFA support: Azure supports access only through a Unilever ID and MFA-enabled tools
Tool installation & licenses have to be taken care of by the data scientist/user
Cost involved: no cost on Azure if only data is accessed from UDL/BDL

When to use?
Data size is limited (~1-3 GB, with no complex processing)
Data exists on the user laptop and/or in Azure


Compute/processing can be managed with the resources available on the user laptop


Example use case:
A quick POC with a small data set which is readily available.

SCENARIO 2:

Azure VM configuration
Standard configuration (Data Science VM)
Standard hardware configurations (N standard options)
Pre-installed (Excel, OneDrive) & internet enabled
Unilever-approved data science tools pre-installed (R, Python and data science tools are available)
Tool installation & licenses have to be taken care of by the data scientist/project team
Cost involved: Azure VMs are costed as PAYG, i.e. pay per usage. A cost monitoring tool will be provided by
I&A Tech to track spend vs. budget. Cost needs to be managed by the users of the environment.
Code sharing: VSTS Git is provided as the source code repository.

When to use? (Talk to the I&A Tech Architecture team to confirm an IaaS VM is the right component for the use case)

Compute required is larger than the user laptop but not too complex

Data size is comparatively larger (~10-15 GB, with no complex models) (exception: push-down processing)

Code and data need to be shared between multiple users of the system.

Example Use case :


Multiple users working on a use case, with data size between 10-15 GB, who require a common place
to access code and data.
Useful for citizen data scientists who want to use self-productivity tools to quickly arrive at the results of the
experimentation.

SCENARIO 3:


Tools in experimentation:
Environment with pre-loaded tools, but paused
Environment management (PAUSE & RESUME) to be managed by the data science team.
Cost involved: Azure cost is PAYG, i.e. based on usage. A cost monitoring tool is provided by I&A
Tech to manage spend vs. budget.
Cost accountability is with the service line

When to Use?
Data is in Azure
Compute required is larger than available compute in User Laptop/Azure VM
Parallel processing required
Data size is larger
Example Use case:
Experimentation for a use case which needs to be industrialized/scaled immediately after the
value is proved to end users. For example: Livewire Analytics.

How to get Environment ?


Refer section

Scaling of solution and Industrializing

Data scientists are excellent at model development on sample data, with very good results. Once the pilot
is complete, the same solution requires industrialization with E2E automation.

Knowledge of tools and technology and scaling of the solution used to be a big problem for data scientists. Unilever
I&A Tech came up with a process to industrialize solutions from the experimentation scenarios using the Azure
cloud tech stack.

E2E solution design from experimentation scenarios to industrialization

Easy migration from the pilot codebase to the industrialized product

Unlimited scaling of the product as per the requirement

E2E automation, with no manual intervention

Cost: pay only for the compute used for execution; pause when not in use

One-click deployment with E2E automation

Industrialization effort varies depending on the tools used to build the solution


One-click deployment to production through automated CI/CD tools.

Data Science Tools and Languages

Architectural Patterns for Data Science:

SCENARIO 1:

Option 1: R Studio (User Laptop) (Open Source)


Option 2: Python & Jupyter (User Laptop) (Open Source)

Option 3: Client/Self Productivity Tools (Licensed)

SCENARIO 2:

Option 1: Data Science Virtual Machine


Option 2: Client/Self Productivity Tools

SCENARIO 3:

Option 1: Data Bricks:


Option 2: Azure ML Service:


Section 3.3 - Data Distribution Strategy


Overview
Data Distribution Patterns for UDL and BDL
Application Integration Architecture
Integration Layer
Integration Options for External Platforms
Decision Tree
Application Type & Data Distribution Pattern
Process and Cataloging
Data Access process
Data Catalog to be managed to make sure all sharing details are captured:
Infrastructure – Landscape
HA and Support Model (Pending)
Security Consideration
Distribution Layer Folder Structure:
Detailed Requirements on Data Distribution
Requirements – Phase 1
Requirements – Phase 2
Micro Service Layer
Technology Stack:
E2E Flow:
Automated Batch Data Job Management

Overview

The data distribution layer is responsible for providing data lake data to consumers (internal and external). The
main vision is to provide a 'data on demand' model using different layers of granularity (including APIs) such that
data is not tightly coupled to project-specific views. Data is exposed from the sharable layers (UDL, BDL).

Unilever has derived multiple ways of providing data based on the platform hosting the consumers.

The key mandatory data distribution principles are:

Data connectivity should be de-coupled using different layers of granularity (including APIs) such that data is
not tightly coupled to project specific views. The source data APIs & connectors should provide maximum re-
use.
Should support event sourcing techniques & data topics hub approach using publish and subscribe style
model for greater agility and to make data available in a timely manner.
Must register data consumption details (API, Integration layer consumption) in a central service catalog,
business glossary, or other semantics, so that users can find data, optimize queries, govern data, and
reduce data redundancy.
Data exploration/access: identification of the right dataset to work with is essential before one starts
exploring it with flexible access. A metadata catalog should be implemented to help users discover and
understand relevant data worth analyzing using self-service tools.

Data Distribution Patterns for UDL and BDL


Application Integration Architecture

Integration Layer

Integration layer will be hosted and maintained as part of UDL or BDLs.

All data requests will go through the processes defined for UDL and BDL, which involve Data Owner and WL3
approval.

Integration layer will be managed as part of DevOps Activities. Creation of Pipelines, Containers, SAS
Tokens will be managed and governed by respective DevOps teams.

All access via integration layer will be read-only and subject to approvals.

Integration Architecture provides the following options:

Option 1: Direct Connection:


Products/Experiments can use this approach to consume data from UDL and BDL. Unilever Azure-hosted,
ISA-approved platforms can also use this option.
Option 2 : Data copy and staging layer:
This option can be used by external/third-party applications to access data.
I&A UDL/BDL will host an integration layer with Gen2, ADF, Databricks and SQL DB.
Dynamic ADF and Databricks pipelines will be built using metadata in SQL DB to copy UDL and BDL
data to ADLS Gen2.
The integration layer will only retain the latest 3 copies of data. History data will be provided as an ad hoc one-time
load and will be deleted after incremental data is made available.
Datasets will be segregated and access controlled at container/folder level, with IP whitelisting as a
second factor of authentication. External platforms connect either via SPN or through SAS tokens. User
connections will be made through RBAC.
Option 3 : Azure Data Share:
I&A Tech hosts an integration layer with an Azure Data Share component to share data with a destination
Azure Data Share. – Security approval is still in progress, and this can be used only after it is fully approved
by security.
Option 4 : Data Push from UDL, BDL:
A Unilever-hosted integration tool (ADF) can be used to share the data from UDL, BDL or the data copy
and staging layer.
The integration tool (ADF) can be part of UDL or BDL, or be separately managed in a separate environment to
share data from UDL & BDL.
Product data cannot be shared. As of today I&A does not own support for an integration tool to push the
data. It is the decision of the respective platform to support this option; otherwise the tool needs to be managed and
supported by the respective data requester.
Option 5 : Micro Service Layer:
I&A hosts an API layer which allows connection through REST endpoints.
The microservice layer and workflow are built based on the requirement. This is suitable for small data sets <
1 GB. Two-factor authentication will be applied to the microservice layer. (On the roadmap for UDL.)

Integration Options for External Platforms

Option 1: For push, a Data Factory hosted in the Unilever Azure platform can be used. There is no central "integration
as a service" as of today for UDL and BDL. UDL and BDL can decide to host this layer for pushing data into third-party
systems, or a separate integration layer needs to be created based on the requirement to push the data from
UDL/BDL. The platform where the integration tool is hosted has to manage all support activities for this
ADF itself. (https://docs.microsoft.com/en-us/azure/data-factory/connector-overview)


Option 2: External systems should have a mechanism to connect using an SPN or SAS token. Any integration tool
which supports these options can be used. If external platforms want to script it themselves, the AzCopy tool can be
used with 3-5 lines of code to extract the data on Windows or Linux operating systems (a minimal pull sketch follows
the options below).

Option 3: A custom microservice layer to be built for small data sets < 1 GB. Large data sets are not supported as
part of this pattern. External applications pull the data using a REST endpoint published in the integration layer. The
web service layer is being built and prioritized by UDL based on the requests received.
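
A minimal sketch of the Option 2 pull, assuming the azure-storage-file-datalake Python SDK: the consumer downloads a dataset from the Gen2 data copy & staging layer using a read-only SAS token (with its IP whitelisted). The account, container, dataset path and token are placeholders.

```python
# Minimal sketch: an external consumer pulls a dataset from the Gen2 staging
# layer using a SAS token issued by the UDL/BDL DevOps team. Names are placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://<staging-account>.dfs.core.windows.net"   # hypothetical
SAS_TOKEN = "<read-only-sas-token>"                              # hypothetical

service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=SAS_TOKEN)
fs = service.get_file_system_client("distribution")              # container granted to the consumer
file_client = fs.get_file_client("Non-Restricted/SupplyChain/Logistics/Global/05-08-2020/extract.csv")

with open("extract.csv", "wb") as local_file:
    local_file.write(file_client.download_file().readall())
```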

Decision Tree

Application Type & Data Distribution Pattern

I&A applications, and applications hosted in the Core Data Ecosystem subscription: (Integration Option 1)
Unilever I&A Landscape will manage the creation of the application and attaching the SPN (service
principal) for data access.
SPN credentials are not shared with any individuals.
SPN credentials are maintained only in Key Vault, which can only be accessed through an application
hosted in Azure. (Key Vault is considered the secure credential management tool approved by
security.)
Products/Experiments hosted within the I&A platform can connect directly to UDL and BDL using the
environment provided by the I&A platform team.

Non-I&A (other Unilever Azure platform applications): (Integration Option 1)


Non-I&A but Unilever Azure-hosted applications which are ISA approved can also connect directly to
UDL and BDL to pull the data, provided the required security controls mentioned below are taken care of.
The respective application will create an SPN and share only the SPN ID with the I&A platform team.
The respective application owns the security standards for SPN management using the security-approved
credential management tool (Key Vault).
I&A Landscape will attach the shared SPN to the respective data folder in UDL/BDL. (No SPN secret is
shared.)


As the applications are hosted in the Unilever platform and credentials are not exchanged in human-readable
format, this is considered a secure approach.

Non-Unilever applications hosted on Azure: (Integration Option 2 or 3)


UDL/BDL will make the data available either via Data Share or in the data copy & staging layer (Gen2
storage).
Gen2 storage will allow connection only through SPN or SAS token. Data Share will push the data
automatically to the destination Data Share using RBAC.
For Gen2 integration, any integration tool can be used as long as it supports SAS/SPN.
A second factor of authentication is added through IP whitelisting of the consumer application.

External platform connections (cloud or non-cloud) (Unilever/non-Unilever): (Integration Option 2 or 4)

UDL/BDL will make the data available either as a microservice or in the data copy & staging layer (Gen2
storage).
The respective application needs to pull the data from the Gen2 storage using the credentials
shared. Approved credentials are SPN/SAS token for Gen2 and token-based authentication for the
microservice. Any integration tool can be used here as long as it supports the mentioned connection
method.
A second factor of authentication is added through IP whitelisting of the consumer application.

Process and Cataloging

DATA ACCESS PROCESS

UDL Access : Contact UDL Dev Ops team for Remedy Request Details
BDL Access : Contact BDL Dev OPS team for respective BDL access

DATA CATALOG TO BE MANAGED TO MAKE SURE ALL SHARING DETAILS ARE CAPTURED:

The catalog should capture clear information on:

Cataloging of available data APIs, with information like


Underlying UDL/BDL data set details (Source data set & folder in UDL/BDL)
Column level details
Information on Available data sets in Data Copy and Storage layer.
Underlying UDL/BDL data set details (Source data set & folder in UDL/BDL)
Column level details
Details of any transformation done (filtering / mapping)
Frequency of the data refresh
Data Governance - Consumer of data
Consumer Application Name / Platform Name
Approvals (Data Owner, I&A Contact, Business Contact )
Consumption Method
Frequency of consumption

Infrastructure – Landscape

Integration layers will be set up in Dublin (North Europe) – New foundation design

New resource groups will be created as part of UDL and BDLs for Integration Layer.

Approved components for integration layers:


Azure Data Factory
ADLS Gen2


Databricks (mounted on UDL and BDL)


Data Share (awaiting approval from InfoSec)

DevOps teams will build ADF pipelines and Databricks notebooks in Dev environments.

Creating pipelines for datasets, and cross-charging, will be taken care of by the respective DevOps teams

Only 3 environments will be created:


Dev – Experiments and one-time data transfer
UAT – Deployment testing of automated data distribution.
Prod – Production data distribution

HA and Support Model (Pending)

Option 1
HA/DR or Business continuity process applied to UDL will apply.
Option 2 : Gen2
Gen2 Geo replication to be enabled for the Distribution Layer.
In case of region disaster, consumers should point to secondary location shared.
Detailed process will be published as part of Business Continuity process.
Option 3 : Data Share
Data share and Micro Service layer HA/DR to be planned along with MS team. Since the design is
still in progress, this needs to be planned

Security Consideration

I&A Application:
Data owner approval for sharing the data.
Non I&A (Unilever Azure Hosted Application)
Data owner approval for sharing the data.
ISA approval for consumer platform
SPN created and managed at consumer azure platform.
Non-Unilever / External Applications.
Security approval for moving data out of Unilever
Data owner Approval
Legal Approval wherever applicable.
Additional: Exceptional approval from Sri/Phin to share the data with Legacy systems (On Premise)

Distribution Layer Folder Structure:

Folder structure for UDL

Restricted
    <Business Function – (data belongs to)>
        DataSetName
            <Global – (as-is) data> /
            <Country – (when filtered on country)> /
            <FilterName/Usecase – (when filtered on a specific use case)>
                <Date/Frequency folder – date when the file is placed in the distribution layer>
                    Actual file
Non-Restricted (same structure as Restricted)

Folder structure for BDL

Restricted / Non-Restricted (separate folders for restricted and non-restricted data sets)
    BDL – (data set existing in BDL, only filtered)
        DataSetName/KPI Name (meaningful name to identify the source data set from BDL)
            <Global – (as-is) data>
            <Country – (when filtered on country)> /
            <FilterName/Usecase – (when filtered on a specific use case)>
                <Date/Frequency folder [Daily – (dd-mm-yyyy), Weekly – (WeekID-yyyy), Monthly – (mm-yyyy)]>
                    Actual file (dd-mm-yyyyhh24MI – File name.csv)
    Non-BDL (data set not existing in BDL. Taken from)
        DataSetName/KPI Name (meaningful name to identify the source data set)
            <Country – (when filtered on country)> /
            <FilterName/Usecase – (when filtered on a specific use case)>
                <Date/Frequency folder [Daily – (dd-mm-yyyy), Weekly – (WeekID-yyyy), Monthly – (mm-yyyy)]>
                    Actual file (dd-mm-yyyyhh24MI – File name.csv)
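
A minimal sketch of composing a distribution-layer path and file name following the BDL folder convention above (restricted flag, dataset, country/use-case scope, frequency folder, timestamped CSV). The helper function and dataset name are hypothetical illustrations, not a prescribed utility.

```python
# Minimal sketch: build a BDL distribution-layer path per the convention above.
# The helper name and dataset name are hypothetical.
from datetime import datetime

def bdl_distribution_path(dataset: str, restricted: bool, scope: str, run_time: datetime) -> str:
    root = "Restricted" if restricted else "Non-Restricted"
    frequency_folder = run_time.strftime("%d-%m-%Y")                       # Daily: dd-mm-yyyy
    file_name = run_time.strftime("%d-%m-%Y%H%M") + f" - {dataset}.csv"    # dd-mm-yyyyhh24MI - name.csv
    return "/".join([root, "BDL", dataset, scope, frequency_folder, file_name])

print(bdl_distribution_path("CustomerServiceLevel", False, "Global", datetime(2020, 8, 5, 6, 30)))
# Non-Restricted/BDL/CustomerServiceLevel/Global/05-08-2020/05-08-20200630 - CustomerServiceLevel.csv
```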

Detailed Requirements on Data Distribution

REQUIREMENTS – PHASE 1

In Scope
A distribution layer to share the data from the data lake with internal and external business applications.
Data will be made available as-is, in its current format, from UDL and BDL into the distribution layer.
Only delta/incremental data is made available as part of the distribution layer.
Consumers will pull the data hosted in the distribution layer; no data push into the consumer platform is scoped.
Data will be shared in flat file format, i.e. CSV.
A minimum of two-factor authentication is required for authentication of the consumer application.
No restricted data is scoped as part of the data distribution layer.
Consumer applications need to follow a set of processes to get approvals from security, legal and the data
owner before getting access to the data.
Out of Scope
Filtering of data before placing it in the distribution layer.
Converting file formats.
Integration as a service for pushing data into the consumer platform.

REQUIREMENTS – PHASE 2

In Scope
Distribution layer to share the data from data lake with internal and external business applications.
Small transformations allowed in order to take care of certain conditions like,


Country / Unilever & non-Unilever data filtering to meet legal requirements.
Column/row filtering to reduce the overall egress of data.
Joining of multiple data sets to apply filtering or reduce the overall data.
Only delta/incremental data is made available as part of the distribution layer. History data is shared only
on an ad hoc basis and is available only for a certain period.
Data in delta/parquet format should be converted to flat file format (CSV or pipe separated).
Data encryption in the distribution layer whenever restricted data is to be shared with the application.
Splitting of files into multiple small files on an exceptional basis (logic to add a file number based on the
split).
Configurable or automated data pipeline creation to take care of all the requirements, without any
manual intervention.
Automated workflow creation to take care of the approval process.
Folder format for data sets in the distribution layer.
Consumers will pull the data hosted in the distribution layer; no data push into the consumer platform is scoped.
Data will be shared in flat file format, i.e. CSV.
A minimum of two-factor authentication is required for authentication of the consumer application.
No restricted data is scoped as part of the data distribution layer.
Consumer applications need to follow a set of processes to get approvals from security, legal and the data
owner before getting access to the data.
Out of Scope
Integration as a service for pushing data into the consumer platform.

Micro Service Layer

TECHNOLOGY STACK:

E2E FLOW:


Automated Batch Data Job Management


Section 3.4 - Analytical Product Insights write-back to BDL


As part of the I&A strategy, and in order to avoid a mesh architecture, products are not allowed to share the insights
they generate with other products. The only shareable layers in the data lake are UDL and BDL, and all shareable
business KPIs should be part of BDL in order to share them with other products. However, there are certain data
science products that consume data from UDL & BDL and generate insights which are useful for other
products/business applications. As these are data science products using specific data science tools or models to
generate the insights, it does not make sense to move the entire data science processing into the BDL layer.

In order to make such insights shareable with other business use cases, a new pattern has been developed to write
the data back from data science products into BDL.

Some example use cases which fall under this category are:

Leveredge IQ: the IQ recommendations generated are useful for much of the decision making done in other
products, but IQ is hosted as a PDS solution.
NRM TPO – optimization promotions and plans are useful insights for other products to consume.

Decision as agreed with Leadership team.

Create a shareable space in respective functional BDL to ingest the PDS generated data science Insights.

The BDL functional owner is responsible for making sure the right governance is in place for the insights written
by PDS into BDL.

Architecture

Analytical (data science) products may write back insights to the BDLs, but simple sharable KPIs need to be
built into the BDLs.

The BDL owner is responsible for the governance – approvals etc.

BDL catalogue needs to be updated to include appropriate information


Consumers of this data (downstream products) need to be informed of changes & impacts

PHYSICAL ARCHITECTURE


Approval and Provisioning Process:

Data Science products can ingest the data into BDL if the below criteria are met

The process is applicable only for a Product/PDS categorized as a data science use case where the insights to be
shared are model outputs.

The product confirms the consumption of only Trusted or Semi-Trusted data from UDL and BDL to generate the
insights. No PSLZ or manual data is used to generate the insights.
BDL owner approval on the below points:
Generated insights are reviewed by the respective Business Data Lake data owner and approved as shareable.
The BDL catalog is updated with the below information and the data owner approves the catalog.
Frequency, availability and SLA clearly defined.
Logic used to generate the output documented in the catalog.
Support agreement in place between BDL and PDS for data issues and fixes.
Agreement on the process for changes. Should include below clauses.
BDL to confirm the changes are valid and approves the changes
All consumers of the data (downstream systems consuming the data from BDL) are informed of
the changes in the data and impact assessment carried out.
Approved release of changes into the product environment and in turn into BDL
Sharing of product-ingested data from BDL with other use cases:
The process for BDL data sharing is to be followed.
Additional documentation of who is consuming each data set has to be maintained, mainly to
understand the dependencies on each data set.
Architectural approval from I&A architecture team
Verification of approval and alignment with respective BDL team.
Technical architecture to write back the data into BDL

Points to be taken care of

Involve BDL SME from the design Phase if the write back is required.

BDL Team will provide the BDL Object / Folder where the write back has to happen. PDS will get access
only to that folder to write back, and PDS cannot create any further objects.


Decision is with BDL owner to approve or reject data write back into BDL.


Section 3.5 - Data Preparation/Staging Layer


Data Prep is a staging layer provided to a data consolidator to consolidate/validate the data before sharing it with
the UDL team. The data prep layer is made available on a case-by-case basis.

Some of the examples of Data Prep layer are:

Retailer data: data from multiple retailers (hundreds of small retailer shops), available as manual files and
shared with data owners through mail or some other method, requires one level of data standardization to
bring it into the right format. As long as the format of the data is standard and useful for all consumers, this
standardization can be done in the data prep layer.
Social network data: data from social networks is extracted at different time frames. Validation is required from
the data owner to verify the completeness of the data; if it is not validated, wrong/incomplete data will flow into UDL and
require data cleaning and fixes. To avoid this, the data prep layer is made available for the data owner
to validate and confirm the data is ready for ingestion into UDL.

Below are some of the criteria to be followed in Data Prep layer:

Data Prep storage is part of the I&A Tech Central Platform. A space is provided to the data provider to ingest the
raw data and make it UDL-ready.

Data Prep compute, which includes ADF and Databricks, will be provided in the respective provider space, with
access only for the data provider.
Ideally the data staging or prep layer is used only to validate/consolidate the data for cases where the data is
coming from a social network or third party (with no automated integration). Modification of data in the data prep
layer is allowed in exceptional cases where modifying the data is a must for consuming it.

ADF will be used to ingest the data into Data Prep Layer.

Databricks hosted in Data Prep layer will be used to make the data UDL ready. Validation and approved
modification can be done here.
Modification of the data is allowed only when there is a business justification for modification.
If the data is unusable without required modification/ Transformation
Modification/Transformation details are aligned with Data SME/expertise.
Modification/Transformation is automated.

Data ready for UDL ingestion should be present in UDL_Ready/ folder

Any data planned in the prep layer has to have a valid DC and DMR agreed and approved with the Data Expertise
and Data Architect teams.
Data validation, i.e. DQ after ingestion into staging, is the responsibility of the data staging owner. Only validated
data will be ingested into UDL from the UDL_Ready folder.

UDL is the only consumer for Data Prep Layer. Data cannot be shared from Data Prep Layer.

Folder structure will be managed as below


Level 1: Data_Provider (For Example : NAME_SC_Data)
Level 2: Data_Set_Name (For Example: Logistics)
Level 3: UDL_Raw & UDL_Ready
Level 4: Date(DD-MM-YYYY) for transaction data.
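
A minimal sketch of writing a validated file into the UDL_Ready folder of this structure using the provider SPN, assuming the azure-identity and azure-storage-file-datalake Python SDKs. The tenant, SPN, storage account, container and local file name are placeholders; in practice the SPN secret would be read from Key Vault.

```python
# Minimal sketch: after validation, the data provider's job writes the cleaned
# file into <Data_Provider>/<Data_Set_Name>/UDL_Ready/<DD-MM-YYYY>/. Names are placeholders.
from datetime import date
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

credential = ClientSecretCredential(
    tenant_id="<tenant-id>", client_id="<spn-app-id>", client_secret="<spn-secret-from-key-vault>"
)
service = DataLakeServiceClient("https://<dataprep-account>.dfs.core.windows.net", credential=credential)
fs = service.get_file_system_client("dataprep")                    # hypothetical container

path = f"NAME_SC_Data/Logistics/UDL_Ready/{date.today().strftime('%d-%m-%Y')}/logistics.csv"
file_client = fs.get_file_client(path)
with open("validated_logistics.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)                  # only validated data lands in UDL_Ready
```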

Data Preparation/Staging Architecture:


E2E Flow Diagram:

Resource Group Structure

Data prep layer is available only as part of UDL. Below is the resource group structure for the same.

1 Central Resource group (Shared Resource group as part of UDL for Staging Storage )
ADLS Gen2 : Raw and UDL ready data will be available in ADLS
Blob Storage: Landing zone for external and manual data sets.
Resource Group per Data Provider : Each data provider will be given a separate resource group to
manage the data preparation before ingestion into UDL.
Databricks (Read only access)
ADF ( Read and Write)


Section 3.6 - Self Service in PDS


Self-service is an approach to data analytics that enables business users to access data directly from the
data source at the required granularity, without going through IT-developed dashboards or reports.

Self Service Architecture in Azure:

PATTERN 1: SELF SERVICE CONNECTING TO AZURE ANALYSIS SERVICE

The reporting layer in the I&A Azure platform consists of tools like Power BI and Excel used by end users, with Azure Analysis
Services and SQL DW used as the backend for them. This pattern makes use of only Azure Analysis Services as the
backend layer for self-service. Self-service by users needs to go through multi-factor authentication.

Below are the steps for self-service from Azure Analysis Services:

Refresh the required data into Azure Analysis Services from SQL DW or from ADLS, based on the architecture.
If the self-service data is different from / more granular than the data required for the published report, then it is
suggested to use different AAS instances/cubes for self-service and the published report.
Create multiple AD groups to provide role-based authentication on the cube/data. Consult with the business to
align on the number of roles required based on the data in the AAS instances.
Enable MFA on the AD groups.
Unilever security mandates multi-factor authentication for all public endpoints. Since AAS
does not provide inbuilt MFA, the only way to apply MFA is by enabling it on the AD groups.
No users should be added directly to the AAS instance. All access should be provided through one of
the AD groups attached to the AAS instance.
Make sure MFA is enabled for all AD groups attached to the AAS instance.
Do not whitelist any IP on the AAS instance.
Once all AD groups attached to the AAS are MFA enabled, the firewall on the AAS can be turned off for
self-service via Excel or Power BI.
The project (Delivery Manager) owns the accountability for making sure all access is controlled
through AD groups and all AD groups are MFA enabled before disabling the firewall.

LIMITATIONS OF THE APPROACH:

Expensive, as all data required for self-service needs to be present in AAS. This means the project needs to use
either a huge AAS instance or multiple AAS instances.
Limitation on data size: the maximum data that can be stored in the largest AAS instance (S9) is 400 GB. For global
projects where the granular data is huge, not all data can fit into one cube. The project needs to go with multiple
AAS instances, which increases the duplication of data if the same data is required in multiple instances.

PATTERN 2: SELF SERVICE CONNECTING TO SQL DW


This pattern makes use of both SQL DW and AAS as the backend layers for self-service and published reports
respectively.

Self-service by users needs to go through multi-factor authentication.

Below are the steps for self-service from SQL DW:
Refresh only the aggregated data required into Azure Analysis Services from SQL DW or from ADLS, based on
the architecture.
All the granular data will reside in SQL DW.
Published reports are served from AAS.
For self-service, end users connect directly to SQL DW using MFA and their own Unilever
credentials.
Self-service can be achieved here via two patterns (a connection sketch follows the list below):
Direct query to SQL DW via Power BI reports
Create cubes as a non-persistent layer and connect to SQL DW via the cubes
SQL DW credentials will not be shared with the user.
Create multiple AD groups to provide role-based authentication on the SQL DW tables. Consult with the
business to align on the number of roles required based on the data in SQL DW.
Enable MFA on the AD groups.
Unilever security mandates multi-factor authentication for all public endpoints. Since SQL DW
does not provide inbuilt MFA, the only way to apply MFA is by enabling it on the AD groups.
No users should be added directly to the SQL DW instance. All access should be provided through one
of the AD groups attached to the SQL DW instance.
Make sure MFA is enabled for all AD groups attached to the SQL DW instance.
Do not whitelist any IP on the SQL DW instance.
Once all AD groups attached to the SQL DW are MFA enabled, the firewall on the SQL DW should be
turned off.
The project (Delivery Manager) owns the accountability for making sure all access is controlled
through AD groups and all AD groups are MFA enabled before disabling the firewall.
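
A minimal sketch of the direct connection from a user's own tool or script, assuming pyodbc with ODBC Driver 17+ for SQL Server: the Azure AD interactive authentication mode triggers the MFA prompt, so no SQL credentials are shared. The server, database, user and table names are placeholders.

```python
# Minimal sketch: self-service query against SQL DW with the user's own Azure AD
# identity and MFA (interactive prompt). Names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<sqldw-server>.database.windows.net,1433;"
    "Database=<sqldw-db>;"
    "Authentication=ActiveDirectoryInteractive;"   # triggers the AAD/MFA prompt
    "UID=<first.last>@unilever.com;"
)
cursor = conn.cursor()
cursor.execute("SELECT TOP 10 * FROM dbo.SalesAggregate")  # hypothetical table
for row in cursor.fetchall():
    print(row)
```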

LIMITATIONS OF THE APPROACH:

Though this pattern allows self-service on the most granular data of any size supported on SQL DW, it can
turn out to be an expensive solution if a large SQL DW is used and kept running 24/7.
It is suggested to go with < 1500 DWU for SQL DW.
Align with end users to keep the SQL DW running only during office hours to minimize the cost.
Limitation on concurrency: SQL DW supports limited concurrency. 2000 DWU supports only 42 concurrent
queries in the small resource class (RC). If the product requires a large number of end users to connect
concurrently and do self-service, then this may not be the right solution.
Performance limitation: performance may not be as good as connecting to AAS. The project team needs to
analyse the performance and align with end users accordingly.

PATTERN 3: SELF SERVICE CONNECTING TO SQL DB (EXCEPTIONAL PATTERN WHERE SQL DB IS USED IN E2E
ARCHITECTURE)


This pattern makes use of SQL DB as the backend layer for both self-service and published reports. This
pattern is suggested only when the data size is < 50 GB.

Self service from users needs to go through Multi Factor Authentication.

Below are the steps for self service from SQL DB.

All the data resides in SQL DB.


Published reports are served from SQL DB through a direct query mechanism.
For self-service, end users connect directly to SQL DB using MFA and their own Unilever
credentials.
SQL DB credentials will not be shared with the user.
Create multiple AD groups to provide role-based authentication on the SQL DB tables. Consult the business to
align on the number of roles required based on the data in SQL DB.
Enable MFA on the AD groups.
Unilever security mandates multi-factor authentication for all public endpoints. Since SQL DB
does not provide inbuilt MFA, the only way to apply MFA is by enabling it on the AD groups.
No users should be added directly to the SQL DB instance. All access should be provided through one
of the AD groups attached to the SQL DB instance.
Make sure MFA is enabled for all AD groups attached to the SQL DB instance.
Do not whitelist any IP on the SQL DB instance.
Once all AD groups attached to the SQL DB are MFA enabled, the firewall on the SQL DB can be turned
off.
The project (Delivery Manager) owns the accountability for making sure all access is controlled
through AD groups and all AD groups are MFA enabled before disabling the firewall.

Self Service Tools:

The I&A Tech Azure platform supports the below 2 tools for self-service:

Power BI Desktop
Excel

Approach in Azure

The I&A Tech Azure platform provides different ways to allow self-service:

Connect through RDS/Citrix


Connect directly from the user laptop

REMOTE DESKTOP SERVER (RDS)/ CITRIX:


RDS & Citrix are the common environments provided to host the different DevOps tools for the Azure platform. Citrix is
considered a secure environment, as the user is required to have a Unilever ID and go through MFA to log in to RDS
/Citrix.

RDS/Citrix keeps the data secure by restricting users from downloading the data onto their laptops. If a project
hosts restricted or PII data, the only approved method to connect to Azure components is over RDS/Citrix.

USER LAPTOP:

Accessing Azure using tools from the user laptop has certain restrictions. Tools that connect over a secured port and
with MFA authentication are allowed a direct connection from the user laptop.

As of now, only Excel and Power BI are the TDA-approved tools allowed for use from the user laptop. Any addition of
a new tool as a self-service tool has to go through the evaluation process and TDA approval for usage. Users can
access the Azure platform only through MFA. Projects have to make sure MFA is enabled on the Azure tools before
providing access to the end user.

Insights and Risks:

User onboarding to RDS/Citrix needs to be taken care of by the project as part of the project process, in order to use
RDS for self-service.
There is a risk of data being extracted by users onto their laptops and misused. Before providing self-service
access to the data from Excel and Power BI, the project owner needs to make sure the right users are getting access and
no restricted or sensitive data is shared.
The data owner and project owner are responsible/accountable for any data shared through self-service with end
users.
Data extraction using Excel can affect performance. If multiple users download data through Excel
over self-service, it can cause huge performance issues. Self-service should be limited to only the required users,
who are aware of the performance risk of downloading the data.
End users should be trained in the right self-service practices; this can reduce the performance issues.
Projects/end users should be made aware of the risks involved in users downloading the data onto their laptops,
and ensure that Information Security requirements are fulfilled.


Enable MFA on AD Group

Adding groups to CA Policy “AzureAnalysisservice-MFA”

1. Go to the change management tab in Remedy 8

Click on the tab “New change”

2. The below window opens up; by default Remedy 8 shows the location updated in inside.unilever.com.

Below is the screenshot of the mandatory fields that should be filled in before submitting the CRQ.

Mandatory fields: Service Categorization, Operational Categorization, Change Reason, Impact, Importance, Priority,
Risk Level.

Note: the CRQ Risk Level should always be Minor-1 for this activity.


On the change reason, once you click "Requirement Driven change", the below questions should be answered.

Backout-Plan :

01. Login to Portal.azure.com

02. Select Azure Active Directory

03. Click on Conditional Access

04. Select “AzureAnalysisservice-MFA” policy

05. Remove the newly added group

Development Plan : NA

Implementation Plan :


01. Login to Portal.azure.com

02. Select Azure Active Directory

03. Click on Conditional Access

04. Select “AzureAnalysisservice-MFA” policy

05. Add the newly created group

The remaining questions should be answered apart from the ones above.

Mention the scheduled Start Date & Stop Date

Note: Both the CRQ and the Task start and end Date should match.


Click on the Tasks tab and click on Relate; a new task will be created. Fill in the summary details and assign it to IT-
GL-Active Directory. (Please note the start and end dates of the CRQ should span a minimum of 2 days.)


Section 3.7 - Global or Country Setup


Pros and cons of Approaches

Description 1. All countries as 1. Each country Recommendation


one product as separate
Product

All countries share Every product will


the same Azure have a separate
Infrastructure and infrastructure
same objects

For example :
SQL DW: One SQL
DW all countries
AAS : One AAS
instance

Scalability Scalability of different SQL DW / DB No scalability issue Option 1


components Concurrency can be an as each country will
issue if the data and have its own
DataBricks : concurrent access underlying infra
Completely Scalable grows. setup
SQL DW : No Limit on
Data Size for columnar AAS can become an
store. issue as one AAS will
not be sufficient for
Concurrency limit is 128. large global projects,
Maximum concurrent which requires
Open Session is 1024. reporting data needs to
Concurrency slot depends be split between
on the resource group, if multiple AAS. If
Large 4 slots and if Extra splitting is not efficient
Large 1 Slot then duplication of data
can occur
AAS : AAS max size
allowed per region is
S9 supports 400 GB of
Memory.

Releases Today releases are done No flexibility in releases Flexibility of a any Option 2
per ITSG as code as all countries share country deployment
repository is associated same code base and or release can be If project team
with each Project or ITSG. release brach. done at any comes with a
point. Each country process to
Releases requires will have its own manage releases
controlled approach so branch and code and downtime is
that developers do not repository acceptable to
checkin things to countries, option
release branch when a 1 can be
release is planned for considered
different country.
Product team requires
a control over release
branch and check-in's.

Version 9.1 Published on 5th August 2020 Page 241 of 321


I&A Azure Solution Architecture Guidelines

Schema's and Model's


are different then no
downtime required.

Costing Cost on azure is PAY As Cost can reduce as Cost is high , as Option 1
(IAAS and You Go. Every component one environment is each country will
PAAS) will have a per hour cost used for Dev, QA, PPD have its own
associated with it. & Prod . environment and
Noumber of IAAS VM's components
IAAS Components are required reduces as
setup per country to meet countries are sharing
the security mandate of the environment
two level of authentication

Code Code repository is One code base for all No Issue as each Option 2
Managem mainatined at ITSG level. countries can become country has its own
ent an issue if not code branch.
managed right. Limitation could be
on not having a
Each country can central code branch.
have its own
Vendor Partner
developing the
code. Risk of
vendor partner
affecting each
others code.
One release
branch, hence the
releases between
the countries needs
to be managed by
project centralally.
Project needs to
have a central team
which manages
and coordinates the
releases between
the countries.
Features and
development's are
manged right so
that one country
changes doesnt
affect the other
country. This is a
big risk unless
there is a
governance
mechanism within
the project to
manage this

Cost Cost center is associcated Each country can Option 2 if each


Sharing at the resource group level have its own cost country has its
and Infrastructure or center associated own cost center.

Version 9.1 Published on 5th August 2020 Page 242 of 321


I&A Azure Solution Architecture Guidelines

resource group is created Only one consolidated with its resource


per ITSG. cost. Cost sharing group
between the countries
will be an issue

Infrastruct Infrastructure currently is If all countries exist in All countries will All 3 are feasible
ure provided per ITSG, as one underlying have its own ITSG
resource groups are infrastructure, a master and resource groups
created based on ITSG, ITSG is required for
and even costing is done which the cost and
on ITSG’s. resource groups can
be assigned

Code Repository will


be same with single
master branch.

Data Analytics/ Data science All data at once place , Analytics Option 1
Sharing can access any data from and analytics which /Datascience has to
for UDL or BDL. requires data from sit within the same
Analytics When it is product, cross country becomes country setup. So
analytics has to be within easier. multiple deployment
the product as cross of same analytics
connections between product might be
products is not allowed required.

Access Control
Current setup: Access control is defined at the project level. Each project has its own AD groups and end users are assigned to those AD groups.
Single shared setup: Row-level access cannot be defined on SQL DW or AAS. Since objects are shared, every country's users will get access to all the data available in the underlying AAS model or SQL DW database.
Per-country setup: Countries have their own AD groups.
Recommendation: Option 1. Option 2 only if there is no data restriction between the country users or the right level of controls is applied.

Service
Current setup: Service setup is managed per ITSG and per product.
Single shared setup: One service, as the ITSG is one.
Per-country setup: Each product as a separate service.
Recommendation: Option 2.

Notes*

If Prod and Non-Prod are set up with different options, for example Prod on Option 2 and Non-Prod on Option 3, then there will be additional complexity in the release process and the environment provisioning process. The right controls need to be put in place by the project to manage this situation.

Developer workstations can be utilised more efficiently by using bigger workstation VMs with a multi-user licence instead of two concurrent users per VM.


Section 3.8 - Job Management


Job management is one of the main concepts to be taken care of in all projects. Some of the concerns that we have today with job management in all layers are:

There is no centralized scheduling mechanism available for schedules.

Dependencies and multi-level dependencies are not handled very well, as ADF manages only time-based schedules.

If a job has to be executed on data arrival, there are many challenges, as the on-file-arrival feature is not straightforward and depends on the source systems.

If an action needs to be triggered on the successful completion of a job or of multiple jobs, there is no means to achieve that as of now.

No prescribed notification mechanism for the schedules/pipelines.

No prescribed reporting mechanism as of now.

If a job misses a scheduled SLA, there is currently no way to notify the users; it is either manual or through a dashboard.

As a solution to the above issues, and with some added features, there is a new framework which projects are encouraged to adopt based on their requirements.

Features of the Job Management Framework

This tool will act as a workflow engine for ADF scheduling and monitoring purposes.

This tool can replace the existing schedulers, which are currently not centralized.

This tool also serves as a template for all the ADF pipelines as per their schedules.

Modifications/updates can be done on the template instead of on individual ADF pipelines.

Multi-level pipeline dependencies are handled.

Email notifications for the status of the pipelines are incorporated:


Run status
Child pipeline not scheduled status if parent pipeline fails
Status when SLA elapsed for pipelines

Power BI report for run history can be made available

Tools used

Logic App for workflow design and notification

SQL DW/DB for backend data storage (SQL DB Suggested)

Framework Design


Framework workflow

The following workflow can be achieved by implementing the steps provided below. It can be highly customised and can be used for job schedules of varied frequency. It can also be used where a schedule is not in place and the job needs to be triggered on data arrival. If any project wants to implement a part of the framework rather than the framework in its entirety, that can be achieved as well.

Framework Jobs

Daily Job Population: Workflow to load the daily executable Jobs into Job_Schedule
Workflow uses Job Master, Job Exceptions and Job Adhoc Run for populating Job_Schedule table
Will run every day at 00:00 hours.
An entry has to be created for each execution. (Hourly job will have 24 entries)


Logic to be created for daily, hourly, weekly, monthly , by hour and by minute jobs.
Execution Engine
Execution engine will analyse which job can be executed. Analysis is done based on
Data availability
Dependency completion (using Job Run Status details)
Exclusion list in Job Schedule.
Priority of the job
Schedule of the job.
Execution engine will trigger the workflow; there is no scheduling done in ADF.
Execution engine will also make an entry into Job_Run_Status for each execution, with the Run_ID of the
job.
Execution engine will check the run status and, if the run has failed, will retry the job until the retry
threshold on the retry count is reached.
Execution engine will make sure to limit the number of parallel executions as required for the
application (a minimal sketch of this loop is provided after this list).
Job Status Update: Workflow to update the status of the jobs
Workflow will look for Job completion from ADF logs, using the ADF run_id
Updates the Job Status on completion of the job with status, error message etc.
Source Data availability check
Jobs configured based on source data availability can only be run when data is available.
This job will check the availability of data in Source.
Source could be MCS table of UDL
Actual source systems like Blob (Event Trigger), File Share /SFTP etc.
Finding the source data availability is dependent on the source system. Custom logic is
required for each source.
Refer the section on Job Triggering in ADF for more details.
Notification Workflow
Checks for SLA misses and sends a notification to the configured list of users
Sends notifications for the success and failure of jobs as well
Ad-hoc Job Run: Job to look for any ad-hoc job run configuration and add it into Job_Schedule.
Should check for job dependency before adding a job into Job_Schedule.
Should make sure a job is added again into Job_Schedule when:
The existing job is complete (succeeded/failed)
A new job is to be added
Improvements to be looked into:
How to stop the long running jobs
Remedy Integration
Service Bus Integration for Subscription of notification by dependent PDS/Systems.
Extract business run date from the data and capture in the table.
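
To make the Execution Engine behaviour above more concrete, the following is a minimal Python sketch, not the framework implementation itself. It assumes the Job_Schedule, Job_Master, Job_Dependency_Mapping and Job_Run_Status tables from the data model in the next section, a SQL DB backend reachable via pyodbc, and the azure-mgmt-datafactory SDK for triggering pipelines; connection strings, resource names, flag value conventions and the retry/parallelism settings are placeholders.

```python
# Minimal sketch of the Execution Engine loop (illustrative only).
# Assumptions: Job_* tables as per the framework data model, Exclusion_Flag uses 0/1,
# and the deployment/application SPN has permission to trigger ADF pipelines.
import pyodbc
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SQL_CONN_STR = "Driver={ODBC Driver 17 for SQL Server};Server=<server>;Database=<db>;"  # placeholder
SUBSCRIPTION_ID, RESOURCE_GROUP, FACTORY = "<sub>", "<rg>", "<adf-name>"                # placeholders
MAX_PARALLEL_RUNS = 5                                                                   # configurable

credential = ClientSecretCredential("<tenant>", "<client-id>", "<client-secret>")       # SPN
adf = DataFactoryManagementClient(credential, SUBSCRIPTION_ID)

def run_execution_engine():
    with pyodbc.connect(SQL_CONN_STR) as conn:
        cur = conn.cursor()
        # Pick eligible jobs: queued, due, not excluded, data available (or no check needed),
        # and with all dependencies completed successfully, ordered by priority.
        cur.execute("""
            SELECT TOP (?) s.Job_Schedule_Id, s.Job_ID, m.Pipeline_Name
            FROM Job_Schedule s
            JOIN Job_Master m ON m.Job_ID = s.Job_ID
            WHERE s.Job_status = 'Queued'
              AND s.Exclusion_Flag = 0
              AND s.Schedule_Date_time <= GETDATE()
              AND s.Data_Availability_Flag IN (1, 2)
              AND NOT EXISTS (
                    SELECT 1 FROM Job_Dependency_Mapping d
                    LEFT JOIN Job_Run_Status r
                           ON r.Job_ID = d.Dependent_job_ID AND r.Job_Run_Status = 'Succeeded'
                    WHERE d.Job_ID = s.Job_ID AND r.Job_run_id IS NULL)
            ORDER BY m.Schedule_Priority_Level
        """, MAX_PARALLEL_RUNS)
        for schedule_id, job_id, pipeline_name in cur.fetchall():
            # Trigger the ADF pipeline directly; no time trigger is defined in ADF itself.
            run = adf.pipelines.create_run(RESOURCE_GROUP, FACTORY, pipeline_name)
            cur.execute(
                "INSERT INTO Job_Run_Status (Job_schedule_id, Job_ID, ADF_Run_ID, Job_Run_Status) "
                "VALUES (?, ?, ?, 'In Progress')",
                schedule_id, job_id, run.run_id)
            cur.execute("UPDATE Job_Schedule SET Job_status = 'In Progress' WHERE Job_Schedule_Id = ?",
                        schedule_id)
        conn.commit()
```

The same loop can be re-run on a timer (for example from a Logic App) so that retries and the parallel-run limit are enforced centrally rather than inside individual pipelines.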

Framework Data Model

Please find the attached framework data model. If any product wants to implement this framework in part or in its entirety, then this data model needs to be in place. The team can choose which tables to use based on the use case, but the table structures and column names should be the same as given below.


Job_Master (metadata table: captures the metadata for all jobs in the application)
Job_ID: Unique ID to identify a job.
Pipeline_Name: Name of the ADF pipeline.
Job_Execution_Start_Datetime: Scheduled start time for the pipeline run; populated if the pipeline is scheduled - this is the time from which the pipeline will first be scheduled to run.
Schedule_Priority_Level: Priority for each job. This is considered when the framework runs only N jobs at a time; priority decides which jobs are picked up first.
Frequency_of_execution: Frequency of pipeline run - Daily, Monthly, Weekly, AdHoc, Hourly, By Hour, By Minute.
Job_execution_Day: Daily/Hourly - 1 (every day and every hour); ByHour/ByMinute - once in how many hours/minutes; AdHoc - specify the day of month to execute; Weekly - day of week (day 1 (Sun), 2, 3); Monthly - day of month (1st, 2nd).
Job_execution_time: Hour and minute. Example: 1st day of month at 11:00 AM.
Job_exclusion_Day: Daily/Hourly - which date or hour; ByHour/ByMinute - which hour/minute; Weekly - week of the year; Monthly - month of the year.
Job_input_path: List of input data set folder paths (only for information, not used in the job).
Job_output_path: Output folder path.
Notification type: What type of notification needs to be sent, e.g. email, no notification.
SLA_Time: Defined SLA timeline by when the pipeline run should be successfully completed. This considers the frequency as well; for example, for daily - every day by 11:00 AM the job should complete. If delayed, a notification will be sent to the required people.
Dependent_Pipeline_Flag: Indicator to specify if the pipeline is dependent on any other pipeline (Yes/No).
Data_Availability_Dependency_Flag: Denotes whether this pipeline is dependent on data availability or is scheduled (Yes/No).
Enabled_Ind: Record active flag.

Job_Dependency_Mapping (metadata table: captures the dependency between the pipelines and the objects)
Job_Dependency_Instance_Id: Unique row indicator for the dependency.
Job_ID: Job_ID from the Job_Master table.
Dependent_job_ID: Dependent on (Job ID from the master table).
Dependency_Count: Number of dependent child jobs.
Previous_execution_Dependency: Flag for dependency on the previous execution, e.g. day 1 should be complete for day 2 execution (Yes/No).

Job_Schedule (schedule table where schedules by date are recorded; populated every day at 00:00 hours by the workflow engine for the runs required for the day)
Job_Schedule_Id: Unique row indicator for a pipeline schedule.
Job_Schedule_run_Date: Unique date for the run (dd-mm-yyyy).
Job_ID: Job_ID from Job_Master.
Schedule_Date_time: Actual run date and time for the job, extracted and populated from the metadata table. An hourly job will have 24 entries; a By Minute job will have the respective number of entries.
Exclusion_Flag: In case a job needs to be stopped from running for the day, update the flag and the Logic App will not pick up this job.
Data_Availability_Flag: Indicator to specify if data is available in the source location: 0 - data not available; 1 - data is available; 2 - no data availability to be checked.
Retry_count: A job is attempted for retry only for 3 counts; this should be a configurable parameter.
Job_status: Status of the pipeline run - Queued, In Progress, Succeeded, Failed.
Pipeline_Schedule_Text: Comments, if any.

Job_Run_Status (captures the run details of the pipeline run; the table is automatically updated/inserted based on the run details from the workflow engine)
Job_run_id: Unique row indicator for the job run.
Run_Date_Time: dd-mm-yyyy-hh24:MI; job execution schedule, based on frequency.
Job_schedule_id: Job_Schedule_Id from the Job_Schedule table.
Job_ID: Job_ID from Job_Master.
ADF_Run_ID: Run ID from ADF.
Execution_Start_time: Start time of the pipeline run.
Execution_End_time: End time of the pipeline run.
Run_Duration: Duration of the pipeline run.
Notification Flag: Indicator to specify if a notification mail has been sent for the pipeline run.
Last_update_Time: Last status update time.
Job_Run_Status: Status of the pipeline run - In Progress, Succeeded, Failed.
Pipeline_Run_Status_Message: Error message.

Pipeline_Schedule_Exceptions (any job exceptions are to be added here; for example, Job A should not be run on a particular date, or due to some data issues at source the project doesn't want to run a job till it is fixed)
Schedule_Exception_Instance_Id: Unique row indicator for an exception.
Job_ID: Unique identifier.
Exception_Created_DateTime: Exception created date.
Exception_Created_By: Exception created by.
Exception_Start_DateTime: Start date and time of the exception.
Exception_End_DateTime: End date and time of the exception.
Enabled_Ind: Record active flag.
Pipeline_Schedule_Exceptions_Text: Comments, if any.

Pipeline_Schedule_Adhoc (any ad-hoc runs to be executed; for example, if a job needs to be rerun due to failures or data fixes at source, an entry needs to be made in this table)
Pipeline_Schedule_Adhoc_Instance_Id: Unique row indicator for an ad-hoc run.
Job_ID: From Job_Master.
Adhoc_Schedule_Date: Date for which the ad-hoc schedule needs to be run.
Adhoc_Schedule_Time: Time when the ad-hoc schedule needs to be run.
Adhoc_Schedule_Created_DateTime: Date the entry was created.
Adhoc_Schedule_Created_By: Created by.
Enabled_Ind: Record active flag.
Pipeline_Schedule_Adhoc_Text: Comments, if any.

Job_Log_table
Job_log_ID: Unique ID for the table.
WorkFlow_JOB_ID: Each workflow job is referred to through an id.
WorkFlow_Job_Name: Job name (Daily Job Population, Execution Engine, Notification Job).
WorkFlow_Job_Step: Developers can add a step number for each processing step where a log needs to be generated to track some information.
WorkFlow_Job_Log_Details: Text to identify the workflow details.

Good to Have - Detailed Activity and object level details

Pipeline_Activitiy_Master (captures activity details)
Activity_Instance_Id: Unique row indicator; should be the activity id in this case.
Activity_ID: ADF activity ID.
Pipeline_Instance_ID: Unique row indicator which is for a pipeline id.
Activity_Type: Lookup Activity/Notebook.
Activity_Name: Name of the activity.

Activity_Run_Status
Activity_Run_Instance_Id: Unique row indicator; should be the activity id in this case.
Activity_Run_Start_DateTime: Start time for the parent object/activity run.
Activity_Run_End_DateTime: End time for the parent object/activity run.
Activity Status: Status of the activity run - Queued, In Progress, Succeeded, Failed.
Activity_Run_Text: Other details to be captured, if any.

Activity_Object_Master
Object_Instance_Id: Unique row indicator; should be the object id in this case.
Activity_Instance_Id: Activity instance id to which an object is mapped.
Object_Name: Name of the object.
Function_Name: Function name (Marketing/SC/CB/…).
Data_Source: Data source of the object, e.g. Nelson, Kantar, SAP BW.
Notebook_Path: Notebook path of the object.
Object_Type: Object type - transaction, reference.
Object_Extraction_Type: Extraction type, e.g. full, delta.

Object_Dependency_Mapping (captures the dependency between the objects)
Object_Mapping_Instance_Id: Unique row indicator.
Parent_Object_Instance_ID: Unique identifier for the parent object.
Child_Object_Instance_ID: Unique identifier for the child object.
Schedule Priority Level: Schedule priority level; dependency levels can be 1, 2, 3 etc.
Object_Mapping_Text: Any other details of the object mapping.

Object_Run_Status
Object_Run_Instance_Id: Unique id for the object run.
Object_Path: Actual path of the object picked for this run, e.g. …Country\Function\Object\Year\Month\Day\Hour…
Object_Run_Start_DateTime: Start time for the parent object/activity run.
Object_Run_End_DateTime: End time for the parent object/activity run.
Object Status: Status of the activity run - Queued, In Progress, Succeeded, Failed.
Object_Run_Text: Other details to be captured, if any.
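
As an illustration of how the Daily Job Population workflow could expand Job_Master rows into per-execution Job_Schedule entries (for example, 24 entries for an hourly job), a minimal Python sketch is shown below. Only the table and column names come from the model above; the field-parsing conventions and example values are assumptions, and writing the rows to the SQL backend is omitted.

```python
# Illustrative sketch of the Daily Job Population step: expanding Job_Master rows
# into per-execution Job_Schedule entries for one calendar day.
from datetime import datetime, timedelta

def build_schedule_entries(job, run_date):
    """Return the Schedule_Date_time values required for one Job_Master row on run_date."""
    start = datetime.combine(run_date, datetime.min.time())
    freq = job["Frequency_of_execution"]
    if freq == "Hourly":
        return [start + timedelta(hours=h) for h in range(24)]        # hourly job: 24 entries
    if freq == "By Minute":
        step = job["Job_execution_Day"]                                # "once in how many minutes"
        return [start + timedelta(minutes=m) for m in range(0, 24 * 60, step)]
    if freq == "Daily":
        hh, mm = map(int, job["Job_execution_time"].split(":"))        # e.g. "11:00"
        return [start.replace(hour=hh, minute=mm)]
    if freq == "Weekly" and start.isoweekday() % 7 + 1 == job["Job_execution_Day"]:
        hh, mm = map(int, job["Job_execution_time"].split(":"))        # day 1 = Sunday
        return [start.replace(hour=hh, minute=mm)]
    return []                                                          # Monthly/AdHoc handled similarly

# Example: an hourly job produces 24 Job_Schedule entries for the day.
hourly_job = {"Job_ID": 42, "Frequency_of_execution": "Hourly", "Job_execution_Day": 1}
print(len(build_schedule_entries(hourly_job, datetime.utcnow().date())))  # 24
```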


Section 3.9 - Data Integration Architecture


When it comes to big data, there are two main ways of data integration :

Batch-based data integration ( Batch and Micro Batch)


Real-time integration

Real-time data integration is the idea of processing information the moment it is obtained. In contrast, batch-based integration methods involve storing all the data received until a certain amount is collected and then processing it as a batch.

Batch Integration:

Batch integration supports data in batches, which includes data refreshes hourly, daily, weekly, monthly, yearly and ad hoc.

Some of the consideration/practice for batch integration.

Bring history data (data in full) as per the requirement only once.
Bring only incremental data in agreed batches.
Use Databricks Delta for update, delete and insert (a sketch is provided after this list).
Process the data only for the modified records.
Keep the data in Delta Parquet format for easy updates.
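
As an illustration of the Databricks Delta practice above, the following is a minimal PySpark sketch of upserting an incremental batch into a Delta table so that only modified records are processed. The paths, the order_id business key and the change_type delete marker are hypothetical, and 'spark' is the Databricks-provided session.

```python
# Minimal Databricks sketch (PySpark): merge an incremental batch into a Delta table.
from delta.tables import DeltaTable

incremental_df = spark.read.parquet("/mnt/landing/sales/2020/08/05/")   # today's delta only

target = DeltaTable.forPath(spark, "/mnt/bdl/sales_delta")              # data kept in Delta format

(target.alias("t")
       .merge(incremental_df.alias("s"), "t.order_id = s.order_id")     # business key join
       .whenMatchedDelete(condition="s.change_type = 'D'")              # deletes flagged in source
       .whenMatchedUpdateAll()                                          # updates
       .whenNotMatchedInsertAll()                                       # new records
       .execute())
```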

Micro Batch / Near Real Time:

Micro-batch integration is intended for getting data more frequently than the batch. At the moment micro-batch support is in 15-minute batches.

Some of the consideration/practice for Micro- batch integration.

Bring history data (data in full) as per the requirement only once.


Bring only incremental data in agreed batches.


Make the micro batch available for consumption from the Micro-batch landing, before any processing is
done.
Do daily consolidation / processing of micro batches into Daily batch (if the same data is required for
usecases as daily delta)

Near real time/Micro batch integration : Based on the requirement data can be consumed by BDL /PDS applications
as shown below

ARCHITECTURE FOR MICRO-BATCH CONSUMPTION TO BDL AND PDS:

There are two approaches which can be used to achieve near-real-time requirements. They hold good if the data volume is low and minimal or no transformations are involved.

Approach 1 - Reading the data using Spark Structured Streaming.
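
A minimal sketch of this approach is shown below, assuming micro-batch files landing as CSV in an assumed folder and a Delta target; the paths, schema and trigger interval are placeholders and 'spark' is the Databricks-provided session.

```python
# Minimal sketch of Approach 1: consuming micro-batch files with Spark Structured
# Streaming and landing them as Delta.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
])

stream_df = (spark.readStream
                  .schema(schema)                       # schema must be supplied for file streams
                  .format("csv")
                  .load("/mnt/udl/microbatch/sales/"))  # micro-batch landing folder

(stream_df.writeStream
          .format("delta")
          .option("checkpointLocation", "/mnt/bdl/_checkpoints/sales")
          .trigger(processingTime="15 minutes")         # aligned with the 15-minute micro batches
          .outputMode("append")
          .start("/mnt/bdl/sales_delta"))
```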


Approach 2 - Reading the data using a Logic App to establish the dependency on file arrival.

Best Practices:

Use cluster pools or interactive clusters based on the size of the data for micro batches. A job cluster takes a minimum of 4 minutes to spin up.
Use single-pass processing instead of writing to and reading from the underlying storage.
Avoid multiple layers/stages if the data is to be processed quickly.

Real time Streaming:


Streaming Ingestion Technologies

HDInsight Kafka can be used to stream data into the Data Lake and expose it for further processing in UDL

Stream Analytics starts with a source of streaming data. The data can be ingested into Azure from a device
using an Azure event hub or IoT hub. Preferred pattern for streaming from sources which are IoT or event
streams from connected devices, services, applications

Internal Data can be streamed using IoT Hub and Stream Analytics

External Data can be streamed using Event Hubs and Stream Analytics

Architectural Patterns

NRT Streaming – Every 15 minutes/One Hour and processed immediately – Only where needed

Lambda – Data fed in both Batch Layer and Speed layer. Speed layer will compute real time views while
Batch layer will compute batch views at regular interval. Combination of both covers all needs

Kappa – Processing data as a stream and then processed data served for queries

Suitability of the data for Streaming

Master data streaming

Only additive delta can be streamed. Any other complex delta mechanisms may not be suitable for streaming

Unpredictable incoming streaming patterns like out of sequence and late events can be streamed but only
with complex logic while processing it further

High scale and continuous streams of data

Skewed data with unpredictable streams cannot be streamed as managing latency and throughput will be
difficult

External lookup while streaming is a memory consuming operation and hence needs additional mechanisms
like cache reference data, additional memory allocation, etc to enable effective streaming


Section 4 - Information security


As part of Information & Analytics, products work with a wide variety of Unilever data. This data needs to be protected, as should it fall into the wrong hands it could have a negative impact on our business.

Information Classification

There are different information classification standards and each standard requires a different level of protection.

Public Information is information that is available to the general public (i.e. already in the public domain). Public
information may be freely distributed without risk of harm to Unilever.

Internal Information is non-public, proprietary Unilever information that is for undertaking our business processes
and operational activities and where the unauthorized disclosure, modification or destruction is not expected to
have a serious impact to any part of Unilever.

Confidential Information is non-public, proprietary Unilever information where the unauthorized disclosure,
modification or destruction could seriously impact a part of the Unilever organisation (e.g. country, brand, function).

Confidential Personal data is defined as information which can be used to directly or indirectly identify an
individual.

Confidential Sensitive Personal data is any personal data which has the potential to be used for discriminatory,
oppressive, or prejudicial purposes.

Restricted Information is information which the Group Secretary has classified as Restricted Information because it
is highly sensitive to Unilever for commercial, legal and/or regulatory reasons.

When to Use Cryptography


Data At Rest

Unilever information with a classification of Internal MUST be protected using disk, file or database encryption.
Unilever information with a classification of Confidential or above MUST be protected using disk, file or
database encryption.
Encryption technologies may be applied at the physical or logical storage volume and may be
managed by Unilever or by a 3rd party service provider
For Restricted and Sensitive Personal Data, controls MUST ensure that only authorised individuals
can access data and MUST fail closed.
For Restricted and Sensitive Personal Data, annual review of the effectiveness of these controls
MUST take place
All mass storage devices used to store Internal information or above, MUST be protected from accidental
information disclosure (e.g. due to theft of the device) by use of encryption technology.
The encryption method applied MUST be of AES 256bit or equivalent/higher and MUST either provide
file level protection (if stored on an otherwise unencrypted volume) or encrypt the entire storage
volume.

Data In Transit

Internal information MUST be protected during transmission whenever source and destination are in different
security resource group.
Confidential information MUST be protected during transmission whenever source and destination are in
different security resource group.
Restricted or Sensitive Personal data MUST be encrypted whenever in transit. User authentication
credentials MUST always be protected regardless of where they are being transmitted.
Acceptable methods for encrypting data in transit include:


Transport Layer Security (TLS 1.2 or above)


Virtual Private Networks
Data transmitted over email MUST be secured using cryptographically strong email encryption tools

References:

INFORMATION CLASSIFICATION STANDARD


CRYPTOGRAPHY TECHNICAL SECURITY STANDARD


Section 4.1 - Environment and data access management


This page describes the fundamental operating model for user & data access permission on Azure environment.

Environment Access

Permissions, Roles and Security Groups

Access to resource groups, components and data is controlled using custom Azure roles granted to Azure Active Directory security groups. Since Landscape creates the environments and developers write and deploy code into their environments, developers and DevOps personnel do not need Contributor permission on the application resource groups. Developers and DevOps members cannot provision or edit Azure resources.

Non Functional Custom Azure Roles

There are five custom roles. Two are used to grant access to resource groups; the remaining three control access to data. For data access, note that the type of storage component does not matter: for example, if you are a member of the data reader group for a given environment, you will be able to read data in SQL DW, SQL DB, ADLS and Blob.

Custom Azure Roles (Non Functional)

Role Name: InA Tech App Reader
Description: InA equivalent to the regular Azure 'Reader' role.
Overview: Made up of all actions from the standard 'Reader' role plus any permissions from 'Data Factory Contributor' that relate to monitoring and controlling pipelines. It excludes permissions that enable authoring.
Security Group Name: SEC-ES-DA-<env>-<ITSG>-app-reader

Role Name: InA Tech App Contributor
Description: InA equivalent to the regular Azure 'Contributor' role.
Overview: This is a cut-down version of the regular 'Contributor' role. It does not enable resources to be stood up. Instead it gives 'Reader' on the resource group plus 'Data Factory Contributor'.
Security Group Name: SEC-ES-DA-<env>-<ITSG>-app-reader

Role Name: InA Tech Data Reader
Description: InA custom role for reading data in any component.
Overview: Read access to all data via its public end point. Does not include portal access.
Security Group Name: SEC-ES-DA-<env>-<ITSG>-data-reader

Role Name: InA Tech Data Writer
Description: InA custom role for writing data in any component.
Overview: Write access to all data via its public end point. Does not include portal access.
Security Group Name: SEC-ES-DA-<env>-<ITSG>-data-writer

Role Name: InA Tech Data Owner
Description: InA custom role for reading, writing and controlling access to data.
Overview: Read and write access to all data via its public end point. Ability to set ADLS folder permissions. Must be granted in addition to 'InA Tech App Reader/Contributor' in order to provide portal access.
Security Group Name: SEC-ES-DA-<env>-<ITSG>-data-owner

Access Control (Functional Roles)

The functional roles below are used in the Dublin & Amsterdam operating model; a user is part of a single role only.

Developer (New Foundation): This user group is for the developers. A developer's permissions diminish as you move into each higher environment. Security group: SEC-ES-DA-d-<DevITSG>-azure-developer

Tester (New Foundation): This user group is for the testers. Testers have read-only access in QA but have no access to the Pre-Prod and production environments. Security group: SEC-ES-DA-d-<DevITSG>-azure-tester

DevOps (New Foundation): DevOps users combine the permissions required to develop, test and support the application once it has been released to production. Generally, this means access is granted in all environments. Security group: SEC-ES-DA-p-<ProdITSG>-azure-devops

Developer (Old Foundation): This user group is for the developers. A developer's permissions diminish as you move into each higher environment. Security group: SEC-ES-DA-d-<DevITSG>-Developer

Tester (Old Foundation): This user group is for the testers. Testers have read-only access in QA but have no access to the Pre-Prod and production environments. Security group: SEC-ES-DA-d-<DevITSG>-Tester

DevOps (Old Foundation): DevOps users combine the permissions required to develop, test and support the application once it has been released to production. Generally, this means access is granted in all environments. Security group: SEC-ES-DA-p-<ProdITSG>-Support

Support Level 1 (Old Foundation): This group is for the users who provide support during production releases. Users get execute and data loading permission for the Prod environment. Security group: SEC-ES-DA-p-<ProdITSG>-supportlevel1

Application SPN: The application SPN has data reader and writer permissions but it cannot read Key Vault. Wherever possible, ADF linked services use the application SPN when connecting to data. Name: svc-b-da-<env>-<ITSG>-ina-aadprincipal

Deployment SPN: The deployment SPN has elevated permissions and has full access to data, components and Key Vault. Name: svc-b-da-<env>-<ITSG>-ina-deployment

Old Foundation Access Model

Environment Dev QA PPD UAT Prod

Developer Contributor Reader, Execute n/a Reader, Execute n/a

Tester Reader Reader, Execute n/a Reader, Execute n/a

Support Contributor Contributor Contributor Contributor Contributor

SupportLevel1 n/a Reader, Execute Reader, Execute Reader, Execute Reader, Execute

New Foundation Access Model

Developer: Dev: InA Tech App Contributor, InA Tech Data Owner; QA*: InA Tech App Reader, InA Tech Data Reader; UAT*: InA Tech App Reader, InA Tech Data Reader; PPD: n/a; Prod*: n/a; Experiment: InA Tech App Contributor, InA Tech Data Owner

Tester: Dev: InA Tech App Reader, InA Tech Data Reader; QA*: InA Tech App Reader; UAT*: n/a; PPD: n/a; Prod*: n/a; Experiment: n/a

DevOps: Dev: InA Tech App Contributor, InA Tech Data Owner; QA*: InA Tech App Reader, InA Tech Data Reader; UAT*: InA Tech App Reader, InA Tech Data Writer; PPD: InA Tech App Contributor, InA Tech Data Owner; Prod*: InA Tech App Reader, InA Tech Data Reader; Experiment: n/a

Application SPN: InA Tech Data Writer in Dev, QA*, UAT*, PPD, Prod* and Experiment

Deployment SPN: Contributor in Dev, QA*, UAT*, PPD, Prod* and Experiment

Data Scientist: InA Tech App Contributor and InA Tech Data Owner in Dev, QA*, UAT*, PPD, Prod* and Experiment

*A locked down environment. Deployment via CICD only

Permissions Comparison Between Old and New Foundation

Developer: Old Foundation: Full access on Dev and Read & Execute access on QA, UAT. New Foundation: Full access on Dev and Read access on QA, UAT.

Tester: Old Foundation: Read access on Dev and Read & Execute access on QA & UAT. New Foundation: Read access on Dev and QA only.

DevOps: Old Foundation: Full access on all environments. New Foundation: Full access on Dev and Reader access on the rest.

AD group naming convention (custom roles)

Custom AD groups names should follow the operating model environment naming convention with a descriptive
name relevant to the access provided, requested and managed by the AD or UAM team using self-service Remedy
offerings (IT, Technical, Active Directory, Groups, …)

Environment: D = Development; Q = QA; B = UAT; U = Pre-Prod; P = Prod

For example, a typical custom AD group to grant business user access to a Production Power BI report reading
data from AAS is : SEC-ES-DA-P-<ITSG>-FinancePBIReadProd


UDL & BDL Access Management

UDL and BDL consist of data which is shareable. There are many business use cases, hosted both within the Unilever landscape and outside it, which require this data. Refer to the "Distribution Strategy Design Pattern" for the different methods and processes involved in sharing the data.

Data Access : Access management best practices in UDL & BDL

The data access control process for ADLS (UDL & BDL) is a three-step process:

Step1: Create AD groups.


Step2: Assigning AD groups on required folders in ADLS.
Step 3: Adding users & SPN to AD group.

Make sure to give "Read only" access to the consumers of the data. Only the respective platform's internal processes should have Write permission on the data lake.

UDL:

Step 1 : AD Groups are to be decided by Data SME's/Data owner. It can be based on the below points
Different source system
Data access groups can be defined at any level in the folder hierarchy; the data owner should make an informed decision on how to group the data in order to avoid having to govern a large number of AD groups.
Classification of the data (Restricted, Sensitive, Confidential, Internal)
For example , restricted data access groups are created as separate groups.
Suggestion is to keep separate AD groups for Restricted and Non Restricted data access and
for each business function
Grouping of different commonly used data set. For example: All supply chain data which is mostly
internal or confidential.
Based on the country wise restriction. For example: Indonesia and India finance data is restricted
where as same doesn’t apply to other countries like SEAA or Namet.
SME needs to analyse each data set ingested in to UDL for each function and accordingly arrive at
AD Groups for data access control.
Step 2: Once the AD groups and underlying folders are decided, they should be forwarded to the Landscape team to attach them to the respective folders. No super-user access is given on the ADLS.
Step 3: The process to add users and SPNs (for read-only purposes) into an AD group has to be approved/owned by the Data SME.

BDL:

Step 1 is owned by the BDL DevOps team/Data SME. The Data SME needs to decide the different AD groups required to give different levels of access on the BDL data.
For example, for the SC BDL, the team building it needs to liaise with the IDAM or Landscape team to create whatever AD groups they need based on the level of granularity of access they wish to provide and manage.
The SC BDL team could decide to have one AD group for the whole BDL, one per data set, or something even more granular, depending on how they want to manage access.
Step 2 is owned by the I&A landscape team (under Jobby) to make sure AD groups are attached to right
base folders, based on the output of (1)- No super user access given on the ADLS.
Step3 is owned by the project devops team again to ensure that the people who need access, have that
access, by adding them into right AD Groups. All access here should be read only.

PDS:


No user access is given on PDS ADLS. The only connection allowed on a PDS for users is to AAS, for Power BI self-service, via an MFA-enabled AD group.


Section 4.2 - Encrypting Data-At-Rest


Introduction

This page methods approved for encryption of data-at-Rest for various azure components that are used for storing
data.

Based on the Encryption-at-Rest options that each of the components provide, security will be involved to identify
which components should be used for hosting restricted data.

Azure data lake storage gen1

Component Name: Azure Data Lake Store Gen 1 (two key-management options)
Type of encryption (Disk/File/Column/Row Level): Storage (both options)
Methods of Encryption: Microsoft-provided keys - AES 256 bit; or customer keys - AES 256 bit
Encryption Key Store: Microsoft key store (Microsoft-provided keys); Azure Key Vault (customer keys)
Key Rotation Responsibility: Microsoft (Microsoft-provided keys); Customer (customer keys)
Decryption Methods & performance impact: Transparent encryption; no impact on performance (both options)
Security Agreement: Approved (both options)

Azure storage account

Component Name: Blob storage (two key-management options)
Type of encryption (Disk/File/Column/Row Level): Storage (both options)
Methods of Encryption Supported: Microsoft-provided keys - AES 256 bit; or customer-managed key - RSA 2048
Encryption Key Store: Microsoft key store (Microsoft-provided keys); Azure Key Vault (customer-managed keys)
Key Rotation Responsibility: Microsoft (Microsoft-provided keys); Customer (customer-managed keys)
Decryption Methods & performance impact: Transparent encryption; no impact on performance (both options)
Security Agreement: Approved (both options)

Azure SQL data warehouse

Component Name: Azure SQL DW (Synapse)
Type of encryption (Disk/File/Column/Row Level): Disk
Methods of Encryption Supported: Transparent Data Encryption - AES 256
Encryption Key Store: Microsoft-generated keys: database boot record; customer-managed keys: stored in Azure Key Vault
Decryption Methods & performance impact: Decrypted by the TDE protector; no impact on performance
Key Rotation Responsibility: SQL managed: Microsoft; BYOK: Customer/Client
Security Agreement: Approved

Azure SQL database

Component Name: Azure SQL DB
Type of encryption (Disk/File/Column/Row Level): Disk
Methods of Encryption Supported: Transparent Data Encryption - AES 256
Encryption Key Store: Microsoft-generated keys: database boot record; customer-managed keys: stored in Azure Key Vault
Decryption Methods & performance impact: Decrypted by the TDE protector; no impact on performance
Key Rotation Responsibility: SQL managed: Microsoft; BYOK: Customer/Client
Security Agreement: Approved

SQL MI

Component Name: Azure SQL MI
Type of encryption (Disk/File/Column/Row Level): Disk, Column
Methods of Encryption Supported: Disk: Transparent Data Encryption - AES 256; Column level: Always Encrypted - AES 256
Encryption Key Store: Disk: Microsoft-generated keys in the database boot record, or customer-managed keys stored in Azure Key Vault. Column: must be stored in a trusted key store (Windows Certificate Store, Azure Key Vault) by the client
Decryption Methods & performance impact: Disk: decrypted by the TDE protector, no impact on performance. Column: performance impact on the client applications (not on the database server)
Key Rotation Responsibility: SQL managed: Microsoft; BYOK: Customer/Client
Security Agreement: Approved

AAS

Component Name: AAS
Type of encryption (Disk/File/Column/Row Level): Disk
Methods of Encryption Supported: Server-side encryption - AES 256
Encryption Key Store: Azure managed
Security Agreement: Approved

Coming soon

Databricks internal storage (delta tables)


Azure cache for redis
Azure cognitive search


Section 4.3 - Encrypting Data-in-Transit


Introduction

This page describes protection and encryption methods used for data in transit.

AZURE DATA LAKE STORAGE GEN1

For data in transit, Data Lake Storage Gen1 uses the industry-standard Transport Layer Security (TLS 1.2) protocol
to secure data over the network.

AZURE STORAGE ACCOUNT

Azure storage account uses TLS 1.2 on public HTTPs endpoints. TLS 1.0 and TLS 1.1 are supported for backward
compatibility.

AZURE SQL DATA WAREHOUSE

Refer to Security on SQL Database/DW

AZURE SQL DATABASE

Azure SQL Database Always Encrypted is a data encryption technology that helps protect sensitive data at rest on the server, during transit between client and server, and while the data is in use, ensuring that sensitive data never appears as plaintext inside the database system.

SQL MI

Azure SQL Database Always Encrypted is a data encryption technology that helps protect sensitive data at rest on the server, during transit between client and server, and while the data is in use, ensuring that sensitive data never appears as plaintext inside the database system.

Coming soon

Azure cache for redis


Azure cognitive search


Section 4.4 - Security on SQL Database/DW


As per the security guideline, every component hosted on Azure should have MFA or two levels of authentication enabled. Most Azure components support MFA, except a few. SQL DW and SQL DB use SQL credentials for connection, which is considered a single level of authentication, i.e. if the SQL credentials are lost or misplaced, anyone holding the credentials can connect to the system directly.

In order to address this security concern, the I&A Platform went with a subnet-based approach in the old foundation design, whereas the same approach was improved in the new foundation design in order to remove the subnet and reduce the cost spent on proxy VMs used as IR and OPDG.

Subnet Approach in Old Foundation Design (applies to both SQL DW and SQL DB)

To Setup the security Mandate


Create Subnet (Responsibility: EC)
Create 2 IAAS VM’s in same Subnet. and provide access to I&A Tech Landscape. (Responsibility:
EC)
For Non-Prod, since we use clustered IR and OPDG, instead of creating a VM in the same subnet there should be a service endpoint defined for the clustered IR/OPDG subnet.
I&A Tech landscape to install IR and OPDG software on the VM (Action: I&A Tech Landscape)
Create SQL Server (Action: I&A Tech Landscape)
Create Service End Point to Attach SQL DW to Subnet (Action : EC)
Whitelist the User workstation Subnet on the SQL DW. (Service end point for Workstation Subnet)
(Action: EC)
Whitelist the RDS service end point on the SQL DW
Register the Gateway (Action : I&A Tech Landscape Team)
Make all ADF connections to SQL DW go through IR (Action : Project Team)
Make All AAS connections to Go through OPDG (Action: Project Team)
Turn the firewall "ON" on SQL DW
Turn "Allow Azure Services" "OFF" on SQL DW
Remove all whitelisting on SQL DW, as connections need to happen either through the workstation VM or RDS, which are whitelisted through service endpoints.

How to Verify
Verify below steps for each environment
Check whether you have IR and OPDG created in your environment ? It should be created in
the same subnet as that of SQL DW.
Check if SQL DW has a service end point defined for the below
Subnet service end point , which allows connection to IR and OPDG
Subnet Service end point , which allows connection to Workstation VM’s
Subnet service end point, which allows connection to RDS/Citrix


Check if your ADF pipelines connecting to SQL DW is using the IR hosted in the same subnet
as SQL DW (Not On Prem IR, It is IAAS IR) . Linked service has to use the Azure IAAS IR to
make connection to SQL DW
Check the AAS refresh is using OPDG to make connection to SQL DW. AAS can connect only
through OPDG hosted in the subnet as that of SQL DW
If any of the above is not taken care of, please work with the respective team and get it corrected (details in the first step).

Non-Subnet Approach in New Foundation Design

To Secure the SQL DW without hosting it in Subnet, below points are taken care as part of the design.

SQL DW is setup as PAAS service.


No user will have contributor access on the SQL DW
The SQL admin account will be discarded after the creation of the SQL service account. The SQL service account will be used for application connections to SQL DW, and its credentials are stored only in Key Vault.
SQL DB service account credentials are stored in Key Vault. No user has access to the SQL DB admin account or to Key Vault.
Key Vault is hosted in the central resource group and no user has access to Key Vault credentials. Key Vault read access is not provided to components from which a user could extract it.
ADF/applications connect to SQL using SPN authentication (a connection sketch is provided after this list).
AAS/applications connect to SQL by reading credentials from Key Vault during deployment. Credentials are not visible to project teams and are applied only through deployment.
DevOps and Admin users will access SQL DW over internet through conditional MFA. Conditional MFA is
applied by enabling the MFA on the AD groups attached on the SQL DW.
Azure SQL DW connection to Azure Services will not be TURNED OFF
SQL DW firewall will be kept “On” , unless project provides proof of enabling MFA for all AD groups on SQL
DW and no direct user is added into SQL DW.
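
For illustration only, the sketch below shows one way an application can connect to SQL DW/DB with SPN (Azure AD) authentication, reading the SPN secret from Key Vault and passing an access token to the ODBC driver instead of using SQL admin credentials. All names are placeholders and this is not the platform's mandated deployment mechanism.

```python
# Illustrative sketch: SPN-based connection to Azure SQL, with the secret read from Key Vault.
import struct
import pyodbc
from azure.identity import ClientSecretCredential, DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# 1. Read the application SPN secret from the central Key Vault (placeholder names).
kv = SecretClient("https://<central-keyvault>.vault.azure.net", DefaultAzureCredential())
spn_secret = kv.get_secret("app-spn-secret").value

# 2. Acquire an Azure AD access token for Azure SQL using the SPN.
credential = ClientSecretCredential("<tenant-id>", "<spn-client-id>", spn_secret)
token = credential.get_token("https://database.windows.net/.default").token

# 3. Pass the token to the ODBC driver (SQL_COPT_SS_ACCESS_TOKEN = 1256).
token_bytes = token.encode("utf-16-le")
token_struct = struct.pack(f"<I{len(token_bytes)}s", len(token_bytes), token_bytes)
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};Server=<server>.database.windows.net;Database=<db>;",
    attrs_before={1256: token_struct})
```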

Refer this link for Design Approval and Security Approval Non Subnet Design for SQL


Section 4.5 - Data Lake Audit


The Security Team audited the Information Technology I&A (Information & Analytics) Data Lakes Platform, which is the foundation of Unilever's Information Technology strategy to underpin Unilever's digital reporting, analytics and artificial intelligence needs. The audit focused on Information Protection (including security controls, supplier assurance, breach notification and user account management) and IT Capability (covering Data Integrity, Architecture and Governance).

JML AUDIT PROCESS

For detailed process refer to JML Audit Process

LANDSCAPE SECURITY PROCESS

Refer the document Azure Central Lake Security Details to understand and adhere to Security process.

LOGGING AND MONITORING

Refer the document Logging & Monitoring to understand and adhere to Logging and Monitoring process.

PENETRATION TESTING

Unilever is using PAAS components such as ADLS, Azure Data Factory, Azure Databricks, Azure Key Vault, Integration Runtime and SQL DB, which are pen tested by Microsoft.

UDL USER ACCESS PROCESS


Refer the document UDL User Access to understand and adhere to UDL User Access process.

Note: This process is laid down by the UDL team in discussion with the TDA process. This is a living document and will keep being updated based on future demands.

AZURE ACCESS MANAGEMENT PROCESS SOP

Refer the document Azure access Process to understand Azure Access Management Process SOP

Note: This process is laid down by the Landscape team in discussion with the TDA process. This is a living document and will keep being updated based on future demands.


JML Audit Process

The purpose of this document is to define a process for on-boarding and off-boarding resources, i.e. granting and revoking access permissions for Azure applications in the UDL, BDL, Product and Experiment environments.

RAISING ACCESS REQUEST - AZURE APPLICATIONS

The responsibility of on-boarding and raising request for environment and its access remains with the Project
Manager and Unilever Delivery Managers.

ENVIRONMENT ACCESS RACI MATRIX

Below table shows the RACI matrix for environment access.

Azure Application Access ROL Delivery Project Landscape Landscape


ES Manager Manager /DevOps /Devops Team
Manager

New Joiner

Create Unilever Account A R

Raise request to add user into A R


Azure Project AD Group

Add the user into AD Group A R S

Mover

Raise request to remove A R


access from old project


Raise a request to remove A R S


access

Raise request to add access to A R


the new project

Add the user into AD Group A R S

Leaver

Raise request to remove A R


access from project AD Group

Raise a request to remove A R S


access

Remove UL Account A R

Request removal of the VSTS A R S


License

Project/ Environment
Decommission

Raise request to decommission A R


project

Remove the project Resource A R S


Groups

Remove the user access from A R S


the AD Group

Remove the AD Access A R S

SUPPORTING DOCUMENTS

1. Refer to the Landscape supporting document "AZURE PLATFORM SERVICE REQUEST MANAGEMENT" to understand the Azure Platform service request management for access requests, landscape requests etc. (Link)
2. The access is given using the MIM Portal by the respective Landscape/DevOps team.

Refer to the supporting document "UDL DEVOPS SERVICE REQUEST AND INCIDENT MANAGEMENT" to understand the UDL service requests and user access requests (Link).
Note: The attached document provides details about the UDL process only.

NOTIFICATION TO REVOKE ACCESS PERMISSIONS FOR ENVIRONMENT

The Azure Landscape team and the respective DevOps team maintain the tracker with the latest data in the format below.
The required data can be extracted from the Azure or MIM portal.
Set up the process to send a notification email to the Unilever Delivery Manager, Vendor Partner Manager and Business Manager every month to review the active members.
User access should be revoked based on the input from the Unilever Delivery Managers.


Note : The details need to be maintained by the team who is providing the access to the MIM portal.

Data Access

The data access should be managed by the Unilever Delivery managers who own the respective environments and
Data Expertise Team.

DATA ACCESS ON UDL/BDL

The data access is governed by the data owners and environment owners. They are responsible to manage the
access and revoke of the data access permissions.

UDL Access: Fill in the data access template and share it with the UDL DevOps team. UDL data access forms can be downloaded from here.
BDL Access: Fill in the data access template and share it with the BDL DevOps team. The BDL data access template can be downloaded from here. The BDL DevOps contact is as below:
Supply Chain: (reach out to BDL DevOps)

DATA ACCESS RACI MATRIX

Below table shows the RACI matrix for data access.

Activity Delivery Project Data Data Landscape Landscape


Manager Manager owner Expertise / DevOps /Devops
Team Manager Team

New User

Create Unilever Account A R

Raise requests with Data Owner A R

Data owner approves the A


addition of User in Respective
AD Group

Raise requests with respective A R


Devops team

Landscape/Devops team adds S


user into AD Group

Mover

Requests removal of the user R A


access from AD Group

Landscape/Devops team S
removes the access

Removal of data access

Request removal of the user A R


access from AD Group


Landscape/Devops removes A S
the access

Periodic Review of data access S A

SPN

New SPN access

Create SPN A R

Raise requests with Data Owner A R C

Data owner approves the A


addition of SPN in Respective
AD Group

Landscape creates the SPN S

Removal of SPN access

Request removal of the SPN R A


access from AD Group

Landscape/Devops team S
removes the access

Periodic Review of data access S A

SUPPORTING DOCUMENTS

Refer to the UDL supporting document "APPLICATION ACCESS CONTROL POLICY – UNIVERSAL DATA LAKE" to understand the access control principles for UDL, and the process to request UDL access for users and applications - link

NOTIFICATION TO REVOKE DATA ACCESS PERMISSIONS

Respective Devops team maintain the Tracker with latest data as mentioned in the below format
The required data can be extracted from the Azure portal.
Setup the process to send the notification email to Data Expertise team, Unilever Delivery Manager, Vendor
Partner Manager every month to review the active members.
The user access should be revoked based on the input from the Data Expertise team and Unilever Delivery
Managers.


Section 5 - Cost Management


Cost on Azure is pay-as-you-go, i.e. you pay only for what you use. In order to plan the budget and manage the cost, the I&A architecture team has come up with certain tools and guidelines to manage the cost well.

Some of which are defined below

Costing Questionnaire: Before an application/product is built, there are certain planning activities required to close the functional and non-functional requirements, on-board a build team, secure the budget etc. In order to secure the budget, a project requires a high-level estimation of its infrastructure, build and support costs based on the functional requirements to be covered in the project. Build and support costs are planned by delivery teams; but because the infrastructure used here consists of Azure cloud PAAS services, the architecture team provides high-level estimates of the infrastructure cost. The costing questionnaire consists of a set of questions to gather information on the functional requirements and scope of the project, which helps the architecture team build a draft architecture and provide a high-level estimate of the cost based on that draft architecture. This is only an HLE derived from the answers provided by the project team, not the actual cost. The cost can differ if the project makes use of the environment in a different manner than claimed in the costing questionnaire.

TDA T-Shirt Calculator: An Excel-based, self-explanatory calculator built to provide a high-level estimate. Projects and delivery teams who are aware of the data size and high-level architecture can use this calculator to derive a high-level cost.

Cost Optimization Recommendations: Though the cost on Azure is pay-as-you-go, if the components are not managed right the cloud can turn out to be an expensive solution. In order to avoid projects spending a lot on their infrastructure, the architecture team has come up with certain guidelines and scripts to optimize the cost.


Section 5.1 - High Level Estimate - Questionnaire

Please note this page is due to be revised soon.

Costing Questionnaire and HLE

A template is available to capture the project inputs required for cost estimates (see Appendix D). The subsequent sections outline how to fill in the template, why the information is required and how it is used to support cost estimates.

However, it is important to understand that the more information is made available, the more accurate the estimate will be.

1. Data Sources

Source System: This is the recognized system of record for the data sources required to support the solution. (Required for design)

Source System Description: Description of the data source (this needs to be common to the source technical team, business, project and I&A teams). (Required for design)

Internal/External: Is the source system on-premise? Is it an external system to UL, managed externally? Is it an external system to UL which will first land data on premise? (Required for design)

Data Source/Interface: Technical name of the data source (interface name). (Required for costing)

No of Data Sources/Interfaces: Number of interfaces (extracts) from each source (both external/internal and on-prem/cloud) for the given frequencies. (Required for costing)

Frequency: This is the refresh frequency - real time, intra-day, daily, weekly, monthly. (Required for costing)

Size (in MB)/Data Volume: The size of the file for each refresh. See the next section for all data volume information required. (Required for costing)

Provide all the details on the data source requirement for the project. Reference Data Sources Tab of Costing
Questionnaire Excel Document

The template columns are: Item, Source System, Source System Description, Internal/External, Transaction/Master Data/Text/Hierarchy, Source/Interface, Frequency, Size (in MB)/data volume. One row (D1, D2, …) is completed per data source/interface.

Totals: this information is used in the determination of the architecture design patterns and confirmation that the source systems have current design patterns which are fully tested and available. The totals derived are: (A) total number of data sources/interfaces; (B1) low frequency = total number of daily/weekly/monthly etc. interfaces; (B2) high frequency = total number of intra-day interfaces; (C) total incremental volume per month.

For high-level estimates carried out pre-Gate 1 in the project life cycle, the estimates can be made using the number of data sources, frequency and size by source system and data type (transaction, master, text and hierarchy).

For Gate 1 and subsequent re-estimation points, the data is collected at a granular level: a record for each data source/interface by data type.

2. Data Volume Details

List all the volumetric details for the sources considered in section above.

D1.1 - 1 Year Data Volume: Total volume of data for the most current rolling full year, by data source/interface. (Required for costing on storage and compute)

D1.2 - History: Number of years of historical data required, with actual data volumes by year and data source captured. (Required for costing on storage and compute)

D1.3 - Total Data Volume for solution go live (Current + History): This is the estimated go-live volume. At Gate 0 a ballpark estimate can be used (a project assumption can be used); at Gate 1 items 1.1 and 1.2 need to be completed. (Required for costing on storage)

D1.4 - Retention required (Current + History + post go live retention period): What is the required archiving strategy; will data be archived and/or deleted, and if not, for how long will data be kept? For example, go-live data volumes could be the current year plus 2 years, but the retention period for the solution post go live could be 10 years. (Required for costing on storage)

D1.5 - Data growth expected per Year: The data volumes for each data set will vary. Some data sets grow at a constant rate, i.e. the volume of data by month is static over time; other data sets either grow or reduce in relative size over time. (Required for costing on storage and compute)

D1.6 - Data in "Hot" storage: Hot data is accessible to the reporting solution and can be accessed by both developed reporting solutions and self-service. The hot data needs to be defined by the layer in the architecture and for the future retention period outlined in point 1.4. (Required for DR design; details to be added in Phase 2 of costing)

D1.7 - Data in "Cold" storage: Data can be stored in low-cost infrastructure which can be accessed by data science functions, but not by the front-end dashboard and self-service tools. This data could be retrieved into the reporting solution on request or accessed utilizing I&A data science functions. Keeping data that is only required in exceptional situations in cold storage offers a cost-effective approach to this requirement. (Required for DR design; details to be added in Phase 2 of costing)

An illustrative calculation combining these inputs is sketched below.

1. Dashboard Reporting

This section covers all the dashboard reporting requirements of the project.

Item | Item Description | Project Input | Required for

R1 | Is there dashboard reporting requirement? | (Yes/No) | Required for design

R2 | Total data volume for dashboard reporting? | If an existing solution and TDEs are available, then the total size of the TDE. | Required for costing of Azure Analysis Services and SQL DW for reporting

R3 | No of Dashboards Planned? | Number of actual workbooks exposed to users, with multiple tabs (as applicable). | Required for design and Dev resource planning

R4 | Maximum reports in any Dashboard? | Number of reports to be built in one dashboard. The dashboard with the maximum number of reports will indicate the complexity of dashboard creation. | Required for design and Dev resource planning

R5 | Maximum volume of records for any report? | Maximum data that can be fetched from across all the dashboards. | Required for design of AAS cubes

R6 | Total dashboard reporting users? | Number of business users with access to run dashboards. | Required for design and Premium capacity

R7 | Total concurrent users accessing the reports? | Number of users accessing the reports at the same time. | Required for costing

R8 | Maximum acceptable response time? | Maximum response time for availability of report data after users run the dashboards. | Required for costing

R9 | SLA for report availability? | Is access to the report required 24x7? If not, by time zone, what are the required availability hours? Can the reporting performance be turned down outside the defined availability period? | Required for costing

Note: The current recommendation is that it should not be assumed that the reporting solution can be turned off out of hours. Historically this has proved impractical; a similar approach was attempted to manage the on-premise Tableau servers. A hard close would result in:

· Reports crashing if running at the close time.

· Critical out-of-office activities being impacted if reporting solutions are not pre-booked to be kept running.

· The Application Management (AM) team requiring access to the environment to evaluate faults, and application development (AD) and DevOps teams needing access to the environment to deploy enhancements and CRs.

· Additionally, UL is a global organization with report users traveling globally, resulting in out-of-hours requirements.

The I&A-Tech Platform will continue to work with projects and business functions on this requirement and ensure opportunities to reduce costs are taken. However, the recommendation stands that it should not be assumed that the reporting solution can be turned off outside core reporting hours.

1. Self Service Reporting

This section covers all the self-service requirements of the project.

Item | Item Description | Project Input | Required for

S1 | Is there Self-Service reporting requirement? | (Yes/No) | Required for design

S3 | Total data volume for Self-Service Reporting? | Overall size (total volume) of data required for self-service slice-and-dice reporting. | Required for costing of SQL DW

S4 | Total Self-Service users? | Number of users doing slice and dice. | Required for costing of Power BI Pro licenses

S5 | Total concurrent users accessing the self-service environment? | Number of users doing slice and dice at the same time. | Required for costing of SQL DW

S6 | Maximum acceptable response time? | Maximum response time for availability of report data after the users run the dashboards. | Required for costing

1. Analytics/ Data Science

This section covers all the Analytics/Data Science requirements of the project.

Item | Item Description | Project Input | Required for

A1 | Is there Data Science / Advanced analytics requirement? | (Yes/No) | Required for design

A2 | Total data volume considered for data science | Overall size (total volume) of data required for data science use cases. | Required for costing of Azure ML/HDInsight

A3 | Total Data Science/Analytics Users | Number of data scientists running models. | Required for design and Azure ML licenses

A4 | Total concurrent users accessing analytical platform | Number of users accessing models at the same time. | Required for design

A5 | No of analytical models planned | Number of analytical models. | Required for design and dev resource capacity

A6 | Is maximum data volume for any analytical model > 10 GB? | If size > 10 GB use HDI, else use ML. | Required for design and costing

A7 | Maximum data volume for any analytical model | Maximum size of the final data set considered across all models. | Required for costing

A8 | Time frame when the analytical/Data Science users are expected to use the system (24/7, 12/5) | Number of hours of usage of the environment. | Required for costing of non-Prod environments, esp. Dev


Section 5.2 - Cost Optimization Methods


Application/Product right practices:

· Reduce the data to 10% in non-Prod environments for UDL, BDL and PDS (mainly Dev and QA), for all components (ADLS, SQL DW, AAS).
· Right-size the environments and components:
  - AAS and SQL DW are the biggest contributors to cost. AAS (< S2) and SQL DW (< 400 DWU) should be used in non-Prod environments.
  - UAT and PPD environments should be up and running only when UAT or performance testing is carried out.
  - For example, consider a sprint of 4 weeks with a release after every 2 sprints. Then:
    8 weeks of sprint (Dev environment)
    2 weeks of testing (QA environment, 15% of Development cost)
    1 week of performance testing (PPD environment, 8% of Development cost)
  - Monitor the environment to verify that it is up only during this limited period.
· Pause all non-Prod environments during weekends and non-working hours:
  - Usage of non-Prod should be minimal or nil during non-office hours.
  - Monitor, or have an approval process for, bringing up components in non-Prod during weekends or non-working hours if the requirement is justified.
· Pause all compute environments after processing is complete or when not in use:
  - Architecture & Landscape has provided a method to pause components (mainly AAS and SQL DW) when not in use, using web hooks.
  - For support activities on SQL DW, use a lower configuration of SQL DW.
  - Use the Azure component utilization report to see when SQL DW and AAS are being used and when they can be paused, and implement pause and resume accordingly.
· Databricks optimization:
  - Implement a timeout of 15-20 minutes for interactive clusters.
  - Implement Databricks cluster optimization: use the right cluster (type and number of nodes) for the job type (small, memory intensive, compute intensive).
  - Reduce the number of Databricks premium instances if premium features are not used.
  - Implement Databricks Delta with partitioning to reduce compute costs (see the sketch after this list).
· SQL DW optimization: implement data distribution, partitions and indexing, and enable statistics, to improve query performance; use improved loading techniques.
· AAS optimization: move only the calculated/aggregated data required for reports. Implement partitions, incremental refresh and calculation groups.
· Set budgets and alerts for project costs over a threshold.
· Migrate to the New Foundation design to completely remove the IR and OPDG for projects (applicable to projects in the old foundation design, i.e. Dublin).
· Review and implement the best practices published for each Azure component by TDA in the solution architecture guidelines.
· Use Log Analytics and alerts for monitoring and alerting; this will help reduce manual monitoring costs.
· Migrate to ADLS Gen2.
· Pause IaaS components using Park My Cloud. Start/stop VMs on a schedule: see the PMC User Guide for instructions.
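As a minimal sketch of the Databricks Delta partitioning recommendation above (the mount paths, partition column and table layout are illustrative assumptions, not a prescribed standard; the interactive cluster timeout itself is set in the cluster configuration, not in notebook code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks an active session already exists

# Read a hypothetical source data set and rewrite it as a Delta table partitioned by a date column,
# so that downstream jobs only scan the partitions they actually need.
df = spark.read.parquet("/mnt/bdl/sales/")          # assumed source path
(df.write
   .format("delta")
   .mode("overwrite")
   .partitionBy("load_date")                        # assumed partition column present in the source
   .save("/mnt/pds/sales_delta/"))                  # assumed target path

Queries that filter on the partition column (here load_date) can then prune partitions, which is where the compute saving typically comes from.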

Platform ongoing activities for cost optimization:

· Reserved instances (work in progress):
  - Reserved capacity implemented for Databricks DBUs and SQL DW, with expected savings of 50% (completed).
  - Implement reserved instances for Databricks VMs and ADLS (work in progress, as standardization of VM types is happening across products).
· Architectural patterns to remove the staging component, which does not add value in the E2E model (in progress):
  - New design patterns implemented to remove SQL DW where the database is not critical, as Databricks can be used as the compute layer.
  - AAS refresh directly from ADLS has been tested and implemented in a few of the projects.
· Shareable clusters for the IaaS components IR & OPDG for non-prod environments (implemented).
· Migrate to the "New Foundation Design" subscription, which no longer needs IR & OPDG due to changes in security controls on SQL DW & resource groups (work in progress).
· Move from HDInsight to Databricks (most of the projects have already migrated).
· Migrate to ADLS Gen2 (Gen2 cost is ~30% less than Gen1 storage cost) and use the Cold, Hot and Archive data tiers of ADLS Gen2.
· Best practice guidelines and trainings: train build teams on the right practices for each component.
· Review the projects which are the highest contributors to the overall cost of the Azure platform for cost improvement opportunities.


Technical Implementation - Cost Optimization

In order to save costs on Azure Analysis Services and Azure Synapse (formerly SQL Data Warehouse), Unilever I&A recommends using the standard processes that are available to all projects. These standards are developed and maintained by landscape and make use of an Azure Automation account. Here is the list of what is supported; further sections describe how projects can implement them:

How to get Usage stats

SQL DW/SYNAPSE ANALYTICS

· Log into the Azure portal.
· Look for the resource group.
· Click on the SQL DW component.
· On the left pane click on Metrics.
· Add the metrics DWU limit and DWU used; the dates can be customised.
· A chart like the one below will give the usage details; the dotted (blue) line is the paused state.
· Figure 1 below shows a DW instance which is not optimized: although the maximum DWU is not consumed, it still runs at maximum capacity all the time.
· Figure 2 below shows an optimally used SQL DW instance; the dotted lines denote pausing of the DW instance.

AAS

· Log into the Azure portal.
· Look for the resource group.
· Click on the AAS component.
· On the left pane click on Metrics.
· Add the metric Memory Limit vs Memory Usage (Max) for 7 days.
· Add another chart with the metric current user sessions (average) for 7 days.
· Figures 1 and 2 show a non-optimized AAS instance which has not been paused during low utilisation.
· Figures 3 and 4 show an optimized AAS instance which has been paused during low utilization time.


· Pause and Resume Azure SQL Data Warehouse
· Pause and Resume Azure Analysis Services
· Resize Azure SQL Data Warehouse
· Resize Azure Analysis Services

Similar implementation is done for processing Azure Analysis Services cubes using TMSL. For details, refer to Web
hooks for AAS cube refreshes.

Automation & Web Hooks

This section covers the use of web hooks for automation and how they are used for reducing costs for projects.

I&A Landscape has access to an Azure Automation account that is used to provide provisioning services to
projects. Access to these services is provided via shared Web Hooks that you can use. Additional services can be
added on request. Note that the runbooks that provide these services use a copy of your application’s parameter
files held in landscape’s private storage account. If you edit your Overrides Parameter File and the new values are
required by a webhook, you must ask landscape to copy the updated file to their storage account.

If you look in your parameter file and search for ‘Webhook’ you will see various entries that help identify the resources for your project that can be managed by webhooks.

If you don’t see the above in your parameter files, ask landscape to regenerate them for your project.

In addition, there are some code samples in your ADF that show how to call them:

These samples are deployed by default in all newly provisioned environments in Amsterdam.

These pipelines only work if you trigger them. Running them in debug mode doesn’t work.

WEBHOOK PARAMETERS

In the code samples you will see that the body of each web request is constructed using a dynamic ADF expression. Two separate formats are supported; note that format 1 (legacy) is a subset of format 2:

Format 1 (Legacy)


@concat(pipeline().DataFactory,',',pipeline().RunId,',','')

This evaluates to a csv string where the first 4 values have a fixed meaning as follows:

Col No | Mandatory | Expression / Value | Notes

1 | Y | pipeline().DataFactory | The name of the data factory

2 | Y | pipeline().RunId | The run id of the pipeline invocation

3 | N | Override parameter file | The name of an override parameter file

4 | N | Call back pipeline | The name of a pipeline in the data factory to trigger once runbook processing is complete

5+ | N | Webhook specific | Depends on the webhook

Purpose of the first 2 columns is apparent from the above table. Here is some detail about the remaining columns:

Webhook Parameter - Column 3

You can control which parameter file override to use. A use case for this could be if you have multiple SQLDW
instances. Since the main parameter file only holds the name of one instance you can use an override file that
contains the name of the second. The webhook will use the values in the override file and so act on the second
SQLDW instance.

Webhook Parameter – Call Back Pipeline

Webhooks run asynchronously and so the ADF web task will complete well before the underlying automation
runbook. In a data processing scenario where ADF needs to resume SQLDW before attempting to process its data,
you can use the call back feature to start your data processing pipeline after SQLDW has fully resumed. This
parameter is supported for all webhooks.

Use format1 for all webhooks except AAS cube processing.
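To make the column semantics above concrete, here is a minimal sketch of what an evaluated format 1 request could look like if posted outside ADF (for example for an ad-hoc test). The webhook URL, data factory name, run id, override file name and callback pipeline name are all placeholders/assumptions, not values defined by this guideline:

import requests

# Placeholder URL; the real shared webhook URLs are provided by the landscape team.
WEBHOOK_URL = "https://<automation-account-webhook-url>"

# Columns: data factory, run id, override parameter file (col 3), call back pipeline (col 4)
body = "my-adf-name,00000000-0000-0000-0000-000000000000,overrides.sqldwtwo.d.12345.json,PL_LOAD_AFTER_RESUME"

response = requests.post(WEBHOOK_URL, data=body)
print(response.status_code)

In ADF itself the same body is produced by the @concat expression shown above, with the override file and callback pipeline appended as the third and fourth CSV values.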

Format 2

@concat('{"csv":"',pipeline().DataFactory,',',pipeline().RunId,',,PL_PROCESS_CUBE_CALLBACK",','"object":',pipeline().parameters.tmslScript,'}')

This evaluates to a JSON string and the CSV string from format 1 appears in a node called ‘csv’. Another node
called ‘object’ is also created and this can be used for anything. Currently, it is only used for cube processing via a
TMSL script (i.e. a json string).
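As an illustrative sketch (the model name, run id and callback pipeline name are assumptions, and the exact TMSL your project needs may differ), the evaluated format 2 body could be built like this:

import json

# Hypothetical TMSL full-refresh script that would be supplied via the 'tmslScript' pipeline parameter.
tmsl_script = {
    "refresh": {
        "type": "full",
        "objects": [{"database": "SalesModel"}]   # model name is an assumption
    }
}

# The 'csv' node carries the format 1 values (empty column 3, callback pipeline in column 4);
# the 'object' node carries the TMSL script itself.
body = {
    "csv": "my-adf-name,00000000-0000-0000-0000-000000000000,,PL_PROCESS_CUBE_CALLBACK",
    "object": tmsl_script,
}
print(json.dumps(body))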

RESIZING AAS WEBHOOK PARAMETERS

When calling this webhook you must pass the sku in csv column 5.
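For example (a sketch only; the data factory name, run id and the S2 sku are assumptions), the body can be built in the same way as the format 1 example above, with the target sku appended as column 5:

# Columns 3 and 4 are left empty; column 5 carries the target AAS sku.
aas_resize_body = "my-adf-name,00000000-0000-0000-0000-000000000000,,,S2"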

RESIZING SQLDB & SQLDW WEBHOOK PARAMETERS


When calling this webhook you must pass the database edition and pricing tier in columns 5 and 6 respectively. In
column 5 pass ‘Datawarehouse’ for SQLDW or empty string for SQLDB.
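Continuing the same sketch (names and pricing tiers below are illustrative assumptions, not mandated values), the bodies carry the edition in column 5 and the pricing tier in column 6:

# Resize a SQL DW / Synapse dedicated pool: edition 'Datawarehouse', tier e.g. DW400c (assumed value).
sqldw_resize_body = "my-adf-name,00000000-0000-0000-0000-000000000000,,,Datawarehouse,DW400c"

# Resize a SQL DB: edition left empty, tier e.g. S3 (assumed value).
sqldb_resize_body = "my-adf-name,00000000-0000-0000-0000-000000000000,,,,S3"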

PAUSE AZURE ANALYSIS SERVICES

If your project requires Azure Analysis Services, you are likely to have one instance per environment (Dev, QA, PROD etc.). You will also have an ADF pipeline that allows you to pause AAS instances. These pipelines have no triggers associated with them, i.e. they are not scheduled to run by default. Each project is expected to implement schedules to run these pipelines according to its needs, keeping in mind that AAS is an expensive component and should remain paused when not in use (a sketch of such a schedule trigger appears at the end of this subsection). The pipeline looks like this:

Notice the URL, Method and Body elements.

URL - All non-prod URLs will be the same - they call the same webhooks. For production, you will have a different URL. If you need it, reach out to the landscape team to get that URL.
Method - should always be ‘POST’


Body - If you only have one instance per environment then you don’t need to change the body element. It’ll
automatically identify your instance for the relevant environment based on your parameters file and will
pause the instance. If you have multiple AAS instances in your environment, you need to specify which
instance to pause. The syntax for it is slightly different as you’ll have to specify the instance name.

As an example, Body for a project that has 2 instances of AAS per environment would look like:

@concat(pipeline().DataFactory,',',pipeline().RunId,',','overrides.aastwo.d.80181.json')
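As noted above, each project schedules these pipelines itself. A minimal sketch of an ADF schedule trigger definition that runs a pause pipeline on weekday evenings is shown below; the trigger name, pipeline name (PL_PAUSE_AAS) and times are assumptions for illustration only:

import json

# Hypothetical ADF schedule trigger: run the pause pipeline at 19:00 UTC, Monday to Friday.
trigger_definition = {
    "name": "TR_PAUSE_AAS_WEEKNIGHTS",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Week",
                "interval": 1,
                "startTime": "2020-08-05T00:00:00Z",
                "timeZone": "UTC",
                "schedule": {
                    "hours": [19],
                    "minutes": [0],
                    "weekDays": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
                }
            }
        },
        "pipelines": [
            {"pipelineReference": {"referenceName": "PL_PAUSE_AAS", "type": "PipelineReference"}}
        ]
    }
}
print(json.dumps(trigger_definition, indent=2))

The same JSON shape can be created directly from the ADF authoring UI; the snippet only illustrates what such a trigger contains.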

RESUME AZURE ANALYSIS SERVICES

If your project requires Azure Analysis Services, you are likely to have one instance per environment (Dev, QA, PROD etc.). You will also have an ADF pipeline that allows you to resume AAS instances. These pipelines have no triggers associated with them, i.e. they are not scheduled to run by default. Each project is expected to implement schedules to run these pipelines according to its needs, keeping in mind that AAS is an expensive component and should remain paused when not in use. The pipeline looks like this:


Notice the URL, Method and Body elements.

URL - All non-prod URLs will be the same - they call the same webhooks. For production, you will have a different URL. If you need it, reach out to the landscape team to get that URL.
Method - should always be ‘POST’
Body - If you only have one instance per environment then you don’t need to change the body element. It’ll
automatically identify your instance for the relevant environment based on your parameters file and will
resume the instance. If you have multiple AAS instances in your environment, you need to specify which
instance to resume. The syntax for it is slightly different as you’ll have to specify the instance name.

As an example, Body for a project that has 2 instances of AAS per environment would look like:

@concat(pipeline().DataFactory,',',pipeline().RunId,',','overrides.aastwo.d.80181.json')

PAUSE AZURE SYNAPSE (FORMERLY SQL DATA WAREHOUSE)

Every time a new project environment is created in I&A managed subscriptions with an instance of Azure Synapse, a pipeline is also created by landscape in ADF that allows the project to pause the SQL DW instance. Similar to the examples above, this pipeline makes use of web hooks and comes without a default trigger. Based on their requirements, projects can trigger this pipeline to pause the SQL DW instance.


The configuration is limited to the same 3 elements: URL, Method and Body. If you have only one instance per environment you do not need to make any changes to the above representation of the pipeline. If your environments have multiple instances of DW (rarely the case), you can make use of the 5th parameter in the Body element as in the above examples. Please note that the URLs for non-prod and prod environments are different; projects can reach out to landscape to obtain these URLs if needed.

RESUME AZURE SYNAPSE (FORMERLY SQL DATA WAREHOUSE)

Similar to the above implementations, your ADF will have ready-made pipelines for resuming Azure Synapse
instances. Here is a representative example of how it will look.

Regular auditing is in place to identify instances that are left running over the weekends and outside business hours.


Section 6 - New Foundation Design - Azure Foundation 2018


What is changing

We are moving from a single-account, single-subscription model to a multi-account, multi-subscription model.

Design Criteria

Access Control


RBAC

Policies

Overall Azure Hierarchy


What is in the HUB?

Common Subscription Patterns


Section 6.1 - Express Route – Setup and Details


ExpressRoute circuits connect on-premises infrastructure to Microsoft through a connectivity provider. The three types of peering are:

Private Peering

Microsoft Peering
Public Peering (Deprecated)

EXPRESSROUTE Details:

ExpressRoute connections do not go over the public Internet, and offer more reliability, faster speeds, lower latencies and higher security than typical connections over the Internet.

An ExpressRoute circuit represents a logical connection between on-premises infrastructure and Microsoft cloud services through a connectivity provider. Multiple ExpressRoute circuits can be set up; each circuit can be in the same or different regions and can be connected to on-premises through different connectivity providers.

ExpressRoute circuits are uniquely identified by a GUID called the service key. The service key is the only piece of information exchanged between the on-premise network, the provider and the Microsoft network.

Each peering is a BGP session.

Each circuit has a fixed bandwidth (50 Mbps, 100 Mbps, 200 Mbps, 500 Mbps, 1 Gbps, 10 Gbps) and is mapped to a connectivity provider and a peering location. Bandwidth can be dynamically scaled without tearing down the network.

The billing model can be picked by the customer: unlimited data, metered data, or the ExpressRoute premium add-on.

ExpressRoute circuits can include two independent peerings: private peering and Microsoft peering. Old ExpressRoute circuits had three peerings: Azure Public, Azure Private and Microsoft. Each peering is a pair of independent BGP sessions, each of them configured redundantly for high availability.


EXPRESSROUTE – Private Peering

Azure compute services, namely virtual machines (IaaS) and cloud services (PaaS), that are deployed within a virtual network can be connected through the private peering domain.

Private peering is considered to be a trusted extension of the core network into Microsoft Azure.

This establishes bi-directional connectivity between the core network and Azure virtual networks (VNets).

Private peering is achieved between the two private networks, i.e. the on-premise private network and the Azure IaaS private network.

Connections to PaaS services which support VNet hosting can be routed through the ExpressRoute circuit.

EXPRESSROUTE – Microsoft Peering

Connectivity to Microsoft online services (Office 365, Dynamics 365, and Azure PaaS services) occurs through Microsoft peering.

Microsoft peering allows access to Microsoft cloud services only over public IP addresses that are owned by the customer or the connectivity provider.

A route filter needs to be defined to allow connections for a particular region and service, with a configuration similar to Allow Azure Services. (For example, if a route filter is enabled for North Europe, all the IPs of North Europe will be whitelisted.)

Connections always originate from the on-premise network and not from the Microsoft network.

Requirements:
· /30 public subnet for the primary link
· /30 public subnet for the secondary link
· /30 advertised subnet
· ASN

Expressroute Unilever Setup


Azure compute services, namely virtual machines (IaaS) and cloud services (PaaS), that are deployed within a virtual network can be connected through the private peering domain. Any service hosted within a private network can make use of the ExpressRoute circuit for connection to the Unilever network.

The ExpressRoute circuit is useful for Unilever in two scenarios:

· Data ingestion from the Unilever Data Center to Azure.

· End user connections to Azure services from the Unilever network.

Data ingestion from the Unilever Data Center to Azure can be made to go through ExpressRoute using private peering, by hosting the Azure IaaS IR within the Azure network and enabling the firewall port to the on-premise network.

End user connections cannot be made through ExpressRoute unless public or Microsoft peering is enabled.


Section 6.2 - I&A Subscription Design

The I&A platform was set up in the original foundation design, which was created in North Europe (Dublin). With the creation of the new foundation design, I&A has started hosting all new products in the new foundation design, and the existing projects are to be migrated to it.

Old Foundation Design:

The old foundation design is set up in North Europe (Dublin). Three subscriptions are hosted in the old foundation design (Prod, Prod 2 and Prod 3) and all three are common subscriptions shared with all IT platforms.

· Prod is the initial subscription created, which was shared with all platforms. When the subscription limits were reached, the new subscriptions Prod 2 and Prod 3 were created.

· I&A Tech is not the owner of any of these subscriptions, as they are shared with multiple platforms.

· I&A is dependent on EC in order to create resource groups or assign permissions, which can be done only with Owner permission.

· New subscriptions are created whenever the limit for a subscription is reached.

· Networking components (VNet, subnet, IaaS) sit as part of each subscription and are shared between the platforms.

Why New Foundation design ?


Advantages of New Foundation Design for I&A:

Migration to the new foundation design is migration to a better managed platform. The new foundation design is available in both the Dublin and Amsterdam regions.

The new foundation design:

· Brings network benefits and fixes security concerns versus the old “Dublin” design.
· The old Dublin design has a single vNet, subnet and subscription architecture, with no segregation between platforms or at the networking level.
· A networking design with no segregation between prod and non-prod is discouraged by InfoSec.
· Data protection mechanisms are not granular in the old Dublin design since it is a single vNet.

New foundation design implemented:

Hub and Spoke Model: each platform has its own subscription and controls its own environment.

Advantages for I&A:

· Faster provisioning of components, as the I&A technology team can create resource groups and IaaS components – full control without dependency on the EC team, and faster delivery.

· Subscription limit issues removed/reduced by segregation of subscriptions per platform, and fewer MS issues.

· Reduction in IaaS infrastructure:
  - No IR or OPDG required.
  - Reduced cost for projects/environments, with an estimated reduction of ~2600 Euros/month per project.
  - No single point of failure and no performance issues due to IR and OPDG.
  - No workstation VMs required: reduced cost; Citrix and DTL connections give quick access using tools hosted in a secure environment.

· Networking improvements:
  - Connections between subscriptions are seamless without any additional components.
  - Connection to on-premise systems through ExpressRoute.
  - UDL, BDL, products and experiments are segregated with logical separation for better management.

Multiple Subscriptions in I&A Platform :

In order to avoid the issues faced in the old foundation design, I&A TDA came up with a new design of hosting multiple subscriptions to segregate UDL, BDL, PDS and experiment environments.

Limitations affecting Dublin Subscriptions are:

· Hard limits:
  - 250 storage accounts per subscription
  - RBAC limit
  - Number of network calls
· Soft limits:
  - Core limit
  - Resource limit

Solution Based on discussion with MS team:

· A static number of subscriptions cannot be kept; new subscriptions are required as and when the limits are reached.

· As per Microsoft, any subscription should ideally have no more than 40 concurrent Databricks clusters, to avoid throttling issues due to the networking calls made.

· One of the best practices is to monitor the limits and keep a threshold; when it is reached, new subscriptions are to be created (see the sketch after this list):
  - Keep threshold alerts at 60% of the hard limit.
  - Stop provisioning new resources when the subscription reaches 80% of the limit.
  - The remaining 20% will be used for scaling the existing solutions in the subscription.
  - Any new applications should be moved to the newly created subscription.
  - Soft limit alerts are to be configured at 80%, in order to increase the limit well in time.
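As a trivial illustration of the thresholds above (the 250 storage account hard limit is quoted earlier in this section; variable names are just for the sketch):

# Hard limit taken from the subscription limits listed above.
HARD_LIMIT_STORAGE_ACCOUNTS = 250

alert_threshold = int(HARD_LIMIT_STORAGE_ACCOUNTS * 0.60)  # 150: raise a threshold alert
stop_threshold = int(HARD_LIMIT_STORAGE_ACCOUNTS * 0.80)   # 200: stop provisioning new resources
print(alert_threshold, stop_threshold)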

Dublin Subscriptions (New Foundation Design)

UDL and BDLs will be hosted in Dublin. Existing products which are in the old foundation design will be migrated to the new foundation design in Dublin.

Amsterdam Subscriptions (New Foundation Design)


Subscription Segregation as UDL, BDL, PDS and Experiment:

UDL SUBSCRIPTIONS:

BDL SUBSCRIPTIONS:


PDS SUBSCRIPTIONS:

EXPERIMENT SUBSCRIPTION:

DISASTER RECOVERY SUBSCRIPTIONS:


New Foundation Design Subscriptions and Usage

Subscription Name Purpose

I&A Subscription-01 AMS Prod

I&A Subscription-02 AMS Pre-Prod and UAT

I&A Subscription-03 AMS QA

I&A Subscription-04 AMS Dev

I&A Subscription-05 UDL Prod

I&A Subscription-06 UDL UAT

I&A Subscription-07 UDL QA

I&A Subscription-08 UDL Dev

I&A Subscription-09 Dublin Prod

I&A Subscription-10 Dublin UAT

I&A Subscription-11 Dublin QA

I&A Subscription-12 Dublin Dev

I&A Subscription-13 Dublin DR

I&A Subscription-14 UDL DR

I&A Subscription-15 BDL DR

I&A Subscription-16 Experimentation

I&A Subscription-17 Amsterdam DR

I&A Subscription-18 BDL AMS Prod

I&A Subscription-19 BDL AMS Pre-prod

I&A Subscription-20 BDL AMS UAT

I&A Subscription-21 BDL AMS QA

I&A Subscription-22 BDL AMS Dev


Section 6.3 - Product migration


OVERVIEW

This document provides guidance on product migration from the old foundation to the new foundation using the Azure move resources feature.

This migration method has been tested and works with the standard I&A stack shown below.

As part of OLD foundation, each product has been deployed using 3 RGs per environment:

APP-RG
DATA-RG
STG-RG

As part of the NEW foundation, each product MUST be deployed using 1 RG per environment. For details on the new foundation design please refer to Section 6 - New Foundation Design - Azure Foundation 2018.

HIGH LEVEL VIEW OF PRODUCT MIGRATION FROM OLD TO NEW FOUNDATION


Product migration to the new foundation is a 2-step process:

1. Step 1: Consolidate all resources from 3 different RGs into a single RG.

2. Step 2: Migrate the consolidated resources from APP-RG to ITSG-RG.

In addition to this, I&A landscape and the product team also need to carry out prep-work. This prep-work enables and readies your product for migration to the new foundation.

PRODUCT MIGRATION STEPS

Product teams SHOULD use the following migration steps as a skeleton plan and draft a detailed plan to complete the migration to the new foundation.

Step | Team | Time Estimation (in minutes)

Prep work (Landscape and project team)

Identify subscription for migration | Landscape | One-time
Review Azure policies for hosting in Dublin | EC | One-time
Automation account in Azure Dublin with appropriate webhooks | Landscape | One-time
Peering/firewall routing between the DTL subscription and other subscriptions | EC | One-time
Azure DTL setup | EC | One-time
Citrix/DTL access for developers | Landscape | 240
Identify integration runtimes and OPDGs used in each project | Landscape |
Validate DTL to AAS connectivity | Landscape |
Identify project team permissions on RG | Landscape |
Create blank RG using turnkey without resources | Landscape |
KeyVault credential copy | Landscape |
Identify required Databricks mount points | Project | 480
Identify hardcoded credentials | Project |
Identify UDL/BDL/PSLZ permissions | Project |
Firewall rules | Project |
Validate Citrix/DTL access | Project |
ADF clean-up | Project |

STEP 1 - Consolidate all resources under APP RG (Landscape team)


Move Storage Account to APP RG | Landscape | 60
Move PDS ADLS to APP RG | Landscape |
Pause and move SQL DW to APP RG | Landscape |
Verify all Azure resources are moved to APP RG | Landscape |

Step 2 - Move to new foundation subscription (Landscape team)

Migrate user group membership | Landscape | 300
Move ADF | Landscape |
Reconfigure linked services in ADF that use IR | Landscape |
Check connectivity for standard linked services | Landscape |
Deploy service pause & resume pipelines for Azure DW, AAS | Landscape |
Update linked service for SQL DW connectivity | Landscape |
  Remove OPDG configuration from the SQL DW linked service | Landscape |
  Update the SQL DW linked service to use SPN | Landscape |
Move Storage Account | Landscape |
Deploy new Databricks workspace | Landscape |
Move PDS ADLS | Landscape |
Move SQL DW | Landscape |
  Pause SQL DW | Landscape |
  Move SQL DW | Landscape |
Remove existing firewall and VNets | Landscape |
Apply New Foundation firewall and VNet rules/settings | Landscape |
Move AAS instance | |
  Pause AAS instance | Landscape |
  Move AAS instance | Landscape |
Generate parameter file | Landscape |
Change permissions on RG/components | Landscape |
Validate all resources are moved | Landscape |

Step 2 - Product Validation (Project team)


Resume SQL Data Warehouse | Project | 480
Refresh AAS cubes using the automation runbook webhook published as a pipeline in ADF | Project |
Developers deploy Databricks notebooks from the code repo | Project |
Developers deploy ADF pipelines from the code repo | Project |
Users and developers validate key application and report functionality | Project |
Developers validate DevOps functionality | Project |

Decommission old product environment

Delete OLD Databricks workspace | Landscape |
Decommission OPDG VM | EC |
Decommission IR VM | EC |


Section 7 - New Tool Evaluation


The Architecture team constantly works on improving the features in the existing tool set by managing the RAID log and working with Microsoft product groups, and also evaluates new tools which can be added to the cloud stack. All new tools being reviewed will be added as part of New Tool Evaluation and the status of each tool will be published to users.

Once a tool is approved for usage within the Azure landscape, it will be moved to the Approved Tools section with design patterns published.


Section 7.1 - Data Share


*P.S.: Security approval is still pending for Data Share; it is to be TDA approved only after security approval.

What is Data Share?

Azure Data Share Preview enables organizations to simply and securely share data with multiple customers and
partners. In just a few clicks, you can provision a new data share account, add datasets, and invite customers and
partners to your data share. Data providers are always in control of the data that they have shared. Azure Data
Share makes it simple to manage and monitor what data was shared, when and by whom.

Key Capabilities

Azure Data Share enables data providers to:

· Share data from Azure Storage and Azure Data Lake Store with customers and partners

· Keep track of who you have shared your data with

· How frequently your data consumers are receiving updates to your data

· Allow your customers to pull the latest version of your data as needed, or allow them to automatically receive
incremental changes to your data at an interval defined by you

Azure Data Share enables data consumers to:

· View a description of the type of data being shared

· View terms of use for the data

· Accept or reject an Azure Data Share invitation

· Trigger a full or incremental snapshot of a Data Share that an organization has shared with you

· Subscribe to a Data Share to receive the latest copy of the data through incremental snapshot copy

· Accept data shared with you into an Azure Blob Storage or Azure Data Lake Gen2 account

How it Works?

Azure Data Share currently offers snapshot-based sharing and in-place sharing.

In snapshot-based sharing, data moves from the data provider's Azure subscription and lands in the data
consumer's Azure subscription. As a data provider, you provision a data share and invite recipients to the data
share. Data consumers receive an invitation to your data share via e-mail. Once a data consumer accepts the
invitation, they can trigger a full snapshot of the data shared with them. This data is received into the data
consumers storage account. Data consumers can receive regular, incremental updates to the data shared with
them so that they always have the latest version of the data.

Data providers can offer their data consumers incremental updates to the data shared with them through a
snapshot schedule. Snapshot schedules are offered on an hourly or a daily basis. When a data consumer accepts
and configures their data share, they can subscribe to a snapshot schedule. This is beneficial in scenarios where
the shared data is updated on a regular basis, and the data consumer needs the most up-to-date data.


When a data consumer accepts a data share, they are able to receive the data in a data store of their choice. For
example, if the data provider shares data using Azure Blob Storage, the data consumer can receive this data in
Azure Data Lake Store. Similarly, if the data provider shares data from an Azure SQL Data Warehouse, the data
consumer can choose whether they want to receive the data into an Azure Data Lake Store, an Azure SQL
Database or an Azure SQL Data Warehouse. In the case of sharing from SQL-based sources, the data consumer
can also choose whether they receive data in parquet or csv.

With in-place sharing, data providers can share data where it resides without copying the data. After a sharing relationship is established through the invitation flow, a symbolic link is created between the data provider's source data store and the data consumer's target data store. The data consumer can read and query the data in real time using its own data store. Changes to the source data store are available to the data consumer immediately. In-place sharing is currently in preview for Azure Data Explorer.

Security

Azure Data Share leverages the underlying security that Azure offers to protect data at rest and in transit. Data is
encrypted at rest, where supported by the underlying data store. Data is also encrypted in transit. Metadata about a
data share is also encrypted at rest and in transit.

Access controls can be set on the Azure Data Share resource level to ensure it is accessed by those that are
authorized.

Azure Data Share leverages Managed Identities for Azure Resources (previously known as MSIs) for automatic
identity management in Azure Active Directory. Managed identities for Azure Resources are leveraged for access to
the data stores that are being used for data sharing. There is no exchange of credentials between a data provider
and a data consumer. For more information, refer to the Managed Identities for Azure Resources page.

Pricing Details


A dataset is the specific data that is to be shared. A dataset can only include resources from one Azure data store.
For example, a dataset can be an Azure Data Lake Storage (“ADLS”) Gen2 file system, an ADLS Gen2 folder, an
ADLS Gen2 file, a blob container, a blob folder, a blob, a SQL table, or a SQL view, etc.

Dataset Snapshot is the operation to move a dataset from its source to a destination.

Snapshot Execution includes the underlying resources to execute movement of a dataset from its source to a
destination.

You may incur network data transfer charges depending where your source and destination are located. Network
prices do not include a preview discount. Refer to the Bandwidth pricing details page for more details.

Currently, the data provider is billed for Dataset Snapshot and Snapshot Execution.


Section 7.2 - Snowflake


Snowflake is a cloud data warehouse available as a market image in Azure Cloud. Snowflake is a software as a
service (SaaS) application, with clear separation for consumers. No hardware (virtual or physical) to select, install,
configure, or manage. There is no software to install, configure, or manage. All ongoing maintenance, management,
and tuning is handled by Snowflake.

· Clone enables branching of a database. It copies metadata at a point in time, so the clone will appear like the original from a user perspective. Users only pay for data storage for deviations from the original. Ideal for testing or if a static copy is required for auditing purposes.
· The live sharing feature enables an account to share data with another Snowflake account. The sharer pays for storage, the recipient pays for any compute they carry out. Sharing across regions is implemented by replicating data across regions and across cloud providers.
· All account types include time-travel in order to “undelete” for 1 day (Standard) or 90 days (Enterprise and above).
· Concurrency: very good compared to other database solutions available, along with good scalability of the product.
· Cost: as the cost is per second, this can turn out to be a cost-optimized solution for large data sets.
· Agility: Snowflake requires none of the maintenance associated with other data warehouses; data is loaded and queried without indexing, partitioning etc. This can enable a more agile working environment.

I&A Proof Of Concept Overview:

Evaluation Criteria: Unilever engaged with Snowflake to carry out testing of Snowflake’s Cloud Data Warehouse.

Technical problem with the existing solution: Currently I&A uses SQL DW and AAS as the source for end-user reporting, and this is turning out to be an expensive solution. The main evaluation here is to see whether Snowflake can replace both SQL DW and AAS combined as the compute for end-user reporting.

Some of the current challenges with SQL DW and AAS:

· AAS has a limitation of 400 GB of data per instance at maximum (S9 instance), which costs a lot.

· Unilever has a business requirement for pre-built reports and self-service for end users using PBI:
  - Pre-built reports are not granular, hence 400 GB of cache is good enough.
  - Self-service needs to be done on granular data, with the expectation of good performance. Holding this huge volume of data in AAS is turning out to be very expensive and is not a feasible solution due to the 400 GB limit.

· Unilever also looked into the option of using SQL DW for self-service, but there are two issues with it:
  - SQL DW requires a minimum configuration of 1000 DWU to get good enough performance, but 1000 DWU costs too much when kept up and running 24/7.
  - SQL DW does not support enough concurrency. Currently only 32 concurrent queries are supported on 1000 DWU Gen2.

· With the introduction/evaluation of Snowflake, the idea was to replace both AAS and SQL DW with Snowflake if Snowflake can give similar performance to AAS:
  - No limit on concurrency on Snowflake.
  - The cost of Snowflake is lower as it is per-second billing and comes with automated scale up and scale down (automated pause and resume).
  - Performance is the main criterion evaluated through this POC.

High Level Design:

Data from Azure can be integrated with the Snowflake database using Blob storage and Snowpipe. ADF cannot be used as an orchestration tool for Snowflake, but Snowflake comes with its own scheduling capability.


PBI has a connector to Snowflake, but currently Snowflake requires an OPDG hosted in an IaaS VM to make the connection possible. A non-gateway solution is being built and will be ready very soon.

Note: Microsoft is working on a non-gateway solution to onboard the new Snowflake ODBC driver into
Power BI Service

POC Configurations:

Snowflake : Medium – Clusters(min: 1, max: 4) running in West Europe (Amsterdam)


SQL DW : (DW200c 1000c Gen2) West Europe (Amsterdam)
AAS : (S2:10 cores) West Europe (Amsterdam)
PBI Gateway: Virtual Machine (Data Gateway): West Europe (Amsterdam)
Standard F8s (8 vcpus, 16 GiB memory) / Standard DS4 v2 (8 vcpus, 28 GiB memory)
64 bit Windows Server 2019
PBI Service / Tenant is hosted in North Europe (Dublin)
Data Set:
Tests were carried out on a representational data set that was made available from one of Unilever’s
functional systems.
Browser used: Firefox/Chrome/IE 11.

Proof Of Concept (POC) Outcome:


ANALYTICS & INTERACTIVE WORKLOAD

Types of scenario workloads:

Interactive: 278,015,760 rows (example: SELECT * FROM table)

Analytics: 278,015,760 rows (example: SELECT SUM(col1) FROM table GROUP BY col2)

The differences between these two warehouses are qualitative and are caused by their design choices:

Snowflake emphasizes ease of use, cost effectiveness and faster compute.

SQL DW emphasizes flexibility and maturity in working with AAS, Power BI and other Microsoft Azure tools.

Conclusion:

Snowflake is approximately twice as performant compared to Azure SQL DW for Interactive and Analytics
workloads

SCENARIO 1 (EXTRACTION OF 1 MILLION ROWS FROM COMPUTE) USING NEW ODBC DRIVER:

Workload: 1,000,000 rows loaded into the Power BI table component in the dashboard. No calculations/aggregations or optimizations were done in the Power BI dashboard.

Conclusion:

Based on the above results it is evident that Power BI performs slightly better with SQL DW as a backend compared to Snowflake, but AAS performs better when the number of concurrent connections is higher.

SCENARIO 2 (CALCULATION AND AGGREGATION ON THE COMPUTE LAYER AND RESULTS EXTRACTED INTO PBI) USING NEW ODBC DRIVER

This dashboard involves a join between two tables (278 million records and 1.3 million records). Materialized views were created in Snowflake on the above tables to derive the data for the dashboard. All the aggregations were done at the Snowflake end; ~50 records were returned to Power BI.


Conclusion:

Snowflake performance is good when all the processing is pushed to the underlying compute. SQL DW and AAS give similar results to Snowflake.

Cost Comparison:

Cost comparison based on the workload used for the pilot (Interactive & Analytics )

Cost comparison on general pricing of components

Costing Calculation :

Snowflake Compute is charged based on the cluster up time, which can be set while creating the warehouse.

Snowflake storage is charged based on compressed data at $23/TB/Month vs Azure at $110.8 /TB/Month

Snowflake is designed to scale up, but more importantly scale down and suspend instantly and automatically.

[Above calculation is based on the snowflake cost details provided from the below site

Reference: https://www.snowflake.com/blog/how-usage-based-pricing-delivers-a-budget-friendly-cloud-data-
warehouse/ ]

SQL DW : DW 1000c/Gen2 - indicative cost : £ 10.83 /Hour; DW 2000c/Gen2 - indicative cost : £ 22.94
/Hour;


AAS S2 indicative cost : £ 3.026/Hour

Snowflake Medium consumes 4 credits which is compared to SQL DW 2000DWU

Conclusion:

Snowflake is comparatively still cheaper than SQL DW, due to auto scale down and auto suspend
capabilities.

Snowflake Storage cost is very cheap compared to SQL Storage cost as it uses ADLS as storage.

Overall POC Conclusion

This product has very good credentials regarding performance, concurrency and simplicity. At this point in time (November 2019) Power BI performs slightly better with AAS as the backend compared to Snowflake.

Snowflake is comparatively a lot cheaper than SQL DW for short runs, as charging is done per second and only for the compute used. If a job runs continuously for an hour, then SQL DW is 50% cheaper than Snowflake.

AAS cost is very similar to that of Snowflake, but AAS has a 400 GB limitation at its highest configuration tier and concurrency limitations, which is not the case with Snowflake.

Snowflake storage cost is very cheap compared to SQL storage cost as it uses ADLS as storage.

Snowflake provides faster queries on top of large data sets with unlimited concurrency. This could be a decision point, as most self-service reporting is limited at the AAS and SQL DW layer.

Microsoft is working on a non-gateway solution to onboard the new Snowflake ODBC driver into Power BI Service, which is expected to be available in December 2019. Unilever may carry out another set of performance testing once the non-gateway solution is available.

Unilever needs to conduct a similar analysis for a production workload / live application before taking the decision on moving to Snowflake.

Refer Snowflake POC Details for more details


Section 7.3 - Synapse Analytics


Synapse Analytics is a fully managed analytics service and integrated platform that accelerates the delivery of BI, AI and intelligent applications for enterprises. Initially called Arcadia, it is also the next generation of Microsoft’s data warehouse product, unifying and simplifying data integration, data warehousing and big data processing capabilities at scale.

What is Synapse Analytics?

Synapse Analytics is a fully managed analytics service that is equipped with E2E tools from ingestion to BI
reporting at one place in a single workspace.

Tools available in Synapse Analytics

ADF – Ingestion

ADLS – Storage

Spark – Compute & Analytics

SQL DW – Data modeling and database

Tools Planned

Power BI
Azure Analysis Services
Azure ML

Benefits to Unilever

Easy Management: Consolidation of components per project under one workspace. E2E Analytics tools
used in Unilever today like ADF, ADLS, Spark, SQL DW will be hosted within a single workspace.

Integrated Dev Ops environment: Integrated development environment with web development tools.
DevOps teams can access and build modules using the web development environment instead of using
specific dev ops tools like VSTS, SSMS etc.

Easy deployment: One place to manage all resources and deploy it from one workspace to another.

Secure Environment: Greater security and access controls at the workspace level. Need not store or share
the credentials of each component with user. Data and compute lies at one place managed through access
controls applied at workspace level. No download and install of dev Ops tools required.

Centralized Monitoring & Alerting: A central space to manage and monitor all the jobs for a workspace.

Additional features: On-demand SQL on top of ADLS, On demand Spark on ADLS. Users can run any
query (Spark or SQL) on ADLS

Customer Value with Synapse Analytics

Workspace
Perform all activities for analytics solution
Secure and manage lifecycle
Pay only for what you need
Data analytics at-scale
Relational and big data processing with Spark and SQL analytics
Serverless and provisioned compute


Batch, streaming, interactive and ML workloads


Elimination of silos
Process data in data lake in various formats
Interoperability between multiple processing engines
Friction free with integrated platform
Unified security model
Eliminate friction in provisioning, billing and other platform capabilities
Single pane of glass for monitoring and management activities
Highly productive collaborative experiences
Web tooling experience for data integration, preparation, analytics and serving
Code-first and code free-experiences
Deeply integrated with Data Lake, Data Governance, Data Sharing, Power BI and Machine Learning

Decision on Synapse Analytics:

The Synapse Analytics vision is great and will be very beneficial for Unilever, but in its current state the product is not completely mature. The only generally available component as of April 2020 is SQL DW Gen2 / the Synapse database; many of the features are still in private preview.

Unilever is one of the customers that conducted a review of the product and provided valuable feedback on improvements to the product. There are a lot of features missing in the tool, and Unilever is working with Microsoft to review the missing features.


Features are not mature: not all features available in ADF and Spark are present in the Synapse Analytics preview. In its current form, a lot of work is required to have a complete integrated DevOps environment; tools like AAS, Power BI and ML tools are missing.

Easy migration: one-click migration from the current process to the new integrated platform is missing, which is needed in order to adopt the tool for existing environments.

Security roles and controls: security controls and roles are not mature enough in Arcadia. The expectation is that connections between each component are managed internally and no credentials need to be shared with the user. Though this is the vision, it is currently not available.

Azure DevOps integration: DevOps integration is planned soon, but is currently not available.

Note: Only the SQL DW database (Synapse database) is approved for usage in Synapse Analytics. Once all the features are reviewed and security approved, the Architecture team will make the product available in the I&A Landscape.
